Himanshu Kukreja

Week 10 — Day 5: Incident Management

System Design Mastery Series — Production Readiness and Operational Excellence


Preface

You've built everything right. SLOs defined. Observability in place. Safe deployments. Capacity planned.

And yet, at 3 AM on a Tuesday, your phone still rings.

THE INCIDENT

3:17 AM. Your phone buzzes.

PAGERDUTY: [CRITICAL] Payment service - Error rate 45%

Your heart rate spikes. You're awake now.

What happens next determines:
├── How long customers are affected
├── How much revenue is lost
├── Whether this becomes a headline
├── Whether you learn and improve
└── Whether it happens again

BAD INCIDENT RESPONSE:
├── Panic
├── Blame
├── Multiple people changing things at once
├── No communication
├── "Fixed" but nobody knows how
└── Happens again in a month

GOOD INCIDENT RESPONSE:
├── Calm, structured approach
├── Clear roles and responsibilities
├── Coordinated troubleshooting
├── Stakeholders informed
├── Root cause identified
├── Systemic fix implemented
└── Lessons shared widely

The difference is not luck. It's preparation.

Today, we learn to handle the inevitable failures — and more importantly, to learn from them.


Part I: Foundations

Chapter 1: Incident Response Fundamentals

1.1 What Is an Incident?

INCIDENT DEFINITION

An incident is an unplanned disruption or degradation of service
that requires an immediate response.

NOT every alert is an incident:
├── Alert: "CPU usage at 75%" → Monitoring, not incident
├── Alert: "Error rate 0.5% (SLO: 1%)" → Warning, not incident
├── Alert: "Error rate 5% (SLO: 1%)" → INCIDENT

SEVERITY LEVELS:

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  SEV 1 - CRITICAL                                                      │
│  ─────────────────                                                     │
│  Complete service outage affecting all users                           │
│  ├── Revenue impact: > $10,000/hour                                    │
│  ├── Response: Immediate, all-hands                                    │
│  ├── Communication: Exec notification, status page                     │
│  └── Example: Checkout completely down                                 │
│                                                                        │
│  SEV 2 - HIGH                                                          │
│  ────────────                                                          │
│  Major feature degraded or subset of users affected                    │
│  ├── Revenue impact: $1,000-$10,000/hour                               │
│  ├── Response: Immediate, on-call + backup                             │
│  ├── Communication: Internal stakeholders                              │
│  └── Example: Payment processing slow (2x latency)                     │ 
│                                                                        │
│  SEV 3 - MEDIUM                                                        │
│  ──────────────                                                        │
│  Minor feature degraded, workaround exists                             │
│  ├── Revenue impact: < $1,000/hour                                     │
│  ├── Response: Business hours OK                                       │
│  ├── Communication: Team notification                                  │
│  └── Example: Order history page slow                                  │
│                                                                        │
│  SEV 4 - LOW                                                           │
│  ───────────                                                           │
│  Minor issue, no user impact                                           │
│  ├── Revenue impact: None                                              │
│  ├── Response: Normal ticket queue                                     │
│  ├── Communication: None required                                      │
│  └── Example: Internal tool slightly degraded                          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
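
Some teams encode this matrix directly in tooling so triage is consistent. A minimal sketch, assuming the responder supplies rough estimates of revenue and user impact (thresholds taken from the table above; all names are illustrative):

# A minimal severity classifier. The thresholds mirror the table above;
# the inputs are hypothetical estimates supplied by whoever triages the alert.
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def classify_severity(revenue_per_hour: float,
                      all_users_affected: bool,
                      user_impact: bool) -> Severity:
    """Map estimated impact to a severity level."""
    if all_users_affected or revenue_per_hour > 10_000:
        return Severity.SEV1   # complete outage / major revenue loss
    if revenue_per_hour >= 1_000:
        return Severity.SEV2   # major feature degraded
    if user_impact:
        return Severity.SEV3   # minor degradation, workaround exists
    return Severity.SEV4       # no user impact, normal ticket queue


# Example: checkout completely down during peak traffic
assert classify_severity(15_000, all_users_affected=True, user_impact=True) == Severity.SEV1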

1.2 Incident Lifecycle

INCIDENT LIFECYCLE

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  1. DETECTION                                                          │
│     ────────────                                                       │
│     How we know something is wrong                                     │
│     ├── Automated alerts (preferred)                                   │
│     ├── Customer reports                                               │
│     └── Internal reports                                               │
│                                                                        │
│                    │                                                   │
│                    ▼                                                   │
│                                                                        │
│  2. TRIAGE                                                             │
│     ──────────                                                         │
│     Assess severity and assign resources                               │
│     ├── What is the impact?                                            │
│     ├── Who is affected?                                               │
│     ├── What's the severity level?                                     │
│     └── Who needs to be involved?                                      │
│                                                                        │
│                    │                                                   │
│                    ▼                                                   │
│                                                                        │
│  3. RESPONSE                                                           │
│     ────────────                                                       │
│     Coordinate the fix                                                 │
│     ├── Assign roles (IC, Comms, Technical)                            │
│     ├── Diagnose the issue                                             │
│     ├── Implement mitigation                                           │
│     └── Communicate status                                             │
│                                                                        │
│                    │                                                   │
│                    ▼                                                   │
│                                                                        │
│  4. MITIGATION                                                         │
│     ──────────────                                                     │
│     Stop the bleeding                                                  │
│     ├── Restore service (rollback, failover, etc.)                     │
│     ├── May not be a complete fix                                      │
│     └── Priority: users over root cause                                │
│                                                                        │
│                    │                                                   │
│                    ▼                                                   │
│                                                                        │
│  5. RESOLUTION                                                         │
│     ──────────────                                                     │
│     Confirm service is restored                                        │
│     ├── Verify metrics are normal                                      │
│     ├── Confirm no ongoing impact                                      │
│     └── Stand down responders                                          │
│                                                                        │
│                    │                                                   │
│                    ▼                                                   │
│                                                                        │
│  6. POST-INCIDENT                                                      │
│     ───────────────                                                    │
│     Learn and improve                                                  │
│     ├── Write postmortem                                               │
│     ├── Identify root cause                                            │
│     ├── Create action items                                            │
│     └── Share learnings                                                │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

KEY METRICS:

TTD (Time to Detect): How long until we knew?
  Target: < 5 minutes for SEV1

TTE (Time to Engage): How long until someone started working?
  Target: < 15 minutes for SEV1

TTM (Time to Mitigate): How long until impact stopped?
  Target: < 1 hour for SEV1

TTR (Time to Resolve): How long until fully fixed?
  May be longer than mitigation (that's OK)
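
These metrics are just timestamp differences; the only subtlety is the baseline (TTD is measured from when impact actually began, the others from detection). A small sketch with hypothetical timestamps:

# Hypothetical incident timestamps; TTD is measured from impact start,
# TTE/TTM/TTR from detection.
from datetime import datetime

def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60

impact_start = datetime(2024, 1, 15, 14, 35)  # when errors actually began
detected_at = datetime(2024, 1, 15, 14, 42)   # alert fired
engaged_at = datetime(2024, 1, 15, 14, 45)    # on-call acknowledged
mitigated_at = datetime(2024, 1, 15, 14, 55)  # rollback stopped the impact
resolved_at = datetime(2024, 1, 15, 15, 15)   # confirmed fully restored

print(f"TTD: {minutes_between(impact_start, detected_at):.0f} min")   # 7
print(f"TTE: {minutes_between(detected_at, engaged_at):.0f} min")     # 3
print(f"TTM: {minutes_between(detected_at, mitigated_at):.0f} min")   # 13
print(f"TTR: {minutes_between(detected_at, resolved_at):.0f} min")    # 33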

1.3 Incident Roles

INCIDENT RESPONSE ROLES

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  INCIDENT COMMANDER (IC)                                               │
│  ────────────────────────                                              │
│  The single point of coordination.                                     │
│                                                                        │
│  Responsibilities:                                                     │
│  ├── Declare incident severity                                         │
│  ├── Coordinate response efforts                                       │
│  ├── Make decisions (or delegate clearly)                              │
│  ├── Ensure communication is happening                                 │
│  ├── Know who's doing what                                             │
│  └── Decide when incident is resolved                                  │
│                                                                        │
│  NOT responsible for:                                                  │
│  ├── Debugging the issue themselves                                    │
│  └── Making all technical decisions                                    │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  TECHNICAL LEAD                                                        │
│  ───────────────                                                       │
│  Leads the technical investigation.                                    │
│                                                                        │
│  Responsibilities:                                                     │
│  ├── Diagnose the root cause                                           │
│  ├── Propose mitigation options                                        │
│  ├── Implement or delegate fixes                                       │
│  ├── Keep IC informed of progress                                      │
│  └── Verify fix is effective                                           │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  COMMUNICATIONS LEAD                                                   │
│  ───────────────────                                                   │
│  Manages stakeholder communication.                                    │
│                                                                        │
│  Responsibilities:                                                     │
│  ├── Update status page                                                │
│  ├── Notify internal stakeholders                                      │
│  ├── Draft customer communications                                     │
│  ├── Keep timeline of events                                           │
│  └── Handle incoming questions                                         │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  SUBJECT MATTER EXPERTS (SMEs)                                         │
│  ──────────────────────────────                                        │
│  Brought in as needed for specific expertise.                          │
│                                                                        │
│  Examples:                                                             │
│  ├── Database expert for DB issues                                     │
│  ├── Payment team for payment issues                                   │
│  ├── Security team for security incidents                              │
│  └── Vendor contacts for third-party issues                            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

For smaller incidents (SEV3/4):
One person may fill multiple roles.

For larger incidents (SEV1/2):
Clear role separation is critical.

Chapter 2: On-Call Best Practices

2.1 On-Call Structure

ON-CALL ROTATION DESIGN

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  ROTATION STRUCTURE                                                    │
│  ──────────────────                                                    │
│                                                                        │
│  PRIMARY ON-CALL                                                       │
│  ├── First responder for all alerts                                    │
│  ├── Available 24/7 during shift                                       │
│  ├── Response time: < 15 minutes                                       │
│  └── Rotation: 1 week typical                                          │
│                                                                        │
│  SECONDARY ON-CALL                                                     │
│  ├── Backup if primary unavailable                                     │
│  ├── Escalation for complex issues                                     │
│  ├── Response time: < 30 minutes                                       │
│  └── Less interrupted (only if needed)                                 │
│                                                                        │
│  MANAGER ESCALATION                                                    │
│  ├── For SEV1 incidents                                                │
│  ├── Business decisions                                                │
│  └── Customer communication approval                                   │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  SUSTAINABLE ON-CALL                                                   │
│  ────────────────────                                                  │
│                                                                        │
│  Rules for healthy on-call:                                            │
│  ├── Minimum 8 people in rotation                                      │
│  │   (Max 1 week in 8 weeks on-call)                                   │
│  ├── No more than 2 pages per shift (target)                           │
│  ├── Compensatory time off after busy shifts                           │
│  ├── No on-call during vacation/PTO                                    │
│  ├── Handoff meetings between shifts                                   │
│  └── Regular review of on-call load                                    │
│                                                                        │
│  WARNING SIGNS:                                                        │
│  ├── Same person on-call every week                                    │
│  ├── 5+ pages per shift                                                │
│  ├── On-call during vacation                                           │
│  ├── Burnout, turnover                                                 │
│  └── People avoiding on-call duties                                    │ 
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
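
A rotation like this is straightforward to generate. A minimal sketch, with a placeholder roster, that offsets the secondary by half the roster so nobody is on back-to-back weeks and that enforces the "minimum 8 people" rule:

# Generate a weekly primary/secondary on-call schedule.
# The roster and start date are placeholders; the "at least 8 people"
# check reflects the sustainability rule above.
from datetime import date, timedelta
from typing import List, Tuple

def weekly_rotation(roster: List[str], start: date, weeks: int) -> List[Tuple[date, str, str]]:
    if len(roster) < 8:
        raise ValueError("Rotation needs at least 8 people to stay sustainable")
    offset = len(roster) // 2  # secondary is half a roster away from primary
    schedule = []
    for week in range(weeks):
        primary = roster[week % len(roster)]
        secondary = roster[(week + offset) % len(roster)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule

roster = ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"]
for week_start, primary, secondary in weekly_rotation(roster, date(2024, 1, 1), 4):
    print(f"{week_start}: primary={primary}, secondary={secondary}")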

2.2 Alert Design for On-Call

# incident/alerting.py

"""
Alert design for effective on-call.

Good alerts are actionable and important.
Bad alerts cause fatigue and get ignored.
"""

from dataclasses import dataclass
from typing import List, Optional
from enum import Enum


class AlertSeverity(Enum):
    PAGE = "page"          # Wake someone up
    NOTIFY = "notify"      # Slack/email during business hours
    LOG = "log"            # Just log it


@dataclass
class AlertDefinition:
    """A well-designed alert."""
    name: str
    description: str
    severity: AlertSeverity
    
    # The query/condition
    condition: str
    
    # How long condition must be true before alerting
    duration: str
    
    # Runbook for responders
    runbook_url: str
    
    # Who to notify
    notify_channel: str
    
    # Tags for filtering
    tags: List[str]


# =============================================================================
# GOOD ALERT EXAMPLES
# =============================================================================

GOOD_ALERTS = [
    # This alert is ACTIONABLE and IMPORTANT
    AlertDefinition(
        name="CheckoutErrorRateHigh",
        description="Checkout error rate exceeds SLO, customers cannot complete purchases",
        severity=AlertSeverity.PAGE,
        condition='sum(rate(checkout_errors_total[5m])) / sum(rate(checkout_requests_total[5m])) > 0.01',
        duration="5m",
        runbook_url="https://wiki/runbooks/checkout-errors",
        notify_channel="payments-oncall",
        tags=["checkout", "revenue", "sev1"]
    ),
    
    # This alert has clear user impact
    AlertDefinition(
        name="APILatencyP99High",
        description="API p99 latency exceeds 1 second, users experiencing slow responses",
        severity=AlertSeverity.PAGE,
        condition='histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1',
        duration="10m",
        runbook_url="https://wiki/runbooks/high-latency",
        notify_channel="api-oncall",
        tags=["latency", "user-experience"]
    ),
    
    # This is a WARNING, not a page
    AlertDefinition(
        name="DiskSpaceLow",
        description="Disk space below 20%, will need attention soon",
        severity=AlertSeverity.NOTIFY,
        condition='(node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.2',
        duration="30m",
        runbook_url="https://wiki/runbooks/disk-space",
        notify_channel="infrastructure",
        tags=["disk", "capacity"]
    ),
]


# =============================================================================
# BAD ALERT EXAMPLES (DON'T DO THIS)
# =============================================================================

BAD_ALERTS = """

❌ BAD: No runbook
   AlertDefinition(
       name="SomethingWrong",
       runbook_url=None  # Responder has no idea what to do
   )

❌ BAD: Too sensitive
   AlertDefinition(
       condition='error_count > 0',  # Any single error pages someone
       duration="0m"  # No buffer for transient issues
   )

❌ BAD: Not actionable
   AlertDefinition(
       name="CPUHigh",
       condition='cpu > 80%',  # So what? What should they do?
       # High CPU might be fine if latency is OK
   )

❌ BAD: Missing context
   AlertDefinition(
       name="Error",
       description="An error occurred"  # Which error? Where? Impact?
   )

❌ BAD: Pages for non-urgent issues
   AlertDefinition(
       name="SlowBackgroundJob",
       severity=AlertSeverity.PAGE,  # This can wait until morning
       description="Nightly report generation is slow"
   )

"""


# =============================================================================
# ALERT REVIEW CHECKLIST
# =============================================================================

ALERT_REVIEW = """
Before adding an alert, ask:

1. IS IT ACTIONABLE?
   □ Is there a runbook?
   □ Can the on-call actually fix it?
   □ Or do they just have to wait?

2. IS IT URGENT?
   □ Does it need to wake someone up?
   □ Or can it wait until morning?
   □ What's the actual user impact?

3. IS IT CLEAR?
   □ Does the name explain the problem?
   □ Does the description explain the impact?
   □ Can someone unfamiliar understand it?

4. IS THE THRESHOLD RIGHT?
   □ Will it fire for real problems?
   □ Will it fire for non-problems? (false positives)
   □ Is the duration appropriate?

5. DOES IT REDUCE NOISE?
   □ Can it be consolidated with similar alerts?
   □ Is it duplicating another alert?
   □ Will it cause alert fatigue?

If you can't answer YES to all of these,
the alert needs more work.
"""

2.3 On-Call Handoff

ON-CALL HANDOFF PROCESS

The handoff between shifts is critical.
Information lost in handoff causes longer incidents.

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  HANDOFF MEETING AGENDA (15-30 minutes)                                │
│  ───────────────────────────────────────                               │
│                                                                        │
│  1. ACTIVE ISSUES (5 min)                                              │
│     ├── Any ongoing incidents?                                         │
│     ├── Any degraded services?                                         │
│     ├── Any issues being monitored?                                    │
│     └── Handoff ownership of active issues                             │
│                                                                        │
│  2. RECENT INCIDENTS (5 min)                                           │
│     ├── What happened this week?                                       │
│     ├── Any patterns or recurring issues?                              │
│     ├── Any postmortems needed?                                        │
│     └── Share context that might help                                  │
│                                                                        │
│  3. UPCOMING RISKS (5 min)                                             │
│     ├── Scheduled deployments?                                         │
│     ├── Maintenance windows?                                           │
│     ├── Expected traffic spikes?                                       │
│     └── Anything to watch for?                                         │
│                                                                        │
│  4. TOOLING/ACCESS CHECK (5 min)                                       │
│     ├── VPN working?                                                   │
│     ├── PagerDuty configured?                                          │
│     ├── Access to dashboards?                                          │
│     └── Runbooks accessible?                                           │
│                                                                        │
│  5. QUESTIONS/CONCERNS (5 min)                                         │
│     ├── Anything unclear?                                              │
│     ├── Any gaps in knowledge?                                         │
│     └── Contact info confirmed?                                        │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

HANDOFF DOCUMENT TEMPLATE:

Week of: [DATE]
Outgoing: [NAME]
Incoming: [NAME]

## Active Issues
- [None / List of issues with status]

## Recent Incidents
- [DATE] SEV2: [Brief description] - [Status: Resolved/Monitoring]

## Upcoming Risks
- [DATE] Deployment of [service]
- [DATE] Marketing campaign expected to increase traffic 2x

## Notes
- [Anything else the incoming person should know]
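
If you prefer to generate the handoff document instead of writing it by hand, here is a small sketch that renders the template above from structured data (file path, field names, and example values are illustrative):

# incident/handoff.py
# Render the handoff template above from structured data.
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffReport:
    week_of: str
    outgoing: str
    incoming: str
    active_issues: List[str] = field(default_factory=list)
    recent_incidents: List[str] = field(default_factory=list)
    upcoming_risks: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

    def render(self) -> str:
        def section(title: str, items: List[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- None"
            return f"## {title}\n{body}\n\n"

        header = f"Week of: {self.week_of}\nOutgoing: {self.outgoing}\nIncoming: {self.incoming}\n\n"
        return (header
                + section("Active Issues", self.active_issues)
                + section("Recent Incidents", self.recent_incidents)
                + section("Upcoming Risks", self.upcoming_risks)
                + section("Notes", self.notes))


print(HandoffReport(
    week_of="2024-01-15",
    outgoing="alice",
    incoming="bob",
    recent_incidents=["2024-01-12 SEV2: Payment latency spike - Resolved"],
    upcoming_risks=["2024-01-17 Deployment of payment-service v2.4"],
).render())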

Chapter 3: Runbooks

3.1 Runbook Structure

RUNBOOK STRUCTURE

A runbook is a step-by-step guide for diagnosing and fixing an issue.
It should be usable by someone at 3 AM who's never seen this issue before.

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  RUNBOOK SECTIONS                                                      │
│                                                                        │
│  1. OVERVIEW                                                           │
│     ├── What is this alert?                                            │
│     ├── What is the user impact?                                       │
│     └── How urgent is it?                                              │
│                                                                        │
│  2. QUICK ASSESSMENT (< 2 minutes)                                     │
│     ├── Dashboard link                                                 │
│     ├── Key metrics to check                                           │
│     └── Is this real or false alarm?                                   │
│                                                                        │
│  3. COMMON CAUSES AND FIXES                                            │
│     ├── Cause 1: [Description]                                         │
│     │   ├── How to identify                                            │
│     │   └── How to fix                                                 │
│     ├── Cause 2: [Description]                                         │
│     │   ├── How to identify                                            │
│     │   └── How to fix                                                 │
│     └── Cause 3: [Description]                                         │
│         ├── How to identify                                            │
│         └── How to fix                                                 │
│                                                                        │
│  4. ESCALATION                                                         │
│     ├── When to escalate                                               │
│     ├── Who to escalate to                                             │
│     └── What information to provide                                    │
│                                                                        │
│  5. COMMUNICATION                                                      │
│     ├── Status page update template                                    │
│     └── Stakeholder notification list                                  │
│                                                                        │
│  6. POST-INCIDENT                                                      │
│     ├── What to document                                               │
│     └── Follow-up actions                                              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

3.2 Example Runbook

# Runbook: High Error Rate on Payment Service

## Overview

**Alert**: PaymentServiceErrorRateHigh
**Severity**: SEV1 (pages immediately)
**User Impact**: Customers cannot complete purchases
**Revenue Impact**: ~$10,000/hour during peak

## Quick Assessment (< 2 min)

1. **Open dashboard**: https://grafana.internal/d/payments
2. **Check error rate graph**: Is it actually elevated or a spike that resolved?
3. **Check scope**: All endpoints or specific one?

Query: sum by (endpoint) (rate(payment_errors_total[5m]))

4. **Check timeline**: When did it start? Any correlation with deployments?

If error rate is < 1% and stable, this may be a false alarm. Monitor for 5 min.

## Common Causes and Fixes

### Cause 1: Recent Deployment

**How to identify**:
- Check deployment history: https://deploy.internal/payments
- Did error rate spike after a deployment?

**How to fix**:
```bash
# Rollback to previous version
kubectl rollout undo deployment/payment-service -n production

# Verify rollback
kubectl rollout status deployment/payment-service -n production
```

**Expected result**: Error rate should drop within 2-3 minutes.

### Cause 2: Database Connection Pool Exhausted

**How to identify**:
- Check DB connection metrics in dashboard
- Look for "connection refused" or "timeout" in logs:

```bash
kubectl logs -l app=payment-service -n production | grep -i "connection"
```

**How to fix**:

1. Scale up payment service (reduces load per pod):

```bash
kubectl scale deployment/payment-service --replicas=10 -n production
```

2. If that doesn't help, check if slow queries are holding connections:

```sql
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';
```

3. Kill long-running queries if safe:

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '2 minutes'
  AND query NOT LIKE '%pg_stat%';
```

### Cause 3: Payment Provider Outage

**How to identify**:
- Check payment provider status page: https://status.stripe.com
- Check external API latency in dashboard
- Look for timeout errors in logs

**How to fix**:

1. If provider has fallback, enable it:

```bash
# Enable Adyen as fallback
kubectl set env deployment/payment-service PAYMENT_FALLBACK_ENABLED=true
```

2. If no fallback, consider enabling maintenance mode:

```bash
# Enable maintenance mode for checkout
kubectl set env deployment/api-gateway CHECKOUT_MAINTENANCE=true
```

3. Update status page with provider attribution.

### Cause 4: Traffic Spike Overwhelming System

**How to identify**:
- Check request rate: is it much higher than normal?
- Check if auto-scaling is running but not keeping up

**How to fix**:

1. Manually scale up:

```bash
kubectl scale deployment/payment-service --replicas=20 -n production
```

2. Enable rate limiting if needed:

```bash
kubectl set env deployment/api-gateway RATE_LIMIT_ENABLED=true
```

## Escalation

**Escalate if**:
- Issue not resolved within 15 minutes
- Root cause unclear
- Database expertise needed
- Customer communication needed

**Escalate to**:
- Primary: @payments-team-lead (Slack)
- Secondary: @platform-lead
- SEV1 escalation: Page secondary on-call

**Information to provide**:
- Current error rate and duration
- Steps already taken
- Suspected cause (if any)
- Links to relevant logs/dashboards

## Communication

**Status page update** (if the incident lasts > 10 minutes):

Title: Payment Processing Degraded
Body: Some customers may experience errors when completing purchases.
Our team is actively investigating. We will provide updates every 15 minutes.

**Internal notification**:
- Slack: #incidents
- Include: Error rate, start time, impact, current status

## Post-Incident

- Document timeline in incident channel
- Create postmortem if SEV1/SEV2
- Update runbook if new cause found
- Create follow-up tickets for improvements

Chapter 4: Postmortems

4.1 The Blameless Postmortem

BLAMELESS POSTMORTEM PHILOSOPHY

THE GOAL OF A POSTMORTEM:
Learn and improve, not punish.

BLAMELESS means:
├── People are not the root cause.
│   Systems that allow human error are the cause.
├── Ask "Why did the system allow this to happen?"
│   Not "Why did this person make a mistake?"
├── Everyone operated with good intentions,
│   given what they knew at the time.
└── Focus on systemic fixes,
    not "person X should be more careful."

WHY BLAMELESS?

If people fear blame:
├── They hide mistakes
├── Incidents go unreported
├── Near-misses aren't shared
├── Root causes stay hidden
└── The same incidents happen again

If people feel safe:
├── They report issues early
├── They share near-misses
├── Root causes are found
├── Systems get better
└── Incidents are prevented

EXAMPLE:

❌ BLAME: "The engineer deployed without testing"

✓ BLAMELESS: "Our deployment process allowed untested code
  to reach production. We need automated test gates."

The fix is not "tell engineers to test more."
The fix is "make it impossible to deploy without tests."


4.2 Postmortem Template

```markdown
# Postmortem: [Incident Title]

## Incident Summary

| Field | Value |
|-------|-------|
| **Date** | YYYY-MM-DD |
| **Duration** | X hours Y minutes |
| **Severity** | SEV1 / SEV2 / SEV3 |
| **Authors** | [Names] |
| **Status** | Draft / In Review / Complete |

### One-Line Summary
[One sentence describing what happened and the impact]

### Impact
- **User Impact**: [What users experienced]
- **Duration**: [How long they were impacted]
- **Scope**: [How many users affected]
- **Revenue Impact**: [$X estimated]
- **Data Impact**: [Any data loss/corruption]

---

## Timeline (All times in UTC)

| Time | Event |
|------|-------|
| 14:30 | Deployment of v2.3.1 to production |
| 14:35 | Error rate begins increasing |
| 14:42 | **DETECTED**: Alert fires for high error rate |
| 14:45 | **ENGAGED**: On-call acknowledges and begins investigation |
| 14:50 | Identified correlation with recent deployment |
| 14:55 | **MITIGATED**: Rollback initiated |
| 15:00 | Error rate returning to normal |
| 15:15 | **RESOLVED**: Confirmed service fully restored |

**Key Metrics**:
- Time to Detect (TTD): 7 minutes
- Time to Engage (TTE): 3 minutes
- Time to Mitigate (TTM): 13 minutes
- Total Incident Duration: 25 minutes

---

## Root Cause Analysis

### What Happened
[Detailed technical explanation of what went wrong]

### Why It Happened (5 Whys)

1. **Why** did the service start returning errors?
   - Because database queries were timing out.

2. **Why** were queries timing out?
   - Because the new code introduced N+1 queries.

3. **Why** did the new code have N+1 queries?
   - Because the ORM change wasn't caught in code review.

4. **Why** wasn't it caught in code review?
   - Because we don't have automated detection for N+1 queries.

5. **Why** don't we have automated detection?
   - Because we haven't prioritized query analysis tooling.

**Root Cause**: Lack of automated N+1 query detection in our CI/CD pipeline.

### Contributing Factors
- Code review didn't catch the ORM change impact
- Load testing was skipped due to time pressure
- Staging environment doesn't have production-scale data

---

## What Went Well
- Alert fired quickly (7 min detection)
- On-call responded promptly
- Rollback procedure worked smoothly
- Communication to stakeholders was timely
- No data loss or corruption

## What Went Poorly
- N+1 query issue wasn't caught before production
- Took 10 minutes to identify the root cause
- No automated canary analysis would have caught this
- Staging data is not representative of production

---

## Action Items

| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | Add N+1 query detection to CI | @alice | P1 | 2024-02-01 | TODO |
| 2 | Implement canary analysis for deployments | @bob | P1 | 2024-02-15 | TODO |
| 3 | Create production-like dataset for staging | @charlie | P2 | 2024-02-28 | TODO |
| 4 | Add ORM query patterns to code review checklist | @alice | P2 | 2024-02-01 | TODO |
| 5 | Document N+1 query patterns in eng wiki | @alice | P3 | 2024-02-07 | TODO |

---

## Lessons Learned

1. **Our deployment pipeline needs more automated checks**
   - We should catch performance regressions before production

2. **Rollback speed is critical**
   - 5-minute rollback time significantly reduced impact

3. **Staging != Production**
   - Data scale differences hide performance issues

---

## Appendix

### Relevant Logs

[Link to log search for incident timeframe]


### Dashboards
- [Link to metrics dashboard during incident]

### Related Incidents
- [Link to similar past incidents if any]
```

4.3 The Five Whys Technique

THE FIVE WHYS

A technique for finding root causes by asking "why" repeatedly.

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  EXAMPLE 1: Website Outage                                             │
│                                                                        │
│  Problem: Website went down                                            │
│                                                                        │
│  1. Why did the website go down?                                       │
│     → The web server crashed.                                          │
│                                                                        │
│  2. Why did the web server crash?                                      │
│     → It ran out of memory.                                            │
│                                                                        │
│  3. Why did it run out of memory?                                      │
│     → A memory leak accumulated over time.                             │
│                                                                        │
│  4. Why wasn't the memory leak caught?                                 │
│     → We don't have memory profiling in our test suite.                │
│                                                                        │
│  5. Why don't we have memory profiling?                                │
│     → We never prioritized performance testing tooling.                │
│                                                                        │
│  ROOT CAUSE: Lack of performance testing tooling                       │
│  FIX: Add memory profiling to CI/CD pipeline                           │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  EXAMPLE 2: Data Corruption                                            │
│                                                                        │
│  Problem: Customer orders showed wrong prices                          │
│                                                                        │
│  1. Why were prices wrong?                                             │
│     → The price update script had a bug.                               │
│                                                                        │
│  2. Why did the buggy script run in production?                        │
│     → It wasn't tested with production-like data.                      │
│                                                                        │
│  3. Why wasn't it tested with production-like data?                    │
│     → Our staging database is much smaller.                            │
│                                                                        │
│  4. Why is staging database smaller?                                   │
│     → We don't have a process to sync production data.                 │
│                                                                        │
│  5. Why don't we sync production data?                                 │
│     → Privacy concerns haven't been addressed.                         │
│                                                                        │
│  ROOT CAUSE: No process for anonymizing and syncing prod data          │
│  FIX: Create data anonymization pipeline for staging                   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

TIPS FOR FIVE WHYS:

1. Don't stop at human error
   "Engineer made a mistake" is never the final answer
   
2. Look for systemic issues
   "Process allowed X to happen" is better than "person did X"
   
3. It's OK if it's not exactly 5
   Sometimes it's 3, sometimes it's 7

4. There may be multiple root causes
   Branch the analysis if needed

5. End when you reach something you can fix
   And something that will prevent recurrence
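
Capturing the chain as data makes it easy to drop into the postmortem template. A minimal sketch (the class and field names are illustrative):

# Represent a five-whys chain as data so it can be rendered into a postmortem.
from dataclasses import dataclass
from typing import List


@dataclass
class WhysChain:
    problem: str
    whys: List[str]   # each entry answers "why?" about the previous one
    root_cause: str

    def to_markdown(self) -> str:
        lines = [f"Problem: {self.problem}", ""]
        for i, answer in enumerate(self.whys, start=1):
            lines.append(f"{i}. Why? -> {answer}")
        lines += ["", f"Root cause: {self.root_cause}"]
        return "\n".join(lines)


outage = WhysChain(
    problem="Website went down",
    whys=[
        "The web server crashed.",
        "It ran out of memory.",
        "A memory leak accumulated over time.",
        "We don't have memory profiling in our test suite.",
        "We never prioritized performance testing tooling.",
    ],
    root_cause="Lack of performance testing tooling",
)
print(outage.to_markdown())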

Part II: Implementation

Chapter 5: Incident Management System

5.1 Incident Workflow Implementation

# incident/workflow.py

"""
Incident management workflow system.

Coordinates incident response from detection to resolution.
"""

from dataclasses import dataclass, field
from typing import List, Optional, Dict
from datetime import datetime
from enum import Enum
import uuid


class IncidentStatus(Enum):
    DETECTED = "detected"
    TRIAGING = "triaging"
    INVESTIGATING = "investigating"
    MITIGATING = "mitigating"
    RESOLVED = "resolved"
    CLOSED = "closed"


class IncidentSeverity(Enum):
    SEV1 = 1  # Critical - all hands
    SEV2 = 2  # High - immediate attention
    SEV3 = 3  # Medium - business hours
    SEV4 = 4  # Low - normal queue


@dataclass
class IncidentRole:
    """A role assignment in an incident."""
    role: str  # "incident_commander", "technical_lead", "comms_lead"
    user_id: str
    assigned_at: datetime


@dataclass
class TimelineEvent:
    """An event in the incident timeline."""
    timestamp: datetime
    event_type: str  # "status_change", "action", "communication", "note"
    description: str
    user_id: str
    metadata: Dict = field(default_factory=dict)


@dataclass
class Incident:
    """An incident record."""
    id: str
    title: str
    description: str
    severity: IncidentSeverity
    status: IncidentStatus
    
    # Timing
    detected_at: datetime
    engaged_at: Optional[datetime] = None
    mitigated_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    
    # People
    roles: List[IncidentRole] = field(default_factory=list)
    
    # Timeline
    timeline: List[TimelineEvent] = field(default_factory=list)
    
    # Affected services
    affected_services: List[str] = field(default_factory=list)
    
    # Links
    slack_channel: Optional[str] = None
    video_call_link: Optional[str] = None
    postmortem_link: Optional[str] = None
    
    # Metadata
    tags: List[str] = field(default_factory=list)
    customer_impact: Optional[str] = None
    
    @property
    def time_to_detect(self) -> Optional[float]:
        """Minutes from actual issue start to detection."""
        # This would require knowing when issue actually started
        return None
    
    @property
    def time_to_engage(self) -> Optional[float]:
        """Minutes from detection to first responder."""
        if self.engaged_at:
            return (self.engaged_at - self.detected_at).total_seconds() / 60
        return None
    
    @property
    def time_to_mitigate(self) -> Optional[float]:
        """Minutes from detection to impact mitigation."""
        if self.mitigated_at:
            return (self.mitigated_at - self.detected_at).total_seconds() / 60
        return None
    
    @property
    def time_to_resolve(self) -> Optional[float]:
        """Minutes from detection to full resolution."""
        if self.resolved_at:
            return (self.resolved_at - self.detected_at).total_seconds() / 60
        return None


class IncidentManager:
    """
    Manages incident lifecycle and coordination.
    """
    
    def __init__(
        self,
        slack_client,
        pagerduty_client,
        status_page_client,
        database
    ):
        self.slack = slack_client
        self.pagerduty = pagerduty_client
        self.status_page = status_page_client
        self.db = database
    
    async def create_incident(
        self,
        title: str,
        description: str,
        severity: IncidentSeverity,
        affected_services: List[str],
        triggered_by: str
    ) -> Incident:
        """
        Create a new incident and initiate response.
        """
        incident = Incident(
            id=str(uuid.uuid4()),
            title=title,
            description=description,
            severity=severity,
            status=IncidentStatus.DETECTED,
            detected_at=datetime.utcnow(),
            affected_services=affected_services
        )
        
        # Add creation event
        incident.timeline.append(TimelineEvent(
            timestamp=datetime.utcnow(),
            event_type="status_change",
            description=f"Incident created: {title}",
            user_id=triggered_by
        ))
        
        # Create incident channel
        channel = await self._create_incident_channel(incident)
        incident.slack_channel = channel
        
        # Page on-call if SEV1 or SEV2
        if severity in [IncidentSeverity.SEV1, IncidentSeverity.SEV2]:
            await self._page_on_call(incident)
        
        # Update status page if SEV1
        if severity == IncidentSeverity.SEV1:
            await self._create_status_page_incident(incident)
        
        # Save to database
        await self.db.save_incident(incident)
        
        return incident
    
    async def assign_role(
        self,
        incident_id: str,
        role: str,
        user_id: str,
        assigned_by: str
    ):
        """Assign a role to a user."""
        incident = await self.db.get_incident(incident_id)
        
        # Remove existing role assignment if any
        incident.roles = [r for r in incident.roles if r.role != role]
        
        # Add new assignment
        incident.roles.append(IncidentRole(
            role=role,
            user_id=user_id,
            assigned_at=datetime.utcnow()
        ))
        
        incident.timeline.append(TimelineEvent(
            timestamp=datetime.utcnow(),
            event_type="action",
            description=f"Assigned {role} to {user_id}",
            user_id=assigned_by
        ))
        
        await self.db.save_incident(incident)
        await self._notify_channel(
            incident.slack_channel,
            f"🎯 {user_id} is now {role}"
        )
    
    async def update_status(
        self,
        incident_id: str,
        new_status: IncidentStatus,
        message: str,
        user_id: str
    ):
        """Update incident status."""
        incident = await self.db.get_incident(incident_id)
        old_status = incident.status
        incident.status = new_status
        
        # Update timing fields
        now = datetime.utcnow()
        if new_status == IncidentStatus.INVESTIGATING and not incident.engaged_at:
            incident.engaged_at = now
        elif new_status == IncidentStatus.MITIGATING and not incident.mitigated_at:
            incident.mitigated_at = now
        elif new_status == IncidentStatus.RESOLVED and not incident.resolved_at:
            incident.resolved_at = now
        
        # Add timeline event
        incident.timeline.append(TimelineEvent(
            timestamp=now,
            event_type="status_change",
            description=f"Status: {old_status.value} → {new_status.value}. {message}",
            user_id=user_id
        ))
        
        await self.db.save_incident(incident)
        
        # Notify channel
        status_emoji = {
            IncidentStatus.INVESTIGATING: "🔍",
            IncidentStatus.MITIGATING: "🔧",
            IncidentStatus.RESOLVED: "✅",
        }.get(new_status, "📋")
        
        await self._notify_channel(
            incident.slack_channel,
            f"{status_emoji} Status update: {new_status.value}\n> {message}"
        )
        
        # Update status page
        if incident.severity == IncidentSeverity.SEV1:
            await self._update_status_page(incident, message)
    
    async def add_note(
        self,
        incident_id: str,
        note: str,
        user_id: str
    ):
        """Add a note to the incident timeline."""
        incident = await self.db.get_incident(incident_id)
        
        incident.timeline.append(TimelineEvent(
            timestamp=datetime.utcnow(),
            event_type="note",
            description=note,
            user_id=user_id
        ))
        
        await self.db.save_incident(incident)
        await self._notify_channel(
            incident.slack_channel,
            f"📝 Note from {user_id}: {note}"
        )
    
    async def resolve_incident(
        self,
        incident_id: str,
        resolution_summary: str,
        user_id: str
    ):
        """Resolve the incident."""
        await self.update_status(
            incident_id,
            IncidentStatus.RESOLVED,
            resolution_summary,
            user_id
        )
        
        incident = await self.db.get_incident(incident_id)
        
        # Generate summary
        summary = f"""
🎉 Incident Resolved

**Duration**: {incident.time_to_resolve:.0f} minutes
**Time to Engage**: {incident.time_to_engage:.0f} minutes
**Time to Mitigate**: {incident.time_to_mitigate:.0f} minutes

**Resolution**: {resolution_summary}

Next steps:
1. Schedule postmortem (within 48 hours)
2. Complete any follow-up actions
3. Update documentation if needed
"""
        
        await self._notify_channel(incident.slack_channel, summary)
        
        # Close status page incident
        if incident.severity == IncidentSeverity.SEV1:
            await self._close_status_page_incident(incident)
    
    async def generate_postmortem_draft(
        self,
        incident_id: str
    ) -> str:
        """Generate a postmortem template from incident data."""
        incident = await self.db.get_incident(incident_id)
        
        # Format timeline
        timeline_table = "| Time | Event |\n|------|-------|\n"
        for event in incident.timeline:
            time_str = event.timestamp.strftime("%H:%M UTC")
            timeline_table += f"| {time_str} | {event.description} |\n"
        
        template = f"""
# Postmortem: {incident.title}

## Incident Summary

| Field | Value |
|-------|-------|
| **Date** | {incident.detected_at.strftime("%Y-%m-%d")} |
| **Duration** | {incident.time_to_resolve:.0f} minutes |
| **Severity** | SEV{incident.severity.value} |
| **Status** | Draft |

### One-Line Summary
[TODO: One sentence describing what happened and the impact]

### Impact
- **User Impact**: [TODO]
- **Duration**: {incident.time_to_resolve:.0f} minutes
- **Scope**: [TODO]

---

## Timeline (All times in UTC)

{timeline_table}

**Key Metrics**:
- Time to Engage (TTE): {incident.time_to_engage:.0f} minutes
- Time to Mitigate (TTM): {incident.time_to_mitigate:.0f} minutes
- Total Incident Duration: {incident.time_to_resolve:.0f} minutes

---

## Root Cause Analysis

### What Happened
[TODO: Detailed technical explanation]

### Why It Happened (5 Whys)
[TODO: Apply 5 whys technique]

---

## Action Items

| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [TODO] | [TODO] | P1 | [TODO] | TODO |

---
"""
        return template
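
A hypothetical end-to-end use of the manager above, assuming the Slack, PagerDuty, status page, and database clients are wired up elsewhere:

# Hypothetical usage of IncidentManager. Only the call sequence is shown;
# the four client objects are assumed to be constructed elsewhere.
async def run_payment_incident(manager: IncidentManager):
    incident = await manager.create_incident(
        title="Payment service error rate 45%",
        description="Checkout errors spiking after 14:30 deploy",
        severity=IncidentSeverity.SEV1,
        affected_services=["payment-service"],
        triggered_by="alertmanager",
    )

    # Assign the three core roles
    await manager.assign_role(incident.id, "incident_commander", "alice", assigned_by="alice")
    await manager.assign_role(incident.id, "technical_lead", "bob", assigned_by="alice")
    await manager.assign_role(incident.id, "comms_lead", "charlie", assigned_by="alice")

    # Walk the lifecycle: investigate -> mitigate -> resolve
    await manager.update_status(incident.id, IncidentStatus.INVESTIGATING,
                                "Correlated with 14:30 deploy", user_id="bob")
    await manager.update_status(incident.id, IncidentStatus.MITIGATING,
                                "Rolling back payment-service", user_id="bob")
    await manager.resolve_incident(incident.id,
                                   "Rollback complete, error rate back to baseline",
                                   user_id="alice")

    # Draft the postmortem from the recorded timeline
    print(await manager.generate_postmortem_draft(incident.id))

# asyncio.run(run_payment_incident(manager))  # after wiring up real clients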

Chapter 6: Communication During Incidents

6.1 Internal Communication

INTERNAL COMMUNICATION DURING INCIDENTS

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  INCIDENT CHANNEL STRUCTURE                                            │
│                                                                        │
│  Create a dedicated Slack channel: #inc-YYYY-MM-DD-short-description   │
│                                                                        │
│  Example: #inc-2024-01-15-payment-errors                               │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  CHANNEL TOPIC (set immediately):                                      │
│                                                                        │
│  "SEV1: Payment errors affecting checkout                              │
│   IC: @alice | Tech: @bob | Comms: @charlie                            │
│   Status: Investigating | Started: 14:42 UTC"                          │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  UPDATE CADENCE:                                                       │
│                                                                        │
│  SEV1: Every 15 minutes minimum                                        │
│  SEV2: Every 30 minutes minimum                                        │
│  SEV3: Every hour or at major milestones                               │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  UPDATE FORMAT:                                                        │
│                                                                        │
│  [14:55] 🔍 Status: Investigating                                      │
│  - Identified correlation with deployment at 14:30                     │
│  - Error rate: 5% (down from 8%)                                       │
│  - Working on: Rollback decision                                       │
│  - Next update: 15:10                                                  │
│                                                                        │
│  [15:10] 🔧 Status: Mitigating                                         │
│  - Rollback initiated at 15:05                                         │
│  - Error rate: 2% (decreasing)                                         │
│  - ETA to resolution: 10 minutes                                       │
│  - Next update: 15:20 or when resolved                                 │
│                                                                        │
│  [15:18] ✅ Status: Resolved                                           │
│  - Rollback complete                                                   │
│  - Error rate: 0.5% (normal)                                           │
│  - Duration: 36 minutes                                                │
│  - Postmortem to be scheduled                                          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
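
A small helper (names are illustrative, not part of the workflow code above) can produce updates in exactly this shape so every responder posts the same format:

# Format an incident update in the shape shown above.
from datetime import datetime, timezone
from typing import List, Optional

STATUS_EMOJI = {"investigating": "🔍", "mitigating": "🔧", "resolved": "✅"}

def format_status_update(status: str,
                         bullets: List[str],
                         next_update: Optional[str] = None) -> str:
    now = datetime.now(timezone.utc).strftime("%H:%M")
    lines = [f"[{now}] {STATUS_EMOJI.get(status, '📋')} Status: {status.capitalize()}"]
    lines += [f"- {b}" for b in bullets]
    if next_update:
        lines.append(f"- Next update: {next_update}")
    return "\n".join(lines)

print(format_status_update(
    "mitigating",
    ["Rollback initiated at 15:05", "Error rate: 2% (decreasing)", "ETA to resolution: 10 minutes"],
    next_update="15:20 or when resolved",
))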

6.2 External Communication

EXTERNAL COMMUNICATION (STATUS PAGE)

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  PRINCIPLES:                                                           │
│                                                                        │
│  1. BE HONEST                                                          │
│     Don't hide issues. Customers know anyway.                          │
│     Transparency builds trust.                                         │
│                                                                        │
│  2. BE TIMELY                                                          │
│     First update within 10 minutes of detection.                       │
│     Updates every 15-30 minutes during incident.                       │
│                                                                        │
│  3. BE CLEAR                                                           │
│     Plain language, not technical jargon.                              │
│     Focus on impact to users.                                          │
│                                                                        │
│  4. BE ACCOUNTABLE                                                     │
│     Don't blame third parties.                                         │
│     Own the customer experience.                                       │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  STATUS PAGE UPDATE TEMPLATES:                                         │
│                                                                        │
│  INVESTIGATING:                                                        │
│  "We are investigating reports of [issue]. Some customers may          │
│   experience [impact]. We will provide updates every 15 minutes."      │
│                                                                        │
│  IDENTIFIED:                                                           │
│  "We have identified the cause of [issue] and are working on           │
│   a fix. [Impact] is ongoing. We expect to resolve this within         │
│   [timeframe]."                                                        │
│                                                                        │
│  MONITORING:                                                           │
│  "We have implemented a fix and are monitoring the results.            │
│   [Impact] should be resolving. We will confirm resolution             │
│   shortly."                                                            │
│                                                                        │
│  RESOLVED:                                                             │
│  "This incident has been resolved. [Brief explanation].                │
│   We apologize for any inconvenience and are taking steps to           │
│   prevent this from happening again."                                  │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  WHAT NOT TO SAY:                                                      │
│                                                                        │
│  ❌ "A third-party provider caused this outage"                        │
│     (Customers don't care, you chose the provider)                     │
│                                                                        │
│  ❌ "An engineer deployed a bug"                                       │
│     (Internal details, sounds like blame)                              │
│                                                                        │
│  ❌ "We're not sure what's happening"                                  │
│     (Be honest but not helpless)                                       │
│                                                                        │
│  ❌ Technical jargon: "Database replication lag"                       │
│     (Customers don't understand or care)                               │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
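
Pre-loading these templates into your tooling keeps external updates consistent when nobody has time to wordsmith. A minimal sketch using Python string templates, deliberately not tied to any particular status-page product's API:

from string import Template

# The four status-page templates above, keyed by phase. Placeholders
# ($issue, $impact, $timeframe, $explanation) are filled in per incident.
STATUS_TEMPLATES = {
    "investigating": Template(
        "We are investigating reports of $issue. Some customers may "
        "experience $impact. We will provide updates every 15 minutes."),
    "identified": Template(
        "We have identified the cause of $issue and are working on a fix. "
        "$impact is ongoing. We expect to resolve this within $timeframe."),
    "monitoring": Template(
        "We have implemented a fix and are monitoring the results. "
        "$impact should be resolving. We will confirm resolution shortly."),
    "resolved": Template(
        "This incident has been resolved. $explanation We apologize for any "
        "inconvenience and are taking steps to prevent this from happening "
        "again."),
}

def render_status_update(phase, **details):
    """Fill in the template for a phase; missing placeholders stay visible."""
    return STATUS_TEMPLATES[phase].safe_substitute(**details)

# Example:
print(render_status_update("investigating",
                           issue="slow checkout",
                           impact="delays completing purchases"))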

Part III: Real-World Application

Chapter 7: Case Studies

7.1 How Google Does Incident Response

GOOGLE'S INCIDENT RESPONSE

From the SRE book and public talks:

STRUCTURE:
├── Incident Commander (IC)
│   └── Single point of decision-making
├── Operations Lead (Ops)
│   └── Works directly on the problem
├── Communications Lead (Comms)
│   └── Handles stakeholder updates
└── Planning Lead (for long incidents)
    └── Manages logistics, handoffs

KEY PRACTICES:

1. CLEAR COMMAND STRUCTURE
   ├── IC makes final decisions
   ├── IC doesn't debug - they coordinate
   ├── IC can delegate decisions explicitly
   └── Only one IC at a time

2. WAR ROOMS
   ├── Virtual or physical gathering point
   ├── All responders in one channel
   ├── Reduces communication lag
   └── Video call for complex incidents

3. ROLE HANDOFFS
   ├── Incidents can last hours or days
   ├── Explicit handoff process
   ├── "I'm handing IC to [name]"
   ├── Handoff briefing: current state, next steps
   └── Acknowledgment required

4. POSTMORTEM CULTURE
   ├── Postmortem for all significant incidents
   ├── Blameless by policy
   ├── Action items tracked to completion
   └── Postmortems shared widely for learning

5. ERROR BUDGETS AFFECT RESPONSE
   ├── If error budget exhausted, stricter process
   ├── Feature freeze until reliability improved
   └── Incident response tied to SLO

LESSON: The structure and roles matter more than individual heroics.

7.2 How Stripe Does Incident Response

STRIPE'S APPROACH

Context: Payments = zero tolerance for errors

KEY PRACTICES:

1. ON-CALL EXCELLENCE
   ├── Extensive on-call training
   ├── Shadow shifts before going live
   ├── Regular game days (practice incidents)
   └── On-call is a respected role

2. RUNBOOKS FOR EVERYTHING
   ├── Every alert has a runbook
   ├── Runbooks tested regularly
   ├── If no runbook, create one during incident
   └── Runbooks reviewed in postmortems

3. AUTOMATED RESPONSE
   ├── Automatic rollback on error spike
   ├── Automatic scaling on load
   ├── Automatic circuit breaking
   └── Reduce need for human intervention

4. RAPID POSTMORTEMS
   ├── Postmortem within 48 hours
   ├── Short format for quick turnaround
   ├── Focus on action items
   └── Follow-up tracked in system

5. LEARNING FROM NEAR-MISSES
   ├── Report issues even if no impact
   ├── Analyze near-misses like incidents
   ├── Find systemic issues before they cause outages
   └── "Near miss → fix it before it's a miss"

LESSON: For high-stakes systems, invest heavily in prevention and preparation.
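
Stripe's internal tooling isn't public, but the "automated response" idea above is easy to sketch: watch a key metric after a deploy and roll back without waiting for a human. In this illustrative Python sketch, get_error_rate and trigger_rollback are hypothetical hooks into your own metrics and deploy systems:

import time

ERROR_RATE_THRESHOLD = 0.02   # roll back if errors exceed 2%
CONSECUTIVE_BREACHES = 3      # require a sustained breach to avoid flapping

def watch_and_rollback(get_error_rate, trigger_rollback,
                       interval_s=30, watch_for_s=1800):
    """Poll the error rate after a deploy; roll back on a sustained spike.

    get_error_rate() and trigger_rollback(reason=...) are stand-ins for
    whatever your metrics system and deploy tooling actually expose.
    """
    breaches = 0
    deadline = time.monotonic() + watch_for_s
    while time.monotonic() < deadline:
        rate = get_error_rate()
        breaches = breaches + 1 if rate > ERROR_RATE_THRESHOLD else 0
        if breaches >= CONSECUTIVE_BREACHES:
            trigger_rollback(reason=f"error rate {rate:.1%} above threshold")
            return True
        time.sleep(interval_s)
    return False  # watch window passed without a sustained spike

Requiring several consecutive breaches trades a minute or two of detection latency for far fewer spurious rollbacks.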

Chapter 8: Common Mistakes

INCIDENT RESPONSE ANTI-PATTERNS

❌ MISTAKE 1: Too Many Cooks

Wrong:
  ├── Everyone jumps in to help
  ├── Multiple people making changes
  ├── No coordination
  └── Changes conflict with each other

Right:
  ├── Clear roles assigned
  ├── IC coordinates all changes
  ├── One person makes changes at a time
  └── "I'm changing X" announced before acting


❌ MISTAKE 2: Skipping Communication

Wrong:
  ├── Heads down fixing the problem
  ├── Stakeholders don't know what's happening
  ├── Customers learn about outage from Twitter
  └── Leadership surprised

Right:
  ├── First status update within 10 minutes
  ├── Regular updates even if no progress
  ├── Status page updated immediately
  └── Leadership briefed for SEV1


❌ MISTAKE 3: Blame Culture

Wrong:
  ├── "Who deployed this?"
  ├── "How could you not test this?"
  ├── Public shaming in postmortem
  └── People hide mistakes

Right:
  ├── Focus on what, not who
  ├── "How did our process allow this?"
  ├── Systemic fixes, not personal criticism
  └── Safe to report issues


❌ MISTAKE 4: No Postmortem

Wrong:
  ├── Incident resolved, move on
  ├── No analysis of what happened
  ├── Same incident happens again
  └── No improvement

Right:
  ├── Postmortem within 48 hours
  ├── Root cause analysis
  ├── Action items assigned and tracked
  └── Learnings shared widely


❌ MISTAKE 5: Alert Fatigue

Wrong:
  ├── 50 pages per week
  ├── Most alerts are false positives
  ├── On-call ignores alerts
  └── Real issues missed

Right:
  ├── < 5 pages per week (target)
  ├── Every alert is actionable
  ├── False positives are fixed
  └── On-call trusts alerts
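
The "< 5 pages per week" target is auditable if you can export your paging history. A rough Python sketch; the input format and the 50% noise cutoff are assumptions, not a standard:

from collections import Counter

def audit_alert_fatigue(pages, weekly_target=5, noise_cutoff=0.5):
    """pages: list of (fired_at: datetime, alert_name: str, actionable: bool).

    Returns weeks that blew past the page budget and alerts that were mostly
    noise.
    """
    pages_per_week = Counter(fired.strftime("%G-W%V") for fired, _, _ in pages)
    over_budget_weeks = {wk: n for wk, n in pages_per_week.items()
                         if n > weekly_target}

    totals = Counter(name for _, name, _ in pages)
    noise = Counter(name for _, name, actionable in pages if not actionable)
    noisy_alerts = {name: noise[name] / total
                    for name, total in totals.items()
                    if noise[name] / total > noise_cutoff}
    return over_budget_weeks, noisy_alerts

Anything this flags is a candidate for tuning, downgrading to a ticket, or deleting outright.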

Part IV: Interview Preparation

Chapter 9: Interview Tips

9.1 Incident Management Discussion Framework

DISCUSSING INCIDENTS IN INTERVIEWS

When asked "How would you handle an incident?":

1. STRUCTURE IS KEY
   "First, I'd establish clear roles:
   - Incident Commander for coordination
   - Technical Lead for debugging
   - Comms Lead for stakeholders
   This prevents chaos and ensures coordination."

2. MITIGATION OVER ROOT CAUSE
   "My first priority is mitigating user impact.
   Can we roll back? Can we fail over?
   Root cause investigation comes after
   we've stopped the bleeding."

3. COMMUNICATION
   "I'd set up an incident channel,
   update stakeholders every 15 minutes,
   and update the status page within 10 minutes.
   Silence is worse than 'we're investigating.'"

4. LEARNING
   "After resolution, I'd run a blameless postmortem.
   The goal is to find systemic issues and
   prevent recurrence, not assign blame.
   We'd track action items to completion."

5. PREVENTION
   "The best incident is one that doesn't happen.
   I'd invest in monitoring, alerting,
   runbooks, and regular game days
   to prepare the team."

9.2 Key Phrases

INCIDENT MANAGEMENT KEY PHRASES

On Response:
"My first question is 'what's the user impact?'
That determines severity and urgency.
A backend error with no user impact
is different from checkout being down."

On Roles:
"I believe in clear incident command.
One person coordinates, others execute.
Without this, you get five people
all trying to fix the problem differently."

On Communication:
"I'd update stakeholders every 15 minutes,
even if the update is 'still investigating.'
Silence causes more panic than honest updates
about uncertain situations."

On Postmortems:
"Postmortems must be blameless.
If we blame individuals, people hide mistakes.
The question is 'how did our system allow this?'
not 'who made the error?'"

On Prevention:
"Every incident should produce action items
that prevent recurrence. If we don't learn,
we're just firefighting forever.
Track action items like any other work."

On Alert Design:
"Every page should be actionable and important.
If an alert fires and the response is 'wait and see,'
it shouldn't be a page.
Alert fatigue causes missed real issues."

Chapter 10: Practice Problems

Problem 1: Payment Outage

Scenario: You're on-call. At 2 AM, you're paged: "Payment success rate dropped from 99% to 85%."

Questions:

  1. What's your first 5 minutes?
  2. How do you communicate?
  3. What's your mitigation strategy?

Answers:

  1. First 5 minutes:

    • Acknowledge the page
    • Check dashboard for scope and timeline
    • Is it all payments or specific type?
    • When did it start? Correlate with deployments
    • Quick assessment: real issue or false alarm?
  2. Communication:

    • Create incident channel: #inc-2024-01-15-payment-degradation
    • Declare SEV1 (payments are critical)
    • Page secondary on-call for support
    • Update status page: "Investigating payment processing issues"
    • Notify leadership via Slack/text
  3. Mitigation:

    • If recent deployment: consider rollback
    • If external provider: check their status, enable fallback
    • If load-related: scale up, enable rate limiting
    • Priority: restore payments, investigate later
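
The "correlate with deployments" step above is often the fastest diagnostic, and it's easy to script against your deploy history. A minimal sketch, assuming a hypothetical list of (service, finished_at) deploy records:

from datetime import datetime, timedelta

def suspect_deployments(incident_start, deployments, window_minutes=60):
    """Return deployments that finished shortly before the incident began.

    deployments is assumed to be a list of (service, finished_at) tuples
    pulled from your deploy tooling; the 60-minute window is a judgment call.
    """
    window = timedelta(minutes=window_minutes)
    return [(service, finished_at)
            for service, finished_at in deployments
            if timedelta(0) <= incident_start - finished_at <= window]

# Example: page at 02:03, payment-service deployed at 01:45 → prime rollback candidate.
print(suspect_deployments(
    datetime(2024, 1, 15, 2, 3),
    [("payment-service", datetime(2024, 1, 15, 1, 45)),
     ("search", datetime(2024, 1, 14, 16, 10))]))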

Problem 2: Postmortem Scenario

Scenario: An engineer deployed code that caused a 30-minute outage. Leadership wants to know "who did this."

Questions:

  1. How do you handle the leadership question?
  2. What should the postmortem focus on?
  3. What action items might emerge?

Answers:

  1. Handling leadership:

    • Redirect from "who" to "how"
    • "The individual is less important than understanding how our process allowed this to happen"
    • Share that blameless culture leads to better outcomes
    • Offer to present postmortem findings
  2. Postmortem focus:

    • Why did the code have a bug? (Testing gap?)
    • Why wasn't it caught in code review? (Reviewer didn't know to look?)
    • Why wasn't it caught in staging? (Data difference?)
    • Why wasn't rollback faster? (Process gap?)
  3. Action items:

    • Add automated test for this scenario
    • Update code review checklist
    • Improve staging data to match production
    • Reduce rollback time with automation
    • Share learnings with other teams

Chapter 11: Sample Interview Dialogue

Interviewer: "Walk me through how you'd handle a major production incident."

You: "Let me walk through the complete lifecycle.

Detection and Triage:"

When the alert fires:
1. Acknowledge within 5 minutes
2. Quick assessment: Is this real? What's the scope?
3. Declare severity based on user impact
4. For SEV1/2: Create incident channel, page backup

"For a SEV1 like 'checkout is down', I'd immediately escalate to a war room model:"

War Room Setup:
├── Slack channel: #inc-2024-01-15-checkout-down
├── Video call for complex coordination
└── Assign roles:
    ├── Myself as IC (or delegate if I'm debugging)
    ├── Senior engineer as Technical Lead
    └── Manager or senior as Comms Lead

Interviewer: "How do you decide what to do first?"

You: "Mitigation over investigation. The priority is stopping user impact.

My decision tree:"

1. Was there a recent deployment?
   YES → Can we roll back safely?
   
2. Is a dependency down?
   YES → Can we fail over or degrade gracefully?
   
3. Are we overloaded?
   YES → Can we scale up or rate limit?

"I'm looking for the fastest way to restore service,
even if it's not the 'correct' fix.
We investigate root cause after users are happy."

Interviewer: "What about communication during the incident?"

You: "Communication is as important as the fix itself.

My communication plan:"

Status Page:
├── First update within 10 minutes
├── Template: "We are investigating issues with [X]"
├── Update every 15 minutes even if no change
└── Include ETA when known

Internal:
├── Update incident channel every 10-15 minutes
├── Format: Status | What we know | What we're doing | Next update
└── Escalate to leadership for SEV1

Stakeholders:
├── Customer support team briefed
├── Sales team notified if major customers affected
└── Executive summary for leadership

Interviewer: "After it's resolved, then what?"

You: "The incident isn't really over until we've learned from it.

Post-incident:"

1. Within 48 hours: Blameless postmortem
   ├── Timeline reconstruction
   ├── 5 Whys root cause analysis
   ├── What went well / what went poorly
   └── Action items assigned with owners and dates

2. Action items tracked like any other work
   ├── In sprint planning
   ├── Regular check-ins
   └── Actually get done (not just documented)

3. Share learnings
   ├── Post summary in engineering channel
   ├── Update runbooks if needed
   └── Consider if training needed

Interviewer: "Good comprehensive answer. One last question: how do you prevent incidents?"

You: "Prevention is the best incident response.

Prevention strategies:"

1. Observability: Can't fix what you can't see
   ├── Metrics, logs, traces all correlated
   └── Dashboards for key services

2. Alerting: Right alerts that are actionable
   ├── Every alert has a runbook
   └── Regular review of alert quality

3. Testing: Catch issues before production
   ├── Load testing for capacity
   ├── Chaos engineering for resilience
   └── Canary deployments for safety

4. Game days: Practice incident response
   ├── Simulated incidents quarterly
   └── Test runbooks and communication

5. Culture: Safe to report issues
   ├── Blameless postmortems
   └── Celebrate near-miss reports

Summary

┌────────────────────────────────────────────────────────────────────────┐
│                    DAY 5 KEY TAKEAWAYS                                 │
│                                                                        │
│  INCIDENT LIFECYCLE:                                                   │
│  Detection → Triage → Response → Mitigation → Resolution → Learning    │
│                                                                        │
│  KEY ROLES:                                                            │
│  ├── Incident Commander: Coordinates, doesn't debug                    │
│  ├── Technical Lead: Diagnoses and fixes                               │
│  └── Communications Lead: Updates stakeholders                         │
│                                                                        │
│  PRIORITIES:                                                           │
│  ├── Mitigation first, investigation second                            │
│  ├── Communication is not optional                                     │
│  └── Restore service, then find root cause                             │
│                                                                        │
│  ON-CALL:                                                              │
│  ├── Sustainable rotations (≥8 people)                                 │
│  ├── Actionable alerts only                                            │
│  ├── Runbooks for every alert                                          │
│  └── Proper handoffs between shifts                                    │
│                                                                        │
│  POSTMORTEMS:                                                          │
│  ├── Blameless (systems fail, not people)                              │
│  ├── Within 48 hours while memory fresh                                │
│  ├── 5 Whys for root cause                                             │
│  └── Action items tracked to completion                                │
│                                                                        │
│  COMMUNICATION:                                                        │
│  ├── First update within 10 minutes                                    │
│  ├── Regular updates even if no progress                               │
│  ├── Plain language, not technical jargon                              │
│  └── Own the customer experience                                       │
│                                                                        │
│  KEY INSIGHT:                                                          │
│  The best incident response is built on preparation:                   │
│  good alerts, tested runbooks, practiced teams.                        │
│  Heroics are a sign of system failure.                                 │ 
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Further Reading

Books:

  • "Site Reliability Engineering" by Google (Free online)
  • "The Phoenix Project" by Gene Kim
  • "Incident Management for Operations" by Rob Schnepp

Resources:

  • PagerDuty Incident Response Guide (free)
  • Atlassian Incident Management Handbook
  • Google SRE Book - Chapter on Incident Response

Tools:

  • PagerDuty: Alerting and on-call management
  • Incident.io: Incident management workflows
  • Statuspage: External communication
  • Jeli: Incident analysis and learning

End of Day 5: Incident Management

End of Week 10: Production Readiness and Operational Excellence


Week 10 Complete!

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│                    WEEK 10 COMPLETE                                    │
│           Production Readiness and Operational Excellence              │
│                                                                        │
│  Day 1: SLIs, SLOs, SLAs                                               │
│         → Define what "healthy" means                                  │
│                                                                        │
│  Day 2: Observability                                                  │
│         → See whether you're healthy                                   │
│                                                                        │
│  Day 3: Deployment Strategies                                          │
│         → Maintain health through change                               │
│                                                                        │
│  Day 4: Capacity Planning                                              │
│         → Ensure health under load                                     │
│                                                                        │
│  Day 5: Incident Management                                            │
│         → Respond when health fails                                    │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  THE PRODUCTION ENGINEER'S TOOLKIT:                                    │
│                                                                        │
│  ✓ SLOs tell you what healthy means                                    │
│  ✓ Observability tells you if you're healthy                           │
│  ✓ Deployment strategies keep you healthy through change               │
│  ✓ Capacity planning keeps you healthy under load                      │
│  ✓ Incident management restores health when it fails                   │
│                                                                        │
│  Together, these skills let you operate reliable systems               │
│  that serve users well, even at scale, even under stress.              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Next: Week 10 Capstone — Bringing together everything from the entire 10-week program in a comprehensive system design interview simulation.

You've built the complete toolkit of a senior production engineer. The capstone will test your ability to apply all of it together.