Himanshu Kukreja

Week 10 — Day 3: Deployment Strategies

System Design Mastery Series — Production Readiness and Operational Excellence


Preface

You can define what healthy means (SLOs). You can see if you're healthy (observability).

Now: how do you ship changes without breaking that health?

THE DEPLOYMENT PARADOX

To improve a system, you must change it.
To keep a system stable, you must not change it.

Every deployment is a risk:
├── New code might have bugs
├── New config might be wrong
├── New dependencies might fail
├── New scale might break assumptions
└── Humans might make mistakes

Yet we MUST deploy:
├── Fix bugs
├── Add features
├── Patch security
├── Improve performance
└── Stay competitive

THE SOLUTION:
Deploy frequently, but deploy SAFELY.

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  DEPLOYMENT PHILOSOPHY                                                 │
│                                                                        │
│  OLD WAY:                                                              │
│  ├── Deploy monthly (or quarterly!)                                    │
│  ├── Big bang releases                                                 │
│  ├── "Change freeze" before releases                                   │
│  ├── All-hands-on-deck deployment days                                 │
│  └── Hope nothing breaks                                               │
│                                                                        │
│  MODERN WAY:                                                           │
│  ├── Deploy daily (or more!)                                           │
│  ├── Small, incremental changes                                        │
│  ├── Automated pipelines                                               │
│  ├── Gradual rollouts with monitoring                                  │
│  └── Automatic rollback if problems detected                           │
│                                                                        │
│  The safest deployment is a small one.                                 │
│  The safest rollback is a fast one.                                    │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Today, we learn to ship changes without fear.


Part I: Foundations

Chapter 1: Deployment Strategies Overview

1.1 The Deployment Strategy Spectrum

DEPLOYMENT STRATEGIES

┌───────────────────────────────────────────────────────────────────────┐
│                                                                       │
│  RISK ◄─────────────────────────────────────────────────────► SAFETY  │
│                                                                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
│  │   Big    │  │ Rolling  │  │  Blue-   │  │  Canary  │  │ Feature  │ │
│  │   Bang   │  │          │  │  Green   │  │          │  │  Flags   │ │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘ │
│                                                                       │
│  All at once   Gradual       Instant       Gradual       Code live,   │
│  High risk     replacement   switchover    % rollout     not active   │
│  Fast          Medium risk   Fast rollback Safest        Decoupled    │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

STRATEGY COMPARISON:

│ Strategy     │ Downtime │ Rollback      │ Resource Cost │ Complexity │
│──────────────│──────────│───────────────│───────────────│────────────│
│ Big Bang     │ Yes      │ Full redeploy │ 1x            │ Low        │
│ Rolling      │ No       │ Slow          │ 1x + buffer   │ Medium     │
│ Blue-Green   │ No       │ Instant       │ 2x            │ Medium     │
│ Canary       │ No       │ Fast          │ 1x + small    │ High       │
│ Feature Flag │ No       │ Instant       │ 1x            │ High       │

1.2 Big Bang Deployment (Don't Do This)

BIG BANG DEPLOYMENT

What it is:
├── Stop old version
├── Deploy new version
├── Start new version
└── Hope it works

Timeline:
  ════════════════╦═══════════╦════════════════════════════
  v1.0 Running    ║ DOWNTIME  ║  v1.1 Running
  ════════════════╩═══════════╩════════════════════════════
                  ↑           ↑
                  Stop v1.0   Start v1.1

Problems:
├── Downtime during deployment
├── All users hit new version at once
├── If broken, ALL users affected
├── Rollback requires another deployment
└── High stress, high risk

When it's acceptable:
├── Development environment
├── Non-critical internal tools
├── Scheduled maintenance windows
└── When other options truly impossible

For production services: AVOID.

1.3 Rolling Deployment

ROLLING DEPLOYMENT

What it is:
├── Gradually replace old instances with new
├── One (or few) at a time
├── Health checks before proceeding
└── No downtime

Timeline:
  Instance 1: ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  Instance 2: ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░
  Instance 3: ████████████████████████████░░░░░░░░░░░░░░░░
  Instance 4: ████████████████████████████████████░░░░░░░░
  Instance 5: ████████████████████████████████████████████
              ═══════════════════════════════════════════►
              ████ = v1.0    ░░░░ = v1.1

Traffic distribution during rollout:
  Start:     100% v1.0, 0% v1.1
  25% done:  75% v1.0, 25% v1.1
  50% done:  50% v1.0, 50% v1.1
  75% done:  25% v1.0, 75% v1.1
  Complete:  0% v1.0, 100% v1.1

Advantages:
├── No downtime
├── Gradual risk exposure
├── Can pause if issues detected
└── Natural load balancing

Disadvantages:
├── Two versions running simultaneously
├── Rollback is slow (reverse the process)
├── Database must support both versions
└── API must be backward compatible
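
To make the mechanics concrete, here is a minimal sketch of the rolling-update loop. The cluster and health objects are assumed clients (instances(), replace_instance(), and check() are illustrative names, not a specific orchestrator API); platforms like Kubernetes or ECS run this loop for you.

# deployment/rolling.py — illustrative sketch of a rolling update controller

import asyncio
import logging

logger = logging.getLogger(__name__)


async def rolling_deploy(cluster, health, version: str, batch_size: int = 1):
    """
    Replace instances one batch at a time, checking health before continuing.

    Assumed clients:
      - cluster.instances() -> list of instance ids
      - cluster.replace_instance(instance_id, version) -> deploys new version
      - health.check(instance_id) -> bool
    """
    instances = await cluster.instances()

    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]

        for instance_id in batch:
            logger.info("Replacing %s with %s", instance_id, version)
            await cluster.replace_instance(instance_id, version)

        # Wait for the new instances to report healthy before moving on.
        for instance_id in batch:
            for _ in range(30):  # up to ~5 minutes per instance
                if await health.check(instance_id):
                    break
                await asyncio.sleep(10)
            else:
                # Pause the rollout; remaining instances still run the old version.
                raise RuntimeError(f"{instance_id} never became healthy; rollout paused")

    logger.info("Rolling deployment of %s complete", version)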

Chapter 2: Blue-Green Deployment

2.1 How Blue-Green Works

BLUE-GREEN DEPLOYMENT

Architecture:
┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│                         Load Balancer                                  │
│                              │                                         │
│                              │ (switches between)                      │
│                              │                                         │
│              ┌───────────────┴───────────────┐                         │
│              │                               │                         │
│              ▼                               ▼                         │
│       ┌─────────────┐                 ┌─────────────┐                  │
│       │    BLUE     │                 │   GREEN     │                  │
│       │  (Current)  │                 │   (New)     │                  │
│       │             │                 │             │                  │
│       │  ┌───────┐  │                 │  ┌───────┐  │                  │
│       │  │ App 1 │  │                 │  │ App 1 │  │                  │
│       │  ├───────┤  │                 │  ├───────┤  │                  │
│       │  │ App 2 │  │                 │  │ App 2 │  │                  │
│       │  ├───────┤  │                 │  ├───────┤  │                  │
│       │  │ App 3 │  │                 │  │ App 3 │  │                  │
│       │  └───────┘  │                 │  └───────┘  │                  │
│       └─────────────┘                 └─────────────┘                  │
│              │                               │                         │
│              └───────────────┬───────────────┘                         │
│                              │                                         │
│                              ▼                                         │
│                     Shared Database                                    │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Deployment Process:

1. CURRENT STATE
   ├── Blue: Running v1.0, receiving traffic
   └── Green: Idle (or running old version)

2. DEPLOY NEW VERSION
   ├── Blue: Still running v1.0, receiving traffic
   └── Green: Deploy v1.1, run smoke tests

3. SWITCH TRAFFIC
   ├── Blue: Running v1.0, NO traffic
   └── Green: Running v1.1, receiving ALL traffic

4. VERIFY
   ├── Monitor metrics, logs, errors
   └── If problems: Switch back to Blue (instant rollback)

5. CLEANUP
   └── Blue becomes the standby for next deployment

2.2 Blue-Green Implementation

# deployment/blue_green.py

"""
Blue-Green deployment orchestration.

This manages the switch between environments
and provides instant rollback capability.
"""

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional
import asyncio
import logging
import uuid

logger = logging.getLogger(__name__)


class Environment(Enum):
    BLUE = "blue"
    GREEN = "green"


class DeploymentState(Enum):
    IDLE = "idle"
    DEPLOYING = "deploying"
    TESTING = "testing"
    SWITCHING = "switching"
    VERIFYING = "verifying"
    COMPLETE = "complete"
    ROLLED_BACK = "rolled_back"


@dataclass
class DeploymentRecord:
    """Record of a deployment."""
    id: str
    version: str
    started_at: datetime
    completed_at: Optional[datetime]
    state: DeploymentState
    active_environment: Environment
    previous_environment: Environment
    deployed_by: str
    rollback_reason: Optional[str] = None


class DeploymentError(Exception):
    """Raised when a deployment step fails."""


@dataclass
class TestResult:
    """Outcome of a smoke test run."""
    passed: bool
    failures: List[str]


@dataclass
class VerificationResult:
    """Outcome of post-switch verification."""
    success: bool
    reason: Optional[str]


class BlueGreenDeployer:
    """
    Orchestrates blue-green deployments.
    """
    
    def __init__(
        self,
        load_balancer,
        blue_cluster,
        green_cluster,
        health_checker,
        metrics_client
    ):
        self.lb = load_balancer
        self.blue = blue_cluster
        self.green = green_cluster
        self.health = health_checker
        self.metrics = metrics_client
        
        # Assume BLUE is serving traffic until a deployment records otherwise
        self.current_environment = Environment.BLUE
        self.current_deployment = None
    
    async def deploy(
        self,
        version: str,
        deployed_by: str,
        skip_tests: bool = False
    ) -> DeploymentRecord:
        """
        Execute a blue-green deployment.
        """
        # Determine which environment to deploy to
        if self.current_environment == Environment.BLUE:
            target = Environment.GREEN
            target_cluster = self.green
        else:
            target = Environment.BLUE
            target_cluster = self.blue
        
        deployment = DeploymentRecord(
            id=str(uuid.uuid4()),
            version=version,
            started_at=datetime.utcnow(),
            completed_at=None,
            state=DeploymentState.DEPLOYING,
            active_environment=self.current_environment,
            previous_environment=self.current_environment,
            deployed_by=deployed_by
        )
        
        try:
            # Step 1: Deploy to inactive environment
            logger.info(f"Deploying {version} to {target.value}")
            deployment.state = DeploymentState.DEPLOYING
            
            await target_cluster.deploy(version)
            
            # Step 2: Run health checks and tests
            logger.info(f"Running health checks on {target.value}")
            deployment.state = DeploymentState.TESTING
            
            if not skip_tests:
                health_result = await self.health.check_cluster(target_cluster)
                if not health_result.healthy:
                    raise DeploymentError(f"Health check failed: {health_result.reason}")
                
                # Run smoke tests
                test_result = await self._run_smoke_tests(target_cluster)
                if not test_result.passed:
                    raise DeploymentError(f"Smoke tests failed: {test_result.failures}")
            
            # Step 3: Switch traffic
            logger.info(f"Switching traffic to {target.value}")
            deployment.state = DeploymentState.SWITCHING
            
            await self.lb.switch_to(target)
            
            # Step 4: Verify in production
            logger.info("Verifying deployment")
            deployment.state = DeploymentState.VERIFYING
            
            # Wait and check metrics
            await asyncio.sleep(30)  # Give it 30 seconds
            
            verification = await self._verify_deployment(version)
            if not verification.success:
                logger.warning(f"Verification failed: {verification.reason}")
                await self.rollback(deployment, verification.reason)
                return deployment
            
            # Step 5: Complete
            deployment.state = DeploymentState.COMPLETE
            deployment.completed_at = datetime.utcnow()
            deployment.active_environment = target
            
            self.current_environment = target
            self.current_deployment = deployment
            
            logger.info(f"Deployment {deployment.id} complete")
            
            return deployment
            
        except Exception as e:
            logger.error(f"Deployment failed: {e}")
            await self.rollback(deployment, str(e))
            raise
    
    async def rollback(
        self,
        deployment: DeploymentRecord,
        reason: str
    ):
        """
        Rollback to previous environment.
        
        This is INSTANT because the old environment is still running.
        """
        logger.warning(f"Rolling back deployment {deployment.id}: {reason}")
        
        deployment.state = DeploymentState.ROLLED_BACK
        deployment.rollback_reason = reason
        deployment.completed_at = datetime.utcnow()
        
        # Switch traffic back to previous environment
        await self.lb.switch_to(deployment.previous_environment)
        
        # Record rollback metric
        self.metrics.increment(
            "deployments_rolled_back_total",
            labels={"version": deployment.version, "reason": reason}
        )
        
        logger.info(f"Rollback complete. Traffic restored to {deployment.previous_environment.value}")
    
    async def _run_smoke_tests(self, cluster) -> TestResult:
        """Run smoke tests against a cluster."""
        tests = [
            self._test_health_endpoint(cluster),
            self._test_api_endpoint(cluster),
            self._test_database_connectivity(cluster),
        ]
        
        results = await asyncio.gather(*tests, return_exceptions=True)
        failures = [r for r in results if isinstance(r, Exception) or not r.passed]
        
        return TestResult(
            passed=len(failures) == 0,
            failures=[str(f) for f in failures]
        )
    
    async def _verify_deployment(self, version: str) -> VerificationResult:
        """Verify deployment is healthy in production."""
        
        # Check error rate
        error_rate = await self.metrics.query(
            'sum(rate(http_requests_total{status=~"5.."}[1m])) / '
            'sum(rate(http_requests_total[1m]))'
        )
        
        if error_rate > 0.01:  # > 1% error rate
            return VerificationResult(
                success=False,
                reason=f"Error rate too high: {error_rate:.2%}"
            )
        
        # Check latency
        p99_latency = await self.metrics.query(
            'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))'
        )
        
        if p99_latency > 1.0:  # > 1 second
            return VerificationResult(
                success=False,
                reason=f"Latency too high: {p99_latency:.2f}s"
            )
        
        return VerificationResult(success=True, reason=None)
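
Driving the deployer might look roughly like this. The nginx_lb, blue_cluster, green_cluster, health_checker, and prometheus objects are hypothetical clients that satisfy the constructor's expectations (switch_to, deploy, check_cluster, query/increment); treat it as a sketch, not a finished runbook.

# Illustrative usage of BlueGreenDeployer; the injected clients are assumptions
import asyncio

async def main():
    deployer = BlueGreenDeployer(
        load_balancer=nginx_lb,          # needs switch_to(Environment)
        blue_cluster=blue_cluster,       # needs deploy(version)
        green_cluster=green_cluster,
        health_checker=health_checker,   # needs check_cluster(cluster)
        metrics_client=prometheus,       # needs query(...) and increment(...)
    )

    record = await deployer.deploy(version="v1.1.0", deployed_by="alice")

    if record.state == DeploymentState.COMPLETE:
        print(f"Live on {record.active_environment.value}")
    else:
        print(f"Rolled back: {record.rollback_reason}")

asyncio.run(main())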

2.3 Blue-Green Pros and Cons

BLUE-GREEN: WHEN TO USE

✅ ADVANTAGES:
├── Instant rollback (just switch back)
├── Full testing before switch
├── Zero downtime
├── Clean separation of environments
└── Easy to understand

❌ DISADVANTAGES:
├── Double infrastructure cost
├── Database must support both versions
├── Long-running connections may be disrupted
├── Not gradual (all-or-nothing switch)
└── Requires careful state management

BEST FOR:
├── Applications where instant rollback is critical
├── Teams with mature CI/CD pipelines
├── Services with clear version boundaries
└── When you can afford 2x infrastructure

NOT IDEAL FOR:
├── Services with lots of persistent connections
├── Applications with complex database migrations
├── Extremely high-traffic services (state issues)
└── Teams without automation

Chapter 3: Canary Deployment

3.1 How Canary Works

CANARY DEPLOYMENT

Named after "canary in a coal mine" — test with a small group first.

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  STAGE 1: CANARY (1% traffic)                                          │
│                                                                        │
│                         Load Balancer                                  │
│                              │                                         │
│                   ┌──────────┴──────────┐                              │
│                   │                     │                              │
│                   ▼ 99%                 ▼ 1%                           │
│           ┌─────────────┐       ┌─────────────┐                        │
│           │   STABLE    │       │   CANARY    │                        │
│           │   v1.0      │       │   v1.1      │                        │
│           │  10 pods    │       │   1 pod     │                        │
│           └─────────────┘       └─────────────┘                        │
│                                                                        │
│  Monitor for 15 minutes:                                               │
│  ├── Compare error rate: canary vs stable                              │
│  ├── Compare latency: canary vs stable                                 │
│  └── Check business metrics                                            │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  STAGE 2: EXPAND (10% traffic)                                         │
│                                                                        │
│                   ┌──────────┴──────────┐                              │
│                   │                     │                              │
│                   ▼ 90%                 ▼ 10%                          │
│           ┌─────────────┐       ┌─────────────┐                        │
│           │   STABLE    │       │   CANARY    │                        │
│           │   v1.0      │       │   v1.1      │                        │
│           │   9 pods    │       │   2 pods    │                        │
│           └─────────────┘       └─────────────┘                        │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  STAGE 3: EXPAND (50% traffic)                                         │
│                                                                        │
│                   ┌──────────┴──────────┐                              │
│                   │                     │                              │
│                   ▼ 50%                 ▼ 50%                          │
│           ┌─────────────┐       ┌─────────────┐                        │
│           │   STABLE    │       │   CANARY    │                        │
│           │   v1.0      │       │   v1.1      │                        │
│           │   5 pods    │       │   5 pods    │                        │
│           └─────────────┘       └─────────────┘                        │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  STAGE 4: COMPLETE (100% traffic)                                      │
│                                                                        │
│                              │                                         │
│                              ▼ 100%                                    │
│                      ┌─────────────┐                                   │
│                      │   STABLE    │                                   │
│                      │   v1.1      │                                   │
│                      │  10 pods    │                                   │
│                      └─────────────┘                                   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

3.2 Canary Implementation

# deployment/canary.py

"""
Canary deployment with automated analysis.

Gradually rolls out new version while comparing
metrics against the stable version.
"""

from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime, timedelta
from enum import Enum
import asyncio
import logging

logger = logging.getLogger(__name__)


@dataclass
class CanaryStage:
    """A stage in the canary rollout."""
    name: str
    traffic_percentage: int
    duration_minutes: int
    auto_promote: bool = True  # Auto-promote if metrics pass


@dataclass
class CanaryConfig:
    """Configuration for canary deployment."""
    stages: List[CanaryStage] = field(default_factory=lambda: [
        CanaryStage("canary-1pct", 1, 15),
        CanaryStage("canary-10pct", 10, 15),
        CanaryStage("canary-50pct", 50, 15),
        CanaryStage("full-rollout", 100, 0),
    ])
    
    # Thresholds for automated analysis
    max_error_rate_increase: float = 0.01  # 1% absolute increase
    max_latency_increase_pct: float = 0.10  # 10% relative increase
    min_request_count: int = 100  # Minimum requests before analysis


@dataclass
class CanaryAnalysis:
    """Result of canary analysis."""
    passed: bool
    stable_error_rate: float
    canary_error_rate: float
    stable_p99_latency: float
    canary_p99_latency: float
    request_count: int
    reason: Optional[str] = None


@dataclass
class CanaryResult:
    """Outcome of a full canary rollout."""
    version: str
    started_at: datetime
    stages_completed: List[str] = field(default_factory=list)
    completed_at: Optional[datetime] = None
    success: bool = False
    failed: bool = False
    failure_reason: Optional[str] = None
    failure_stage: Optional[str] = None


class CanaryDeployer:
    """
    Orchestrates canary deployments with automated analysis.
    """
    
    def __init__(
        self,
        kubernetes_client,
        metrics_client,
        traffic_manager,
        config: Optional[CanaryConfig] = None
    ):
        self.k8s = kubernetes_client
        self.metrics = metrics_client
        self.traffic = traffic_manager
        self.config = config or CanaryConfig()
    
    async def deploy(
        self,
        version: str,
        deployment_name: str,
        namespace: str = "default"
    ) -> CanaryResult:
        """
        Execute a canary deployment.
        """
        logger.info(f"Starting canary deployment of {version}")
        
        result = CanaryResult(
            version=version,
            started_at=datetime.utcnow(),
            stages_completed=[]
        )
        
        try:
            # Deploy canary pods (initially with 0% traffic)
            await self._deploy_canary_pods(version, deployment_name, namespace)
            
            # Progress through stages
            for stage in self.config.stages:
                logger.info(f"Entering stage: {stage.name} ({stage.traffic_percentage}%)")
                
                # Shift traffic
                await self.traffic.set_canary_weight(
                    deployment_name,
                    stage.traffic_percentage
                )
                
                # Wait for stage duration
                if stage.duration_minutes > 0:
                    await asyncio.sleep(stage.duration_minutes * 60)
                    
                    # Analyze metrics
                    analysis = await self._analyze_canary(deployment_name)
                    
                    if not analysis.passed:
                        logger.warning(f"Canary failed at {stage.name}: {analysis.reason}")
                        await self._rollback(deployment_name, namespace)
                        result.failed = True
                        result.failure_reason = analysis.reason
                        result.failure_stage = stage.name
                        return result
                    
                    logger.info(f"Stage {stage.name} passed analysis")
                
                result.stages_completed.append(stage.name)
            
            # Promote canary to stable
            await self._promote_canary(deployment_name, namespace)
            
            result.completed_at = datetime.utcnow()
            result.success = True
            
            logger.info(f"Canary deployment complete: {version}")
            
            return result
            
        except Exception as e:
            logger.error(f"Canary deployment failed: {e}")
            await self._rollback(deployment_name, namespace)
            result.failed = True
            result.failure_reason = str(e)
            return result
    
    async def _analyze_canary(self, deployment_name: str) -> CanaryAnalysis:
        """
        Compare canary metrics against stable.
        """
        # Get error rates
        stable_errors = await self.metrics.query(f'''
            sum(rate(http_requests_total{{
                deployment="{deployment_name}",
                version="stable",
                status=~"5.."
            }}[5m])) /
            sum(rate(http_requests_total{{
                deployment="{deployment_name}",
                version="stable"
            }}[5m]))
        ''')
        
        canary_errors = await self.metrics.query(f'''
            sum(rate(http_requests_total{{
                deployment="{deployment_name}",
                version="canary",
                status=~"5.."
            }}[5m])) /
            sum(rate(http_requests_total{{
                deployment="{deployment_name}",
                version="canary"
            }}[5m]))
        ''')
        
        # Get latencies
        stable_latency = await self.metrics.query(f'''
            histogram_quantile(0.99,
                sum(rate(http_request_duration_seconds_bucket{{
                    deployment="{deployment_name}",
                    version="stable"
                }}[5m])) by (le)
            )
        ''')
        
        canary_latency = await self.metrics.query(f'''
            histogram_quantile(0.99,
                sum(rate(http_request_duration_seconds_bucket{{
                    deployment="{deployment_name}",
                    version="canary"
                }}[5m])) by (le)
            )
        ''')
        
        # Get request count
        request_count = await self.metrics.query(f'''
            sum(increase(http_requests_total{{
                deployment="{deployment_name}",
                version="canary"
            }}[5m]))
        ''')
        
        # Check thresholds
        analysis = CanaryAnalysis(
            passed=True,
            stable_error_rate=stable_errors,
            canary_error_rate=canary_errors,
            stable_p99_latency=stable_latency,
            canary_p99_latency=canary_latency,
            request_count=int(request_count)
        )
        
        # Not enough traffic to analyze
        if analysis.request_count < self.config.min_request_count:
            logger.warning(f"Not enough requests for analysis: {analysis.request_count}")
            return analysis  # Pass by default if not enough data
        
        # Check error rate
        error_increase = canary_errors - stable_errors
        if error_increase > self.config.max_error_rate_increase:
            analysis.passed = False
            analysis.reason = (
                f"Error rate increased by {error_increase:.2%} "
                f"(threshold: {self.config.max_error_rate_increase:.2%})"
            )
            return analysis
        
        # Check latency
        if stable_latency > 0:
            latency_increase = (canary_latency - stable_latency) / stable_latency
            if latency_increase > self.config.max_latency_increase_pct:
                analysis.passed = False
                analysis.reason = (
                    f"Latency increased by {latency_increase:.2%} "
                    f"(threshold: {self.config.max_latency_increase_pct:.2%})"
                )
                return analysis
        
        return analysis
    
    async def _rollback(self, deployment_name: str, namespace: str):
        """Rollback canary deployment."""
        logger.info(f"Rolling back canary for {deployment_name}")
        
        # Remove canary traffic
        await self.traffic.set_canary_weight(deployment_name, 0)
        
        # Delete canary pods
        await self.k8s.delete_canary_deployment(deployment_name, namespace)
        
        logger.info("Rollback complete")
    
    async def _promote_canary(self, deployment_name: str, namespace: str):
        """Promote canary to stable."""
        logger.info(f"Promoting canary to stable for {deployment_name}")
        
        # Update stable deployment with canary version
        await self.k8s.promote_canary(deployment_name, namespace)
        
        # Remove canary traffic split
        await self.traffic.remove_canary_route(deployment_name)
        
        logger.info("Promotion complete")

3.3 Canary vs Blue-Green

CANARY VS BLUE-GREEN COMPARISON

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  ASPECT              │ BLUE-GREEN          │ CANARY                    │
│  ────────────────────┼─────────────────────┼─────────────────────────  │
│  Traffic switch      │ All at once (100%)  │ Gradual (1% → 100%)       │
│  Risk exposure       │ All users at once   │ Limited users first       │
│  Rollback speed      │ Instant             │ Fast (seconds)            │
│  Infrastructure cost │ 2x during deploy    │ 1x + small overhead       │
│  Complexity          │ Medium              │ High                      │
│  Comparison testing  │ Before switch only  │ Side-by-side continuous   │
│  Best for            │ Quick full cutover  │ Gradual risk reduction    │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  WHEN TO USE BLUE-GREEN:                                               │
│  ├── Need instant rollback capability                                  │
│  ├── Changes are well-tested                                           │
│  ├── Want clean environment separation                                 │
│  └── Can afford 2x infrastructure                                      │
│                                                                        │
│  WHEN TO USE CANARY:                                                   │
│  ├── Want to minimize blast radius                                     │
│  ├── Need production comparison metrics                                │
│  ├── High-traffic services (can't risk all users)                      │
│  └── Automated analysis/rollback desired                               │
│                                                                        │
│  ADVANCED: COMBINE THEM                                                │
│  ├── Blue-Green for environments                                       │
│  ├── Canary for traffic within an environment                          │
│  └── Best of both worlds                                               │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Chapter 4: Feature Flags

4.1 Feature Flags Concept

FEATURE FLAGS

Decouple DEPLOYMENT from RELEASE.

Traditional:
  Deploy code → Feature is live
  
  Problem: Deploy is risky because features go live immediately

With Feature Flags:
  Deploy code → Feature is OFF
  Turn on flag → Feature is live
  
  Benefit: Deploy anytime, release when ready

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  FEATURE FLAG TYPES                                                    │
│                                                                        │
│  1. RELEASE FLAGS                                                      │
│     Control feature rollout                                            │
│     ├── New checkout flow: OFF                                         │
│     ├── Enable for 10% of users                                        │
│     ├── Enable for beta users                                          │
│     └── Enable for everyone                                            │
│                                                                        │
│  2. EXPERIMENT FLAGS                                                   │
│     A/B testing                                                        │
│     ├── Button color: "blue" vs "green"                                │
│     ├── Pricing display: "monthly" vs "annual"                         │
│     └── Measure conversion rates                                       │
│                                                                        │
│  3. OPS FLAGS                                                          │
│     Operational controls                                               │
│     ├── Enable/disable expensive feature under load                    │
│     ├── Circuit breaker for external services                          │
│     └── Kill switch for problematic code                               │
│                                                                        │
│  4. PERMISSION FLAGS                                                   │
│     User-specific features                                             │
│     ├── Premium features for paid users                                │
│     ├── Beta access for selected users                                 │
│     └── Admin features                                                 │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

4.2 Feature Flag Implementation

# deployment/feature_flags.py

"""
Feature flag system for controlled rollouts.

Supports:
- Boolean flags (on/off)
- Percentage rollouts
- User targeting
- Environment targeting
"""

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional
from enum import Enum
import hashlib
import logging
import random

logger = logging.getLogger(__name__)


class RolloutStrategy(Enum):
    ALL = "all"           # Flag applies to everyone
    NONE = "none"         # Flag applies to no one
    PERCENTAGE = "percentage"  # Random percentage of users
    USER_LIST = "user_list"    # Specific users
    ATTRIBUTE = "attribute"     # Based on user attributes


@dataclass
class FeatureFlag:
    """A feature flag definition."""
    key: str
    name: str
    description: str
    
    # Targeting
    strategy: RolloutStrategy
    percentage: int = 0  # For PERCENTAGE strategy
    user_list: List[str] = field(default_factory=list)  # For USER_LIST
    attribute_rules: List[Dict] = field(default_factory=list)  # For ATTRIBUTE
    
    # Metadata
    owner: str = ""
    created_at: Optional[datetime] = None
    expires_at: Optional[datetime] = None
    
    # Default value
    default_value: bool = False
    
    # Variants (for A/B tests)
    variants: Dict[str, Any] = field(default_factory=dict)


@dataclass
class EvaluationContext:
    """Context for evaluating a feature flag."""
    user_id: Optional[str] = None
    tenant_id: Optional[str] = None
    environment: str = "production"
    attributes: Dict[str, Any] = field(default_factory=dict)


class FeatureFlagService:
    """
    Feature flag evaluation service.
    """
    
    def __init__(self, flag_store, metrics_client):
        self.store = flag_store
        self.metrics = metrics_client
        self._cache = {}
    
    def is_enabled(
        self,
        flag_key: str,
        context: EvaluationContext = None
    ) -> bool:
        """
        Check if a feature flag is enabled.
        
        Usage:
            if feature_flags.is_enabled("new_checkout", context):
                return new_checkout_flow()
            else:
                return old_checkout_flow()
        """
        context = context or EvaluationContext()
        
        try:
            flag = self._get_flag(flag_key)
            
            if flag is None:
                logger.warning(f"Unknown flag: {flag_key}")
                return False
            
            # Check if expired
            if flag.expires_at and datetime.utcnow() > flag.expires_at:
                return flag.default_value
            
            # Evaluate based on strategy
            result = self._evaluate(flag, context)
            
            # Record metric
            self.metrics.increment(
                "feature_flag_evaluation_total",
                labels={
                    "flag": flag_key,
                    "result": str(result),
                    "strategy": flag.strategy.value
                }
            )
            
            return result
            
        except Exception as e:
            logger.error(f"Error evaluating flag {flag_key}: {e}")
            return False
    
    def get_variant(
        self,
        flag_key: str,
        context: EvaluationContext = None
    ) -> Optional[str]:
        """
        Get variant for A/B test flags.
        
        Usage:
            variant = feature_flags.get_variant("button_color", context)
            # Returns "blue" or "green" based on user bucketing
        """
        context = context or EvaluationContext()
        flag = self._get_flag(flag_key)
        
        if flag is None or not flag.variants:
            return None
        
        # Consistent hashing to assign user to variant
        bucket = self._get_bucket(flag_key, context.user_id)
        
        # Distribute buckets across variants
        cumulative = 0
        for variant_name, percentage in flag.variants.items():
            cumulative += percentage
            if bucket < cumulative:
                return variant_name
        
        return list(flag.variants.keys())[0]  # Default to first variant
    
    def _evaluate(self, flag: FeatureFlag, context: EvaluationContext) -> bool:
        """Evaluate flag based on its strategy."""
        
        if flag.strategy == RolloutStrategy.ALL:
            return True
        
        if flag.strategy == RolloutStrategy.NONE:
            return False
        
        if flag.strategy == RolloutStrategy.PERCENTAGE:
            return self._evaluate_percentage(flag, context)
        
        if flag.strategy == RolloutStrategy.USER_LIST:
            return context.user_id in flag.user_list
        
        if flag.strategy == RolloutStrategy.ATTRIBUTE:
            return self._evaluate_attributes(flag, context)
        
        return flag.default_value
    
    def _evaluate_percentage(
        self,
        flag: FeatureFlag,
        context: EvaluationContext
    ) -> bool:
        """
        Percentage-based evaluation with consistent hashing.
        
        Same user always gets same result for same flag.
        """
        bucket = self._get_bucket(flag.key, context.user_id)
        return bucket < flag.percentage
    
    def _get_bucket(self, flag_key: str, user_id: Optional[str]) -> int:
        """
        Get consistent bucket (0-99) for user and flag.
        
        Same user + flag = same bucket, always.
        """
        if not user_id:
            # Random bucket for anonymous users (no stable identity to hash)
            return random.randint(0, 99)
        
        # Hash user_id + flag_key for consistent bucketing
        hash_input = f"{flag_key}:{user_id}"
        hash_value = hashlib.md5(hash_input.encode()).hexdigest()
        return int(hash_value[:8], 16) % 100
    
    def _evaluate_attributes(
        self,
        flag: FeatureFlag,
        context: EvaluationContext
    ) -> bool:
        """Evaluate attribute-based rules."""
        for rule in flag.attribute_rules:
            attribute = rule.get("attribute")
            operator = rule.get("operator")
            value = rule.get("value")
            
            user_value = context.attributes.get(attribute)
            if user_value is None:
                continue
            
            if operator == "equals" and user_value == value:
                return True
            if operator == "in" and user_value in value:
                return True
            if operator == "greater_than" and user_value > value:
                return True
        
        return flag.default_value
    
    def _get_flag(self, flag_key: str) -> Optional[FeatureFlag]:
        """Get flag from store with caching."""
        if flag_key in self._cache:
            return self._cache[flag_key]
        
        flag = self.store.get(flag_key)
        if flag:
            self._cache[flag_key] = flag
        
        return flag


# =============================================================================
# USAGE EXAMPLES
# =============================================================================

# Initialize
feature_flags = FeatureFlagService(flag_store, metrics)

# Simple boolean check
if feature_flags.is_enabled("new_checkout"):
    process_with_new_checkout()
else:
    process_with_old_checkout()

# With user context
context = EvaluationContext(
    user_id="user-123",
    tenant_id="tenant-456",
    attributes={
        "plan": "enterprise",
        "country": "US"
    }
)

if feature_flags.is_enabled("advanced_analytics", context):
    show_advanced_analytics()

# A/B test
button_color = feature_flags.get_variant("button_color_test", context)
render_button(color=button_color)
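
For completeness, this is roughly what defining and registering a percentage-rollout flag looks like with the dataclass above; the flag_store.save(...) call is an assumption about the store's interface.

# Illustrative: registering a 10% rollout flag (flag_store.save is an assumed API)
new_checkout_flag = FeatureFlag(
    key="new_checkout",
    name="New checkout flow",
    description="Gradual rollout of the rewritten checkout flow",
    strategy=RolloutStrategy.PERCENTAGE,
    percentage=10,                                       # consistent 10% of users
    owner="payments-team",
    created_at=datetime.utcnow(),
    expires_at=datetime.utcnow() + timedelta(days=30),   # forces a cleanup decision
    default_value=False,
)

flag_store.save(new_checkout_flag)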

4.3 Feature Flag Best Practices

FEATURE FLAG BEST PRACTICES

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  1. HAVE A LIFECYCLE                                                   │
│  ───────────────────                                                   │
│  Flags should be temporary!                                            │
│                                                                        │
│  Lifecycle:                                                            │
│  ├── Create flag (with owner and expiration)                           │
│  ├── Roll out gradually                                                │
│  ├── Enable for 100%                                                   │
│  ├── REMOVE FLAG from code                                             │
│  └── Archive flag                                                      │
│                                                                        │
│  Stale flags are tech debt.                                            │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  2. KEEP FLAGS SIMPLE                                                  │
│  ────────────────────                                                  │
│  Flags should gate features, not business logic.                       │
│                                                                        │
│  ✅ GOOD: feature_flags.is_enabled("new_search")                       │
│  ❌ BAD:  feature_flags.get_value("search_algorithm_config")           │
│                                                                        │
│  Complex configuration should be in config files, not flags.           │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  3. TEST BOTH PATHS                                                    │
│  ──────────────────                                                    │
│  Every flag creates a code branch. Test both.                          │
│                                                                        │
│  def test_feature_enabled():                                           │
│      with flag_enabled("new_checkout"):                                │
│          assert checkout() == expected_new                             │
│                                                                        │
│  def test_feature_disabled():                                          │
│      with flag_disabled("new_checkout"):                               │
│          assert checkout() == expected_old                             │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  4. HAVE KILL SWITCHES                                                 │
│  ─────────────────────                                                 │
│  Some flags should always exist for operational control.               │
│                                                                        │
│  Examples:                                                             │
│  ├── disable_expensive_queries                                         │
│  ├── enable_read_only_mode                                             │
│  ├── disable_external_integrations                                     │
│  └── enable_maintenance_mode                                           │
│                                                                        │
│  These are not temporary — they're operational controls.               │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  5. MONITOR FLAG USAGE                                                 │
│  ─────────────────────                                                 │
│  Track:                                                                │
│  ├── How often each flag is evaluated                                  │
│  ├── Which flags are always true/false (candidates for removal)        │
│  ├── Which flags are never evaluated (dead code)                       │
│  └── Errors in flag evaluation                                         │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
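
The flag_enabled / flag_disabled helpers used in point 3 are not part of FeatureFlagService; a minimal sketch of them as context managers, assuming the module-level feature_flags instance from the usage examples above, might look like this.

# tests/flag_helpers.py — hypothetical helpers for forcing a flag on or off in tests
from contextlib import contextmanager


@contextmanager
def _force_flag(flag_key: str, value: bool):
    """Temporarily force feature_flags.is_enabled(flag_key) to a fixed value."""
    original = feature_flags.is_enabled

    def forced(key, context=None):
        if key == flag_key:
            return value
        return original(key, context)

    feature_flags.is_enabled = forced
    try:
        yield
    finally:
        feature_flags.is_enabled = original


@contextmanager
def flag_enabled(flag_key: str):
    with _force_flag(flag_key, True):
        yield


@contextmanager
def flag_disabled(flag_key: str):
    with _force_flag(flag_key, False):
        yield


def test_feature_enabled():
    with flag_enabled("new_checkout"):
        assert feature_flags.is_enabled("new_checkout") is True


def test_feature_disabled():
    with flag_disabled("new_checkout"):
        assert feature_flags.is_enabled("new_checkout") is False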

Part II: Advanced Topics

Chapter 5: Database Migrations

5.1 The Database Migration Problem

THE DATABASE MIGRATION CHALLENGE

Code deployments are (relatively) easy:
├── Deploy new code
├── If broken, rollback to old code
└── Done

Database changes are HARD:
├── Can't just "rollback" a migration
├── Schema changes affect running code
├── Data migrations can take hours
├── Both old and new code must work with DB
└── Mistakes can cause data loss

The key insight:
During deployment, BOTH versions of code are running.
Your database must support BOTH versions.

5.2 Expand-Contract Pattern

EXPAND-CONTRACT (PARALLEL CHANGE) PATTERN

The safe way to make breaking schema changes.

Example: Rename column "name" to "full_name"

❌ WRONG: Direct rename
  1. Deploy migration: ALTER TABLE users RENAME COLUMN name TO full_name;
  2. Deploy code that uses "full_name"
  
  Problem: During deployment, old code (using "name") and new code
  (using "full_name") are both running. One of them will break.

✅ RIGHT: Expand-Contract (3-phase migration)

PHASE 1: EXPAND
  - Add new column
  - Write to both columns
  - Read from old column
  
  Migration:
    ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
  
  Code:
    # Write to both
    UPDATE users SET name = $1, full_name = $1 WHERE id = $2
    # Read from old
    SELECT name FROM users WHERE id = $1
  
  Run the migration first (the column must exist), then deploy the dual-write code.
  Both old and new code work.

PHASE 2: MIGRATE
  - Backfill data
  - Switch reads to new column
  - Continue writing to both
  
  Migration:
    UPDATE users SET full_name = name WHERE full_name IS NULL;
  
  Code:
    # Write to both (for safety)
    UPDATE users SET name = $1, full_name = $1 WHERE id = $2
    # Read from new
    SELECT full_name FROM users WHERE id = $1
  
  Run the backfill first, then deploy the code that reads from full_name.

PHASE 3: CONTRACT
  - Stop writing to old column
  - Remove old column
  
  Code:
    # Write to new only
    UPDATE users SET full_name = $1 WHERE id = $2
    # Read from new
    SELECT full_name FROM users WHERE id = $1
  
  Migration (after all old code is gone):
    ALTER TABLE users DROP COLUMN name;

Timeline:
  ═══════════════════════════════════════════════════════════════════►
  │         │              │              │              │
  │ Original│   Expand     │   Migrate    │   Contract   │
  │         │              │              │              │
  │ name    │ name +       │ name +       │ full_name    │
  │ only    │ full_name    │ full_name    │ only         │
  │         │ (write both) │ (read new)   │              │
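
On the application side, the expand phase is usually just a small dual-write in the data-access layer. The sketch below assumes an async db client exposing execute and fetch_one (the same shape used by the migration runner in 5.3).

# Illustrative repository during the EXPAND phase (db client is assumed)
class UserRepository:
    def __init__(self, db):
        self.db = db  # assumed async client exposing execute() / fetch_one()

    async def update_name(self, user_id: str, name: str) -> None:
        # Phase 1 (expand): write BOTH columns so old and new code agree
        await self.db.execute(
            "UPDATE users SET name = $1, full_name = $1 WHERE id = $2",
            name, user_id,
        )

    async def get_name(self, user_id: str) -> str:
        # Phase 1 still reads the OLD column; phase 2 flips this to
        # SELECT full_name without touching the write path.
        row = await self.db.fetch_one(
            "SELECT name FROM users WHERE id = $1", user_id
        )
        return row[0]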

5.3 Migration Implementation

# deployment/migrations.py

"""
Safe database migration patterns.

Implements expand-contract for breaking changes.
"""

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional
import logging

logger = logging.getLogger(__name__)


class MigrationPhase(Enum):
    EXPAND = "expand"
    MIGRATE = "migrate"
    CONTRACT = "contract"


@dataclass
class Migration:
    """A database migration."""
    version: str
    name: str
    phase: MigrationPhase
    
    # SQL to run
    up_sql: str
    down_sql: str
    
    # Validation
    pre_check_sql: Optional[str] = None  # Must pass before running
    post_check_sql: Optional[str] = None  # Must pass after running
    
    # Safety
    requires_lock: bool = False  # Requires table lock
    estimated_duration: str = ""  # e.g., "5 minutes", "2 hours"
    affects_writes: bool = False  # Will block writes


class MigrationError(Exception):
    """Raised when a migration or one of its checks fails."""


@dataclass
class MigrationResult:
    """Outcome of a single migration run."""
    migration: Migration
    started_at: datetime
    completed_at: Optional[datetime] = None
    success: bool = False
    error: Optional[str] = None


class SafeMigrationRunner:
    """
    Runs migrations safely with validation and rollback support.
    """
    
    def __init__(self, db_connection, lock_manager):
        self.db = db_connection
        self.locks = lock_manager
    
    async def run_migration(
        self,
        migration: Migration,
        dry_run: bool = False
    ) -> MigrationResult:
        """
        Run a migration with safety checks.
        """
        logger.info(f"Starting migration: {migration.name}")
        
        result = MigrationResult(
            migration=migration,
            started_at=datetime.utcnow()
        )
        
        try:
            # Pre-check
            if migration.pre_check_sql:
                check_result = await self.db.fetch_one(migration.pre_check_sql)
                if not check_result or not check_result[0]:
                    raise MigrationError("Pre-check failed")
            
            # Acquire lock if needed
            if migration.requires_lock:
                await self.locks.acquire("migration_lock")
            
            # Run migration
            if not dry_run:
                await self.db.execute(migration.up_sql)
            else:
                logger.info(f"DRY RUN: Would execute: {migration.up_sql}")
            
            # Post-check (skipped on dry runs, since nothing was executed)
            if migration.post_check_sql and not dry_run:
                check_result = await self.db.fetch_one(migration.post_check_sql)
                if not check_result or not check_result[0]:
                    raise MigrationError("Post-check failed, rolling back")
            
            # Record success (dry runs are not recorded)
            if not dry_run:
                await self._record_migration(migration)
            
            result.success = True
            result.completed_at = datetime.utcnow()
            
            logger.info(f"Migration complete: {migration.name}")
            
            return result
            
        except Exception as e:
            logger.error(f"Migration failed: {e}")
            
            # Attempt rollback
            if not dry_run:
                try:
                    await self.db.execute(migration.down_sql)
                    logger.info("Rollback successful")
                except Exception as rollback_error:
                    logger.error(f"Rollback failed: {rollback_error}")
            
            result.success = False
            result.error = str(e)
            
            return result
            
        finally:
            if migration.requires_lock:
                await self.locks.release("migration_lock")


# =============================================================================
# EXAMPLE: RENAME COLUMN MIGRATION SET
# =============================================================================

RENAME_COLUMN_MIGRATIONS = [
    # Phase 1: Expand - Add new column
    Migration(
        version="2024_01_15_001",
        name="add_full_name_column",
        phase=MigrationPhase.EXPAND,
        up_sql="""
            ALTER TABLE users 
            ADD COLUMN full_name VARCHAR(255);
            
            -- Trigger to keep columns in sync during transition
            CREATE OR REPLACE FUNCTION sync_name_columns()
            RETURNS TRIGGER AS $$
            BEGIN
                IF NEW.name IS DISTINCT FROM OLD.name THEN
                    NEW.full_name := NEW.name;
                ELSIF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
                    NEW.name := NEW.full_name;
                END IF;
                RETURN NEW;
            END;
            $$ LANGUAGE plpgsql;
            
            CREATE TRIGGER sync_names
            BEFORE UPDATE ON users
            FOR EACH ROW EXECUTE FUNCTION sync_name_columns();
        """,
        down_sql="""
            DROP TRIGGER IF EXISTS sync_names ON users;
            DROP FUNCTION IF EXISTS sync_name_columns();
            ALTER TABLE users DROP COLUMN IF EXISTS full_name;
        """,
        post_check_sql="""
            SELECT EXISTS (
                SELECT 1 FROM information_schema.columns 
                WHERE table_name = 'users' AND column_name = 'full_name'
            );
        """
    ),
    
    # Phase 2: Migrate - Backfill data
    Migration(
        version="2024_01_16_001",
        name="backfill_full_name",
        phase=MigrationPhase.MIGRATE,
        up_sql="""
            -- Backfill in batches to avoid locking
            DO $$
            DECLARE
                batch_size INT := 10000;
                updated INT;
            BEGIN
                LOOP
                    UPDATE users 
                    SET full_name = name 
                    WHERE full_name IS NULL 
                    AND id IN (
                        SELECT id FROM users 
                        WHERE full_name IS NULL 
                        LIMIT batch_size
                    );
                    
                    GET DIAGNOSTICS updated = ROW_COUNT;
                    EXIT WHEN updated = 0;
                    
                    -- Small pause between batches
                    PERFORM pg_sleep(0.1);
                END LOOP;
            END $$;
        """,
        down_sql="-- No rollback needed for backfill",
        post_check_sql="""
            SELECT COUNT(*) = 0 
            FROM users 
            WHERE full_name IS NULL AND name IS NOT NULL;
        """,
        estimated_duration="30 minutes for 1M rows"
    ),
    
    # Phase 3: Contract - Remove old column (after code fully migrated)
    Migration(
        version="2024_01_20_001",
        name="remove_name_column",
        phase=MigrationPhase.CONTRACT,
        up_sql="""
            DROP TRIGGER IF EXISTS sync_names ON users;
            DROP FUNCTION IF EXISTS sync_name_columns();
            ALTER TABLE users DROP COLUMN name;
        """,
        down_sql="""
            ALTER TABLE users ADD COLUMN name VARCHAR(255);
            UPDATE users SET name = full_name;
        """,
        pre_check_sql="""
            -- Ensure no code is reading from 'name' column
            -- This should be checked via application logs/metrics
            SELECT true;
        """,
        requires_lock=True
    ),
]
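
To tie the three phases together, here is a minimal driver sketch. It assumes it lives in the same module as the migration set above, and the runner interface (an async run(migration) returning an object with success/error fields) is an illustrative assumption, not the executor class shown earlier. In a real rollout you would also pause between phases to deploy the corresponding application changes.

# Minimal phase driver (same module as RENAME_COLUMN_MIGRATIONS above).
# StubRunner and its run() interface are illustrative assumptions.

import asyncio
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class MigrationResult:
    success: bool
    error: Optional[str] = None


class StubRunner:
    """Stand-in executor: logs what it would run instead of touching a database."""

    async def run(self, migration) -> MigrationResult:
        logger.info(f"Would apply {migration.version} ({migration.phase}): {migration.name}")
        return MigrationResult(success=True)


async def run_phases(runner, migrations) -> bool:
    """
    Apply migrations in order, stopping at the first failure.
    In practice you pause between phases: EXPAND, then deploy dual-write
    code, then MIGRATE, then switch reads, and only then CONTRACT.
    """
    for migration in migrations:
        result = await runner.run(migration)
        if not result.success:
            logger.error(f"{migration.version} failed: {result.error}")
            return False
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    asyncio.run(run_phases(StubRunner(), RENAME_COLUMN_MIGRATIONS))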

Chapter 6: Rollback Strategies

6.1 Types of Rollback

ROLLBACK STRATEGIES

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  1. CODE ROLLBACK                                                      │
│  ─────────────────                                                     │
│  Roll back to previous code version.                                   │
│                                                                        │
│  Methods:                                                              │
│  ├── kubectl rollout undo deployment/myapp                             │
│  ├── Deploy previous version tag                                       │
│  ├── Blue-green: switch back to previous environment                   │
│  └── Revert git commit, redeploy                                       │
│                                                                        │
│  Speed: Fast (seconds to minutes)                                      │
│  Risk: Low if code is backward compatible                              │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  2. FEATURE ROLLBACK                                                   │
│  ────────────────────                                                  │
│  Turn off feature flag without code change.                            │
│                                                                        │
│  Methods:                                                              │
│  ├── Flip flag in feature flag service                                 │
│  ├── No deployment needed                                              │
│  └── Instant effect                                                    │
│                                                                        │
│  Speed: Instant (seconds)                                              │
│  Risk: Very low                                                        │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  3. CONFIG ROLLBACK                                                    │
│  ──────────────────                                                    │
│  Revert configuration change.                                          │
│                                                                        │
│  Methods:                                                              │
│  ├── Update ConfigMap/environment variables                            │
│  ├── Rolling restart to pick up new config                             │
│  └── Some systems support hot reload                                   │
│                                                                        │
│  Speed: Minutes                                                        │
│  Risk: Low to medium                                                   │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  4. DATA ROLLBACK                                                      │
│  ─────────────────                                                     │
│  Revert data changes. THE HARDEST TYPE.                                │
│                                                                        │
│  Methods:                                                              │
│  ├── Restore from backup (slow, nuclear option)                        │
│  ├── Run compensating transactions                                     │
│  ├── Point-in-time recovery                                            │
│  └── Manual data fixes                                                 │
│                                                                        │
│  Speed: Slow (minutes to hours)                                        │
│  Risk: High (potential data loss)                                      │
│                                                                        │
│  BEST APPROACH: Avoid needing data rollback                            │
│  ├── Test migrations thoroughly                                        │
│  ├── Use expand-contract pattern                                       │
│  ├── Make changes reversible                                           │
│  └── Take backups before risky changes                                 │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
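
To make the first category concrete, here is a small sketch that wraps the kubectl commands listed above. The deployment name and namespace are placeholders; in practice this logic usually lives inside your deployment tooling rather than a standalone script.

# deployment/code_rollback.py
#
# Sketch of a code rollback via kubectl. Deployment name and namespace
# below are placeholders.

import subprocess
import sys


def rollback_deployment(name: str, namespace: str = "default") -> bool:
    """Roll back a Kubernetes Deployment to its previous revision and wait for it."""
    undo = subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{name}", "-n", namespace],
        capture_output=True, text=True,
    )
    if undo.returncode != 0:
        print(f"rollout undo failed: {undo.stderr}", file=sys.stderr)
        return False

    # Block until the rolled-back revision is fully available (or times out).
    status = subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{name}",
         "-n", namespace, "--timeout=120s"],
        capture_output=True, text=True,
    )
    return status.returncode == 0


if __name__ == "__main__":
    ok = rollback_deployment("myapp")
    print("rollback", "succeeded" if ok else "failed")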

6.2 Automated Rollback

# deployment/auto_rollback.py

"""
Automated rollback based on metrics.

If deployment causes problems, automatically roll back.
"""

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
import asyncio
import logging

logger = logging.getLogger(__name__)


@dataclass
class RollbackConfig:
    """Configuration for automated rollback."""
    # Metric thresholds
    max_error_rate: float = 0.05  # 5%
    max_latency_p99_ms: int = 1000  # 1 second
    min_success_rate: float = 0.95  # 95%
    
    # Timing
    observation_window_seconds: int = 300  # 5 minutes
    check_interval_seconds: int = 30
    
    # Behavior
    auto_rollback_enabled: bool = True
    require_confirmation: bool = False  # For critical services


class AutomatedRollbackMonitor:
    """
    Monitors deployments and triggers rollback if metrics degrade.
    """
    
    def __init__(
        self,
        metrics_client,
        deployer,
        alerter,
        config: Optional[RollbackConfig] = None
    ):
        self.metrics = metrics_client
        self.deployer = deployer
        self.alerter = alerter
        self.config = config or RollbackConfig()
    
    async def monitor_deployment(
        self,
        deployment_id: str,
        baseline_metrics: dict
    ):
        """
        Monitor a deployment and rollback if needed.
        """
        logger.info(f"Starting rollback monitor for deployment {deployment_id}")
        
        start_time = datetime.utcnow()
        end_time = start_time + timedelta(seconds=self.config.observation_window_seconds)
        
        while datetime.utcnow() < end_time:
            await asyncio.sleep(self.config.check_interval_seconds)
            
            # Check metrics
            current_metrics = await self._get_current_metrics()
            
            # Compare to baseline and thresholds
            should_rollback, reason = self._should_rollback(
                baseline_metrics,
                current_metrics
            )
            
            if should_rollback:
                logger.warning(f"Rollback triggered: {reason}")
                
                if self.config.require_confirmation:
                    # Alert and wait for confirmation
                    await self.alerter.send_critical(
                        f"Deployment {deployment_id} may need rollback: {reason}. "
                        f"Reply 'ROLLBACK' to confirm."
                    )
                    # In a real system, this would wait for response
                elif self.config.auto_rollback_enabled:
                    await self._execute_rollback(deployment_id, reason)
                    return RollbackResult(
                        rolled_back=True,
                        reason=reason,
                        metrics=current_metrics
                    )
        
        logger.info(f"Deployment {deployment_id} passed monitoring period")
        return RollbackResult(rolled_back=False, metrics=current_metrics)
    
    def _should_rollback(
        self,
        baseline: dict,
        current: dict
    ) -> tuple[bool, Optional[str]]:
        """
        Determine if rollback is needed.
        """
        # Check absolute thresholds
        if current['error_rate'] > self.config.max_error_rate:
            return True, f"Error rate {current['error_rate']:.2%} exceeds threshold"
        
        if current['latency_p99_ms'] > self.config.max_latency_p99_ms:
            return True, f"Latency {current['latency_p99_ms']}ms exceeds threshold"
        
        if current['success_rate'] < self.config.min_success_rate:
            return True, f"Success rate {current['success_rate']:.2%} below threshold"
        
        # Check relative degradation vs baseline
        if baseline:
            error_increase = current['error_rate'] - baseline['error_rate']
            if error_increase > 0.02:  # 2% absolute increase
                return True, f"Error rate increased by {error_increase:.2%}"
            
            latency_increase_pct = (
                (current['latency_p99_ms'] - baseline['latency_p99_ms']) /
                baseline['latency_p99_ms']
            ) if baseline['latency_p99_ms'] > 0 else 0
            
            if latency_increase_pct > 0.5:  # 50% increase
                return True, f"Latency increased by {latency_increase_pct:.0%}"
        
        return False, None
    
    async def _execute_rollback(self, deployment_id: str, reason: str):
        """Execute the rollback."""
        logger.warning(f"Executing rollback for {deployment_id}")
        
        # Send alert
        await self.alerter.send_critical(
            f"AUTO-ROLLBACK executed for deployment {deployment_id}. "
            f"Reason: {reason}"
        )
        
        # Execute rollback
        await self.deployer.rollback(deployment_id)
        
        logger.info(f"Rollback complete for {deployment_id}")

Part III: Real-World Application

Chapter 7: Case Studies

7.1 How Amazon Deploys

AMAZON'S DEPLOYMENT PRACTICES

Scale:
├── Thousands of deployments per day
├── Thousands of services
├── Millions of servers
└── Zero-downtime expectation

Key Practices:

1. ONE-BOX DEPLOYMENT
   ├── Deploy to single instance first
   ├── Run tests and monitor
   ├── If healthy, proceed to wider rollout
   └── If not, automatic rollback

2. WAVE DEPLOYMENTS (see the code sketch after the Lessons list below)
   ├── Deploy in waves: 1% → 5% → 10% → 25% → 50% → 100%
   ├── Bake time between waves
   ├── Automated health checks between waves
   └── Automatic rollback if metrics degrade

3. REGION-BY-REGION
   ├── Deploy to one region first
   ├── Verify in production
   ├── Then roll out to other regions
   └── Can quickly isolate issues

4. FEATURE FLAGS
   ├── Deploy code dark
   ├── Enable features gradually
   ├── A/B testing for new features
   └── Kill switches for all major features

5. APOLLO (Config Management)
   ├── Separate config from code
   ├── Config changes without deploy
   ├── Instant config propagation
   └── Versioned config with rollback

Lessons:
├── Small changes are safer than big changes
├── Automate everything
├── Monitor everything
├── Roll back fast
└── Deploy often to reduce batch size
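
For illustration, here is a minimal sketch of the wave pattern from practice 2 above: fixed percentage steps, a bake time between waves, and an abort at the first failed health check. The set_traffic_percent, is_healthy, and rollback callables are placeholders for your traffic-shifting and metrics tooling; this is not Amazon's internal implementation.

# deployment/wave_rollout.py
#
# Sketch of a wave (staged-percentage) rollout. The three callables are
# placeholders for real traffic-shifting, health-check, and rollback hooks.

import time
from typing import Callable

WAVES = [1, 5, 10, 25, 50, 100]   # percent of traffic per wave
BAKE_TIME_SECONDS = 300           # observation time between waves


def wave_rollout(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Advance through the waves, rolling back at the first unhealthy check."""
    for percent in WAVES:
        set_traffic_percent(percent)
        print(f"Wave at {percent}%: baking for {BAKE_TIME_SECONDS}s")
        time.sleep(BAKE_TIME_SECONDS)

        if not is_healthy():
            print(f"Health check failed at {percent}%, rolling back")
            rollback()
            return False

    print("Rollout complete at 100%")
    return True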

7.2 How Google Does It

GOOGLE'S DEPLOYMENT APPROACH

Philosophy:
"Make rollbacks easy and deployments boring."

Key Systems:

1. BORG/KUBERNETES
   ├── Declarative deployment
   ├── Automated rollout/rollback
   ├── Health checks built-in
   └── Self-healing

2. STAGED ROLLOUTS
   ├── Canary first
   ├── Automated analysis
   ├── Progressive percentage increase
   └── Cross-cluster, cross-region

3. RELEASE TRAINS
   ├── Regular release cadence
   ├── Features either make the train or wait
   ├── Predictable release schedule
   └── Reduces urgency and risk

4. BINARY VS CONFIG
   ├── Binary releases: Weekly or less frequent
   ├── Config changes: Anytime
   ├── Most "releases" are config changes
   └── Reduces code deployment risk

5. CANARYING EVERYTHING
   ├── Code changes: Canary
   ├── Config changes: Canary
   ├── Capacity changes: Canary
   └── "Everything that can break should canary"

SRE Integration:
├── SRE team owns deployment tools
├── Error budget gates deployments
├── If SLO violated, deployment paused
└── Reliability is a gating criterion
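
The error-budget gate in the list above is straightforward to express in code. A minimal sketch, assuming you already track an SLO target and the measured success ratio over the SLO window; the field names and the 10% cutoff are illustrative, not Google's actual policy.

# deployment/error_budget_gate.py
#
# Sketch of an error-budget deployment gate. Field names and the 10%
# cutoff are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class SLOStatus:
    target: float     # e.g. 0.999 for a 99.9% availability objective
    measured: float   # measured success ratio over the SLO window


def error_budget_remaining(slo: SLOStatus) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    budget = 1.0 - slo.target    # allowed unreliability, e.g. 0.001 for 99.9%
    spent = 1.0 - slo.measured   # unreliability actually incurred
    if budget <= 0:
        return 0.0
    return max(0.0, 1.0 - spent / budget)


def deployment_allowed(slo: SLOStatus, min_budget_fraction: float = 0.1) -> bool:
    """Block non-emergency deploys when less than 10% of the budget remains."""
    return error_budget_remaining(slo) >= min_budget_fraction


if __name__ == "__main__":
    status = SLOStatus(target=0.999, measured=0.9994)     # 60% of the budget burned
    print("deploy allowed:", deployment_allowed(status))  # True: 40% of budget left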

Chapter 8: Common Mistakes

DEPLOYMENT ANTI-PATTERNS

❌ MISTAKE 1: Big Bang Deployments

Wrong:
  ├── Save up changes for months
  ├── Deploy everything at once
  └── Hope for the best

Problems:
  ├── Huge blast radius if something breaks
  ├── Hard to identify which change caused issues
  └── Rollback affects all changes

Right:
  ├── Deploy frequently (daily or more)
  ├── Small, incremental changes
  └── Easy to identify and roll back specific changes


❌ MISTAKE 2: No Rollback Plan

Wrong:
  "We'll figure out rollback if we need it"

Problems:
  ├── Panic when things break
  ├── Unclear process under pressure
  └── Longer outages

Right:
  ├── Test rollback BEFORE you need it
  ├── Document rollback procedure
  ├── Automate if possible
  └── Practice regularly


❌ MISTAKE 3: Database Changes with Code

Wrong:
  ├── Code and schema change in same deployment
  ├── Migration runs during deployment
  └── If either fails, both must be rolled back

Problems:
  ├── Complex rollback
  ├── Data migration might not be reversible
  └── Extended downtime

Right:
  ├── Separate database changes from code changes
  ├── Use expand-contract pattern
  ├── Database changes first, then code
  └── Each can roll back independently


❌ MISTAKE 4: Deploy on Friday

Wrong:
  "Let's ship this before the weekend!"

Problems:
  ├── Reduced staffing if issues
  ├── Tired people making decisions
  └── Customer impact over weekend

Right:
  ├── Deploy early in the week
  ├── Deploy early in the day
  ├── Have full team available for monitoring
  └── Friday deployments only for emergencies


❌ MISTAKE 5: No Monitoring During Deploy

Wrong:
  ├── Deploy and walk away
  ├── Assume success
  └── Find out about problems from customers

Right:
  ├── Watch dashboards during deploy
  ├── Have alerts set for deployment metrics
  ├── Bake time with active monitoring
  └── Only declare success after observation period

Part IV: Interview Preparation

Chapter 9: Interview Tips

9.1 Deployment Discussion Framework

DISCUSSING DEPLOYMENTS IN INTERVIEWS

When asked "How would you deploy changes to this system?":

1. ASSESS RISK
   "First, I'd assess the risk of this change.
   Is it a small bug fix or a major feature?
   Does it touch critical paths like payments?
   Does it require database changes?"

2. CHOOSE STRATEGY
   "Based on risk, I'd choose a deployment strategy:
   - Low risk: Rolling deployment with monitoring
   - Medium risk: Canary with automated analysis
   - High risk: Blue-green with feature flag"

3. EXPLAIN SAFETY MECHANISMS
   "I'd ensure safety through:
   - Health checks before declaring success
   - Metrics comparison (canary vs stable)
   - Automated rollback if thresholds exceeded"

4. ADDRESS DATABASE CHANGES
   "For database changes, I'd use expand-contract:
   - Phase 1: Add new column/table
   - Phase 2: Backfill data
   - Phase 3: Switch code to use new schema
   - Phase 4: Remove old schema"

5. DISCUSS ROLLBACK
   "If something goes wrong:
   - Blue-green: Instant switch back
   - Canary: Remove canary traffic
   - Feature flag: Turn off flag
   - Code: Roll back to previous version"

9.2 Key Phrases

DEPLOYMENT KEY PHRASES

On Strategy Selection:
"I match deployment strategy to risk.
For a new payment feature, I'd use canary with
aggressive monitoring. For a UI tweak,
a standard rolling deployment is fine."

On Canary:
"Canary lets me compare the new version against
the stable version in production with real traffic.
If error rate increases or latency degrades,
I roll back before most users are affected."

On Feature Flags:
"Feature flags decouple deployment from release.
I can deploy code anytime, then enable the feature
gradually. If something's wrong, turning off
the flag is instant — no deployment needed."

On Database Migrations:
"I never make breaking schema changes directly.
I use expand-contract: add the new structure,
migrate data, switch the code, then remove
the old structure. This way, both old and new
code work at every step."

On Rollback:
"The first thing I consider with any change is:
how do I undo this? If I can't answer that,
I need to rethink the approach. Rollback
should be fast and safe."

Chapter 10: Practice Problems

Problem 1: E-commerce Checkout Deployment

Scenario: You're deploying a new checkout flow. How would you do it safely?

Questions:

  1. What deployment strategy would you use?
  2. What metrics would you monitor?
  3. What's your rollback plan?

Answers:

  1. Strategy: Canary with feature flag

    • Deploy code with feature flag OFF
    • Enable for 1% of users, monitor
    • Gradually increase to 100%
  2. Metrics:

    • Checkout completion rate
    • Error rate
    • Payment success rate
    • Latency p99
    • Cart abandonment rate
  3. Rollback:

    • Turn off feature flag (instant)
    • If persistent issue, roll back code deployment
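
A minimal sketch of the flag-gated rollout in answer 1, using a deterministic hash so each user stays in the same cohort as the percentage increases. The flag class and both checkout functions are hypothetical stand-ins, not a vendor SDK or real checkout code.

# deployment/checkout_flag_rollout.py
#
# Sketch of percentage-based feature-flag gating for a new checkout flow.
# PercentageFlag and both checkout functions are illustrative stand-ins.

import hashlib


class PercentageFlag:
    """Buckets users deterministically so cohorts stay stable as the % increases."""

    def __init__(self, name: str, rollout_percent: int = 0):
        self.name = name
        self.rollout_percent = rollout_percent   # 0..100, raised gradually

    def is_enabled(self, user_id: str) -> bool:
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 100       # stable bucket in [0, 100)
        return bucket < self.rollout_percent


new_checkout = PercentageFlag("new_checkout_flow", rollout_percent=1)  # start at 1%


def new_checkout_flow(cart):
    return {"flow": "new", "items": cart}      # stand-in for the new path


def legacy_checkout_flow(cart):
    return {"flow": "legacy", "items": cart}   # old path stays live for instant rollback


def checkout(user_id: str, cart):
    if new_checkout.is_enabled(user_id):
        return new_checkout_flow(cart)
    return legacy_checkout_flow(cart)

Turning the feature off is just setting rollout_percent back to 0, which is exactly the instant rollback described in answer 3.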

Problem 2: Database Column Rename

Scenario: You need to rename a column from email to primary_email. The table has 10 million rows.

Questions:

  1. What's the safest approach?
  2. What are the migration phases?
  3. How do you handle the transition period?

Answers:

  1. Approach: Expand-Contract

    • Never directly rename a column
    • Three-phase migration
  2. Phases:

    • Expand: Add primary_email column
    • Migrate: Backfill data, switch code to use new column
    • Contract: Remove email column
  3. Transition:

    • Code writes to BOTH columns
    • Code reads from new column (with fallback to old)
    • Once all code updated, remove old column
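
For the transition period in answer 3, a minimal sketch of the dual-write, read-with-fallback pattern in application code. The column names match the scenario; db is a placeholder for your data-access layer, and the $1-style parameters assume an asyncpg-like driver.

# app/user_email_transition.py
#
# Sketch of dual-write / read-with-fallback during the email ->
# primary_email expand-contract migration. `db` is a placeholder for
# your data-access layer (asyncpg-style parameters assumed).

from typing import Optional


async def update_email(db, user_id: int, email: str) -> None:
    # Transition code writes BOTH columns so old and new code stay consistent.
    await db.execute(
        "UPDATE users SET email = $1, primary_email = $1 WHERE id = $2",
        email, user_id,
    )


async def get_email(db, user_id: int) -> Optional[str]:
    # Read the new column first and fall back to the old one until the
    # backfill completes; COALESCE does the fallback in a single query.
    row = await db.fetchrow(
        "SELECT COALESCE(primary_email, email) AS email FROM users WHERE id = $1",
        user_id,
    )
    return row["email"] if row else None

Once every reader uses primary_email and the backfill post-check passes, the contract phase can safely drop the old email column.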

Chapter 11: Sample Interview Dialogue

Interviewer: "You have a new feature that changes how we calculate shipping costs. How would you deploy it?"

You: "Let me think through this systematically. First, let me ask a few questions:

  • How critical is shipping cost accuracy? Are mistakes costly?
  • Is there a database component to this change?
  • How much traffic does this code path see?"

Interviewer: "It's a high-volume checkout flow. Mistakes in shipping costs directly impact revenue. There's a new table for shipping zones."

You: "Given the high risk, I'd use a careful approach:

Phase 1: Database changes first"

├── Create new shipping_zones table
├── Populate with data
├── No code changes yet
└── Verify data is correct

"Phase 2: Deploy code behind feature flag"

├── Deploy new calculation code with flag OFF
├── Both old and new code paths exist
└── No user impact yet

"Phase 3: Shadow testing"

├── Run new calculation alongside old
├── Log both results, compare
├── Look for discrepancies
└── Fix any issues found

"Phase 4: Gradual rollout"

├── Enable for internal users first
├── Then 1% of customers
├── Monitor:
│   ├── Shipping cost accuracy
│   ├── Checkout completion rate
│   ├── Customer complaints
│   └── Revenue metrics
└── Gradually increase to 100%

Interviewer: "What if you find a bug at 10% rollout?"

You: "I'd turn off the feature flag immediately. Users would get the old calculation. Since the old code is still there, this is instant.

Then I'd analyze the bug, fix it, and start the rollout again from shadow testing.

The beauty of feature flags is that rollback is just a configuration change — no deployment needed."

Interviewer: "Good. How would you handle a bug in the database changes?"

You: "That's trickier. Database changes are harder to roll back.

For the shipping_zones table:

  • It's additive (new table), so it doesn't break existing code
  • If the data is wrong, I can update it without schema changes
  • If the schema itself is wrong, I'd create a new table with the correct schema, migrate data, then drop the old one

The key is making database changes backward compatible. I never put code that depends on the new schema in the same release as the schema change."
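
The shadow-testing phase in that answer is easy to sketch: compute both results, always serve the old one, and log any mismatch. Both calculator functions below are placeholders for the real old and new shipping-cost logic.

# shipping/shadow_compare.py
#
# Sketch of shadow testing for a new shipping-cost calculation. Both
# calculator functions are placeholders.

import logging

logger = logging.getLogger("shipping.shadow")


def old_shipping_cost(order) -> float:
    return 5.99   # stand-in for the current calculation


def new_shipping_cost(order) -> float:
    return 5.99   # stand-in for the new zone-based calculation


def shipping_cost_with_shadow(order) -> float:
    """Always serve the OLD result; run the new calculation only to compare."""
    old_cost = old_shipping_cost(order)
    try:
        new_cost = new_shipping_cost(order)
        if abs(new_cost - old_cost) > 0.01:   # more than a cent apart
            logger.warning(
                "shadow mismatch: old=%.2f new=%.2f order=%s",
                old_cost, new_cost, getattr(order, "id", order),
            )
    except Exception:
        # A bug in the shadow path must never break checkout.
        logger.exception("shadow calculation failed")
    return old_cost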


Summary

┌────────────────────────────────────────────────────────────────────────┐
│                    DAY 3 KEY TAKEAWAYS                                 │
│                                                                        │
│  DEPLOYMENT STRATEGIES:                                                │
│  ├── Rolling: Gradual replacement, no downtime                         │
│  ├── Blue-Green: Instant switch, instant rollback                      │
│  ├── Canary: Gradual traffic shift with comparison                     │
│  └── Feature Flags: Decouple deployment from release                   │
│                                                                        │
│  SAFETY MECHANISMS:                                                    │
│  ├── Health checks before proceeding                                   │
│  ├── Metric comparison (canary vs stable)                              │
│  ├── Automated rollback on threshold breach                            │
│  └── Observation periods between stages                                │
│                                                                        │
│  DATABASE MIGRATIONS:                                                  │
│  ├── Use expand-contract pattern                                       │
│  ├── Never make breaking changes directly                              │
│  ├── Separate schema changes from code changes                         │
│  └── Both old and new code must work during transition                 │
│                                                                        │
│  ROLLBACK:                                                             │
│  ├── Code: Deploy previous version (fast)                              │
│  ├── Feature: Turn off flag (instant)                                  │
│  ├── Config: Revert configuration (fast)                               │
│  └── Data: Restore from backup (slow, avoid)                           │
│                                                                        │
│  BEST PRACTICES:                                                       │
│  ├── Deploy frequently (reduces batch size)                            │
│  ├── Deploy early in the week (more support)                           │
│  ├── Monitor during deployment                                         │
│  ├── Have a rollback plan before deploying                             │
│  └── Automate everything possible                                      │
│                                                                        │
│  KEY INSIGHT:                                                          │
│  The safest deployment is a small, incremental one                     │
│  with automated monitoring and fast rollback.                          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Further Reading

Books:

  • "Continuous Delivery" by Jez Humble & David Farley
  • "Accelerate" by Nicole Forsgren, Jez Humble, Gene Kim

Articles:

  • Martin Fowler: "Blue Green Deployment"
  • Google SRE Book: Chapter on Release Engineering
  • AWS: "Blue/Green Deployments on AWS"

Tools:

  • Kubernetes: Rolling updates, readiness probes
  • Istio/Linkerd: Traffic shifting for canary
  • LaunchDarkly/Split: Feature flag management
  • Argo Rollouts: Progressive delivery for Kubernetes

End of Day 3: Deployment Strategies

Tomorrow: Day 4 — Capacity Planning. You can deploy safely. But how do you know you have enough capacity? How do you prepare for traffic spikes? How do you scale before you need to?


You now have the trifecta: Define health (SLOs). See health (observability). Maintain health through change (deployments). Tomorrow, we ensure health under load.