Week 10 — Day 3: Deployment Strategies
System Design Mastery Series — Production Readiness and Operational Excellence
Preface
You can define what healthy means (SLOs). You can see if you're healthy (observability).
Now: how do you ship changes without breaking that health?
THE DEPLOYMENT PARADOX
To improve a system, you must change it.
To keep a system stable, you must not change it.
Every deployment is a risk:
├── New code might have bugs
├── New config might be wrong
├── New dependencies might fail
├── New scale might break assumptions
└── Humans might make mistakes
Yet we MUST deploy:
├── Fix bugs
├── Add features
├── Patch security
├── Improve performance
└── Stay competitive
THE SOLUTION:
Deploy frequently, but deploy SAFELY.
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ DEPLOYMENT PHILOSOPHY │
│ │
│ OLD WAY: │
│ ├── Deploy monthly (or quarterly!) │
│ ├── Big bang releases │
│ ├── "Change freeze" before releases │
│ ├── All-hands-on-deck deployment days │
│ └── Hope nothing breaks │
│ │
│ MODERN WAY: │
│ ├── Deploy daily (or more!) │
│ ├── Small, incremental changes │
│ ├── Automated pipelines │
│ ├── Gradual rollouts with monitoring │
│ └── Automatic rollback if problems detected │
│ │
│ The safest deployment is a small one. │
│ The safest rollback is a fast one. │
│ │
└────────────────────────────────────────────────────────────────────────┘
Today, we learn to ship changes without fear.
Part I: Foundations
Chapter 1: Deployment Strategies Overview
1.1 The Deployment Strategy Spectrum
DEPLOYMENT STRATEGIES
┌───────────────────────────────────────────────────────────────────────┐
│ │
│ RISK ◄─────────────────────────────────────────────────────► SAFETY │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Big │ │ Rolling │ │ Blue- │ │ Canary │ │ Feature │ │
│ │ Bang │ │ │ │ Green │ │ │ │ Flags │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ All at once Gradual Instant Gradual Code live, │
│ High risk replacement Switchover % rollout not active │
│ Fast Medium risk Fast rollback Safest Decoupled │
│ │
└───────────────────────────────────────────────────────────────────────┘
STRATEGY COMPARISON:
│ Strategy │ Downtime │ Rollback │ Resource Cost │ Complexity │
│─────────────│──────────│─────────────│───────────────│────────────│
│ Big Bang │ Yes │ Full redeploy│ 1x │ Low │
│ Rolling │ No │ Slow │ 1x + buffer │ Medium │
│ Blue-Green │ No │ Instant │ 2x │ Medium │
│ Canary │ No │ Fast │ 1x + small │ High │
│ Feature Flag│ No │ Instant │ 1x │ High │
1.2 Big Bang Deployment (Don't Do This)
BIG BANG DEPLOYMENT
What it is:
├── Stop old version
├── Deploy new version
├── Start new version
└── Hope it works
Timeline:
════════════════╦═══════════╦════════════════════════════
  v1.0 Running  ║ DOWNTIME  ║        v1.1 Running
════════════════╩═══════════╩════════════════════════════
                ↑           ↑
            Stop v1.0   Start v1.1
Problems:
├── Downtime during deployment
├── All users hit new version at once
├── If broken, ALL users affected
├── Rollback requires another deployment
└── High stress, high risk
When it's acceptable:
├── Development environment
├── Non-critical internal tools
├── Scheduled maintenance windows
└── When no other option is feasible
For production services: AVOID.
1.3 Rolling Deployment
ROLLING DEPLOYMENT
What it is:
├── Gradually replace old instances with new
├── One (or few) at a time
├── Health checks before proceeding
└── No downtime
Timeline:
Instance 1: ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Instance 2: ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░
Instance 3: ████████████████████████████░░░░░░░░░░░░░░░░
Instance 4: ████████████████████████████████████░░░░░░░░
Instance 5: ████████████████████████████████████████████
═══════════════════════════════════════════►
████ = v1.0 ░░░░ = v1.1
Traffic distribution during rollout:
Start: 100% v1.0, 0% v1.1
25% done: 75% v1.0, 25% v1.1
50% done: 50% v1.0, 50% v1.1
75% done: 25% v1.0, 75% v1.1
Complete: 0% v1.0, 100% v1.1
Advantages:
├── No downtime
├── Gradual risk exposure
├── Can pause if issues detected
└── Natural load balancing
Disadvantages:
├── Two versions running simultaneously
├── Rollback is slow (reverse the process)
├── Database must support both versions
└── API must be backward compatible
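The replacement loop above can be sketched in a few lines. This is a minimal, synchronous sketch with hypothetical in-memory "instances" and a caller-supplied health check; a real orchestrator (Kubernetes, for example) performs this loop for you.

```python
def rolling_deploy(instances, new_version, health_check, batch_size=1):
    """Replace instances batch by batch; pause at the first failed check."""
    replaced = []
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        for inst in batch:
            inst["version"] = new_version
        if not all(health_check(inst) for inst in batch):
            # Pause the rollout: the remaining old instances keep serving.
            return {"complete": False, "replaced": replaced}
        replaced.extend(inst["id"] for inst in batch)
    return {"complete": True, "replaced": replaced}


cluster = [{"id": n, "version": "v1.0"} for n in range(5)]
result = rolling_deploy(cluster, "v1.1", health_check=lambda inst: True)
# All five instances now run v1.1; a failing check would have
# stopped the rollout part-way, leaving a mixed fleet.
```

Note what the failure case implies: the fleet is left running two versions, which is exactly why rolling deployments require backward-compatible APIs and schemas.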
Chapter 2: Blue-Green Deployment
2.1 How Blue-Green Works
BLUE-GREEN DEPLOYMENT
Architecture:
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ Load Balancer │
│ │ │
│ │ (switches between) │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ BLUE │ │ GREEN │ │
│ │ (Current) │ │ (New) │ │
│ │ │ │ │ │
│ │ ┌───────┐ │ │ ┌───────┐ │ │
│ │ │ App 1 │ │ │ │ App 1 │ │ │
│ │ ├───────┤ │ │ ├───────┤ │ │
│ │ │ App 2 │ │ │ │ App 2 │ │ │
│ │ ├───────┤ │ │ ├───────┤ │ │
│ │ │ App 3 │ │ │ │ App 3 │ │ │
│ │ └───────┘ │ │ └───────┘ │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ └───────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ Shared Database │
│ │
└────────────────────────────────────────────────────────────────────────┘
Deployment Process:
1. CURRENT STATE
├── Blue: Running v1.0, receiving traffic
└── Green: Idle (or running old version)
2. DEPLOY NEW VERSION
├── Blue: Still running v1.0, receiving traffic
└── Green: Deploy v1.1, run smoke tests
3. SWITCH TRAFFIC
├── Blue: Running v1.0, NO traffic
└── Green: Running v1.1, receiving ALL traffic
4. VERIFY
├── Monitor metrics, logs, errors
└── If problems: Switch back to Blue (instant rollback)
5. CLEANUP
└── Blue becomes the standby for next deployment
2.2 Blue-Green Implementation
# deployment/blue_green.py
"""
Blue-Green deployment orchestration.
This manages the switch between environments
and provides instant rollback capability.
"""
import asyncio
import logging
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional

class DeploymentError(Exception):
    """Raised when a deployment or verification step fails."""

@dataclass
class TestResult:
    """Outcome of smoke tests against a cluster."""
    passed: bool
    failures: List[str] = field(default_factory=list)

@dataclass
class VerificationResult:
    """Outcome of post-switch production verification."""
    success: bool
    reason: Optional[str] = None
logger = logging.getLogger(__name__)
class Environment(Enum):
BLUE = "blue"
GREEN = "green"
class DeploymentState(Enum):
IDLE = "idle"
DEPLOYING = "deploying"
TESTING = "testing"
SWITCHING = "switching"
VERIFYING = "verifying"
COMPLETE = "complete"
ROLLED_BACK = "rolled_back"
@dataclass
class DeploymentRecord:
"""Record of a deployment."""
id: str
version: str
started_at: datetime
completed_at: Optional[datetime]
state: DeploymentState
active_environment: Environment
previous_environment: Environment
deployed_by: str
rollback_reason: Optional[str] = None
class BlueGreenDeployer:
"""
Orchestrates blue-green deployments.
"""
def __init__(
self,
load_balancer,
blue_cluster,
green_cluster,
health_checker,
metrics_client
):
self.lb = load_balancer
self.blue = blue_cluster
self.green = green_cluster
self.health = health_checker
self.metrics = metrics_client
self.current_environment = Environment.BLUE  # Assume blue is live initially
self.current_deployment = None
async def deploy(
self,
version: str,
deployed_by: str,
skip_tests: bool = False
) -> DeploymentRecord:
"""
Execute a blue-green deployment.
"""
# Determine which environment to deploy to
if self.current_environment == Environment.BLUE:
target = Environment.GREEN
target_cluster = self.green
else:
target = Environment.BLUE
target_cluster = self.blue
deployment = DeploymentRecord(
id=str(uuid.uuid4()),
version=version,
started_at=datetime.utcnow(),
completed_at=None,
state=DeploymentState.DEPLOYING,
active_environment=self.current_environment,
previous_environment=self.current_environment,
deployed_by=deployed_by
)
try:
# Step 1: Deploy to inactive environment
logger.info(f"Deploying {version} to {target.value}")
deployment.state = DeploymentState.DEPLOYING
await target_cluster.deploy(version)
# Step 2: Run health checks and tests
logger.info(f"Running health checks on {target.value}")
deployment.state = DeploymentState.TESTING
if not skip_tests:
health_result = await self.health.check_cluster(target_cluster)
if not health_result.healthy:
raise DeploymentError(f"Health check failed: {health_result.reason}")
# Run smoke tests
test_result = await self._run_smoke_tests(target_cluster)
if not test_result.passed:
raise DeploymentError(f"Smoke tests failed: {test_result.failures}")
# Step 3: Switch traffic
logger.info(f"Switching traffic to {target.value}")
deployment.state = DeploymentState.SWITCHING
await self.lb.switch_to(target)
# Step 4: Verify in production
logger.info("Verifying deployment")
deployment.state = DeploymentState.VERIFYING
# Wait and check metrics
await asyncio.sleep(30) # Give it 30 seconds
verification = await self._verify_deployment(version)
if not verification.success:
logger.warning(f"Verification failed: {verification.reason}")
await self.rollback(deployment, verification.reason)
return deployment
# Step 5: Complete
deployment.state = DeploymentState.COMPLETE
deployment.completed_at = datetime.utcnow()
deployment.active_environment = target
self.current_environment = target
self.current_deployment = deployment
logger.info(f"Deployment {deployment.id} complete")
return deployment
except Exception as e:
logger.error(f"Deployment failed: {e}")
await self.rollback(deployment, str(e))
raise
async def rollback(
self,
deployment: DeploymentRecord,
reason: str
):
"""
Rollback to previous environment.
This is INSTANT because the old environment is still running.
"""
logger.warning(f"Rolling back deployment {deployment.id}: {reason}")
deployment.state = DeploymentState.ROLLED_BACK
deployment.rollback_reason = reason
deployment.completed_at = datetime.utcnow()
# Switch traffic back to previous environment
await self.lb.switch_to(deployment.previous_environment)
# Record rollback metric
self.metrics.increment(
"deployments_rolled_back_total",
labels={"version": deployment.version, "reason": reason}
)
logger.info(f"Rollback complete. Traffic restored to {deployment.previous_environment.value}")
async def _run_smoke_tests(self, cluster) -> TestResult:
"""Run smoke tests against a cluster."""
tests = [
self._test_health_endpoint(cluster),
self._test_api_endpoint(cluster),
self._test_database_connectivity(cluster),
]
results = await asyncio.gather(*tests, return_exceptions=True)
failures = [r for r in results if isinstance(r, Exception) or not r.passed]
return TestResult(
passed=len(failures) == 0,
failures=[str(f) for f in failures]
)
async def _verify_deployment(self, version: str) -> VerificationResult:
"""Verify deployment is healthy in production."""
# Check error rate
error_rate = await self.metrics.query(
'sum(rate(http_requests_total{status=~"5.."}[1m])) / '
'sum(rate(http_requests_total[1m]))'
)
if error_rate > 0.01: # > 1% error rate
return VerificationResult(
success=False,
reason=f"Error rate too high: {error_rate:.2%}"
)
# Check latency
p99_latency = await self.metrics.query(
'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))'
)
if p99_latency > 1.0: # > 1 second
return VerificationResult(
success=False,
reason=f"Latency too high: {p99_latency:.2f}s"
)
return VerificationResult(success=True, reason=None)
2.3 Blue-Green Pros and Cons
BLUE-GREEN: WHEN TO USE
✅ ADVANTAGES:
├── Instant rollback (just switch back)
├── Full testing before switch
├── Zero downtime
├── Clean separation of environments
└── Easy to understand
❌ DISADVANTAGES:
├── Double infrastructure cost
├── Database must support both versions
├── Long-running connections may be disrupted
├── Not gradual (all-or-nothing switch)
└── Requires careful state management
BEST FOR:
├── Applications where instant rollback is critical
├── Teams with mature CI/CD pipelines
├── Services with clear version boundaries
└── When you can afford 2x infrastructure
NOT IDEAL FOR:
├── Services with lots of persistent connections
├── Applications with complex database migrations
├── Extremely high-traffic services (state issues)
└── Teams without automation
Chapter 3: Canary Deployment
3.1 How Canary Works
CANARY DEPLOYMENT
Named after "canary in a coal mine" — test with a small group first.
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ STAGE 1: CANARY (1% traffic) │
│ │
│ Load Balancer │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ │ │
│ ▼ 99% ▼ 1% │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ STABLE │ │ CANARY │ │
│ │ v1.0 │ │ v1.1 │ │
│ │ 10 pods │ │ 1 pod │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Monitor for 15 minutes: │
│ ├── Compare error rate: canary vs stable │
│ ├── Compare latency: canary vs stable │
│ └── Check business metrics │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ STAGE 2: EXPAND (10% traffic) │
│ │
│ ┌──────────┴──────────┐ │
│ │ │ │
│ ▼ 90% ▼ 10% │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ STABLE │ │ CANARY │ │
│ │ v1.0 │ │ v1.1 │ │
│ │ 9 pods │ │ 2 pods │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ STAGE 3: EXPAND (50% traffic) │
│ │
│ ┌──────────┴──────────┐ │
│ │ │ │
│ ▼ 50% ▼ 50% │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ STABLE │ │ CANARY │ │
│ │ v1.0 │ │ v1.1 │ │
│ │ 5 pods │ │ 5 pods │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ STAGE 4: COMPLETE (100% traffic) │
│ │
│ │ │
│ ▼ 100% │
│ ┌─────────────┐ │
│ │ STABLE │ │
│ │ v1.1 │ │
│ │ 10 pods │ │
│ └─────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
3.2 Canary Implementation
# deployment/canary.py
"""
Canary deployment with automated analysis.
Gradually rolls out new version while comparing
metrics against the stable version.
"""
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime, timedelta
from enum import Enum
import asyncio
import logging
logger = logging.getLogger(__name__)
@dataclass
class CanaryStage:
"""A stage in the canary rollout."""
name: str
traffic_percentage: int
duration_minutes: int
auto_promote: bool = True # Auto-promote if metrics pass
@dataclass
class CanaryConfig:
"""Configuration for canary deployment."""
stages: List[CanaryStage] = field(default_factory=lambda: [
CanaryStage("canary-1pct", 1, 15),
CanaryStage("canary-10pct", 10, 15),
CanaryStage("canary-50pct", 50, 15),
CanaryStage("full-rollout", 100, 0),
])
# Thresholds for automated analysis
max_error_rate_increase: float = 0.01 # 1% absolute increase
max_latency_increase_pct: float = 0.10 # 10% relative increase
min_request_count: int = 100 # Minimum requests before analysis
@dataclass
class CanaryAnalysis:
"""Result of canary analysis."""
passed: bool
stable_error_rate: float
canary_error_rate: float
stable_p99_latency: float
canary_p99_latency: float
request_count: int
reason: Optional[str] = None
@dataclass
class CanaryResult:
    """Overall result of a canary deployment."""
    version: str
    started_at: datetime
    stages_completed: List[str] = field(default_factory=list)
    completed_at: Optional[datetime] = None
    success: bool = False
    failed: bool = False
    failure_reason: Optional[str] = None
    failure_stage: Optional[str] = None
class CanaryDeployer:
"""
Orchestrates canary deployments with automated analysis.
"""
def __init__(
self,
kubernetes_client,
metrics_client,
traffic_manager,
config: Optional[CanaryConfig] = None
):
self.k8s = kubernetes_client
self.metrics = metrics_client
self.traffic = traffic_manager
self.config = config or CanaryConfig()
async def deploy(
self,
version: str,
deployment_name: str,
namespace: str = "default"
) -> CanaryResult:
"""
Execute a canary deployment.
"""
logger.info(f"Starting canary deployment of {version}")
result = CanaryResult(
version=version,
started_at=datetime.utcnow(),
stages_completed=[]
)
try:
# Deploy canary pods (initially with 0% traffic)
await self._deploy_canary_pods(version, deployment_name, namespace)
# Progress through stages
for stage in self.config.stages:
logger.info(f"Entering stage: {stage.name} ({stage.traffic_percentage}%)")
# Shift traffic
await self.traffic.set_canary_weight(
deployment_name,
stage.traffic_percentage
)
# Wait for stage duration
if stage.duration_minutes > 0:
await asyncio.sleep(stage.duration_minutes * 60)
# Analyze metrics
analysis = await self._analyze_canary(deployment_name)
if not analysis.passed:
logger.warning(f"Canary failed at {stage.name}: {analysis.reason}")
await self._rollback(deployment_name, namespace)
result.failed = True
result.failure_reason = analysis.reason
result.failure_stage = stage.name
return result
logger.info(f"Stage {stage.name} passed analysis")
result.stages_completed.append(stage.name)
# Promote canary to stable
await self._promote_canary(deployment_name, namespace)
result.completed_at = datetime.utcnow()
result.success = True
logger.info(f"Canary deployment complete: {version}")
return result
except Exception as e:
logger.error(f"Canary deployment failed: {e}")
await self._rollback(deployment_name, namespace)
result.failed = True
result.failure_reason = str(e)
return result
async def _analyze_canary(self, deployment_name: str) -> CanaryAnalysis:
"""
Compare canary metrics against stable.
"""
# Get error rates
stable_errors = await self.metrics.query(f'''
sum(rate(http_requests_total{{
deployment="{deployment_name}",
version="stable",
status=~"5.."
}}[5m])) /
sum(rate(http_requests_total{{
deployment="{deployment_name}",
version="stable"
}}[5m]))
''')
canary_errors = await self.metrics.query(f'''
sum(rate(http_requests_total{{
deployment="{deployment_name}",
version="canary",
status=~"5.."
}}[5m])) /
sum(rate(http_requests_total{{
deployment="{deployment_name}",
version="canary"
}}[5m]))
''')
# Get latencies
stable_latency = await self.metrics.query(f'''
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{{
deployment="{deployment_name}",
version="stable"
}}[5m])) by (le)
)
''')
canary_latency = await self.metrics.query(f'''
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{{
deployment="{deployment_name}",
version="canary"
}}[5m])) by (le)
)
''')
# Get request count
request_count = await self.metrics.query(f'''
sum(increase(http_requests_total{{
deployment="{deployment_name}",
version="canary"
}}[5m]))
''')
# Check thresholds
analysis = CanaryAnalysis(
passed=True,
stable_error_rate=stable_errors,
canary_error_rate=canary_errors,
stable_p99_latency=stable_latency,
canary_p99_latency=canary_latency,
request_count=int(request_count)
)
# Not enough traffic to analyze
if analysis.request_count < self.config.min_request_count:
logger.warning(f"Not enough requests for analysis: {analysis.request_count}")
return analysis # Pass by default if not enough data
# Check error rate
error_increase = canary_errors - stable_errors
if error_increase > self.config.max_error_rate_increase:
analysis.passed = False
analysis.reason = (
f"Error rate increased by {error_increase:.2%} "
f"(threshold: {self.config.max_error_rate_increase:.2%})"
)
return analysis
# Check latency
if stable_latency > 0:
latency_increase = (canary_latency - stable_latency) / stable_latency
if latency_increase > self.config.max_latency_increase_pct:
analysis.passed = False
analysis.reason = (
f"Latency increased by {latency_increase:.2%} "
f"(threshold: {self.config.max_latency_increase_pct:.2%})"
)
return analysis
return analysis
async def _rollback(self, deployment_name: str, namespace: str):
"""Rollback canary deployment."""
logger.info(f"Rolling back canary for {deployment_name}")
# Remove canary traffic
await self.traffic.set_canary_weight(deployment_name, 0)
# Delete canary pods
await self.k8s.delete_canary_deployment(deployment_name, namespace)
logger.info("Rollback complete")
async def _promote_canary(self, deployment_name: str, namespace: str):
"""Promote canary to stable."""
logger.info(f"Promoting canary to stable for {deployment_name}")
# Update stable deployment with canary version
await self.k8s.promote_canary(deployment_name, namespace)
# Remove canary traffic split
await self.traffic.remove_canary_route(deployment_name)
logger.info("Promotion complete")
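The pass/fail decision inside _analyze_canary can be isolated from the metrics client. This standalone sketch keeps only the threshold logic; the defaults mirror CanaryConfig above, and the function name is an assumption for illustration.

```python
def judge_canary(stable_err, canary_err, stable_p99, canary_p99, requests,
                 max_err_increase=0.01, max_latency_increase=0.10,
                 min_requests=100):
    """Return (passed, reason) for one canary analysis round."""
    if requests < min_requests:
        # Too little traffic to judge; pass by default, as _analyze_canary does.
        return True, "insufficient traffic"
    if canary_err - stable_err > max_err_increase:
        return False, "error rate regression"
    if stable_p99 > 0:
        increase = (canary_p99 - stable_p99) / stable_p99
        if increase > max_latency_increase:
            return False, "latency regression"
    return True, "within thresholds"
```

A production system would run several rounds and use statistical comparison rather than a single-point check, since one 5-minute window is noisy; the single-round version above is only the skeleton of that decision.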
3.3 Canary vs Blue-Green
CANARY VS BLUE-GREEN COMPARISON
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ ASPECT │ BLUE-GREEN │ CANARY │
│ ────────────────────┼─────────────────────┼───────────────────────── │
│ Traffic switch │ All at once (100%) │ Gradual (1% → 100%) │
│ Risk exposure │ All users at once │ Limited users first │
│ Rollback speed │ Instant │ Fast (seconds) │
│ Infrastructure cost │ 2x during deploy │ 1x + small overhead │
│ Complexity │ Medium │ High │
│ Comparison testing │ Before switch only │ Side-by-side continuous │
│ Best for │ Quick full cutover │ Gradual risk reduction │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ WHEN TO USE BLUE-GREEN: │
│ ├── Need instant rollback capability │
│ ├── Changes are well-tested │
│ ├── Want clean environment separation │
│ └── Can afford 2x infrastructure │
│ │
│ WHEN TO USE CANARY: │
│ ├── Want to minimize blast radius │
│ ├── Need production comparison metrics │
│ ├── High-traffic services (can't risk all users) │
│ └── Automated analysis/rollback desired │
│ │
│ ADVANCED: COMBINE THEM │
│ ├── Blue-Green for environments │
│ ├── Canary for traffic within an environment │
│ └── Best of both worlds │
│ │
└────────────────────────────────────────────────────────────────────────┘
Chapter 4: Feature Flags
4.1 Feature Flags Concept
FEATURE FLAGS
Decouple DEPLOYMENT from RELEASE.
Traditional:
Deploy code → Feature is live
Problem: Deploy is risky because features go live immediately
With Feature Flags:
Deploy code → Feature is OFF
Turn on flag → Feature is live
Benefit: Deploy anytime, release when ready
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ FEATURE FLAG TYPES │
│ │
│ 1. RELEASE FLAGS │
│ Control feature rollout │
│ ├── New checkout flow: OFF │
│ ├── Enable for 10% of users │
│ ├── Enable for beta users │
│ └── Enable for everyone │
│ │
│ 2. EXPERIMENT FLAGS │
│ A/B testing │
│ ├── Button color: "blue" vs "green" │
│ ├── Pricing display: "monthly" vs "annual" │
│ └── Measure conversion rates │
│ │
│ 3. OPS FLAGS │
│ Operational controls │
│ ├── Enable/disable expensive feature under load │
│ ├── Circuit breaker for external services │
│ └── Kill switch for problematic code │
│ │
│ 4. PERMISSION FLAGS │
│ User-specific features │
│ ├── Premium features for paid users │
│ ├── Beta access for selected users │
│ └── Admin features │
│ │
└────────────────────────────────────────────────────────────────────────┘
4.2 Feature Flag Implementation
# deployment/feature_flags.py
"""
Feature flag system for controlled rollouts.
Supports:
- Boolean flags (on/off)
- Percentage rollouts
- User targeting
- Environment targeting
"""
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional
from enum import Enum
import hashlib
import logging
logger = logging.getLogger(__name__)
class RolloutStrategy(Enum):
ALL = "all" # Flag applies to everyone
NONE = "none" # Flag applies to no one
PERCENTAGE = "percentage" # Random percentage of users
USER_LIST = "user_list" # Specific users
ATTRIBUTE = "attribute" # Based on user attributes
@dataclass
class FeatureFlag:
"""A feature flag definition."""
key: str
name: str
description: str
# Targeting
strategy: RolloutStrategy
percentage: int = 0 # For PERCENTAGE strategy
user_list: List[str] = field(default_factory=list) # For USER_LIST
attribute_rules: List[Dict] = field(default_factory=list) # For ATTRIBUTE
# Metadata
owner: str = ""
created_at: Optional[datetime] = None
expires_at: Optional[datetime] = None
# Default value
default_value: bool = False
# Variants (for A/B tests)
variants: Dict[str, Any] = field(default_factory=dict)
@dataclass
class EvaluationContext:
"""Context for evaluating a feature flag."""
user_id: Optional[str] = None
tenant_id: Optional[str] = None
environment: str = "production"
attributes: Dict[str, Any] = field(default_factory=dict)
class FeatureFlagService:
"""
Feature flag evaluation service.
"""
def __init__(self, flag_store, metrics_client):
self.store = flag_store
self.metrics = metrics_client
self._cache = {}
def is_enabled(
self,
flag_key: str,
context: EvaluationContext = None
) -> bool:
"""
Check if a feature flag is enabled.
Usage:
if feature_flags.is_enabled("new_checkout", context):
return new_checkout_flow()
else:
return old_checkout_flow()
"""
context = context or EvaluationContext()
try:
flag = self._get_flag(flag_key)
if flag is None:
logger.warning(f"Unknown flag: {flag_key}")
return False
# Check if expired
if flag.expires_at and datetime.utcnow() > flag.expires_at:
return flag.default_value
# Evaluate based on strategy
result = self._evaluate(flag, context)
# Record metric
self.metrics.increment(
"feature_flag_evaluation_total",
labels={
"flag": flag_key,
"result": str(result),
"strategy": flag.strategy.value
}
)
return result
except Exception as e:
logger.error(f"Error evaluating flag {flag_key}: {e}")
return False
def get_variant(
self,
flag_key: str,
context: EvaluationContext = None
) -> Optional[str]:
"""
Get variant for A/B test flags.
Usage:
variant = feature_flags.get_variant("button_color", context)
# Returns "blue" or "green" based on user bucketing
"""
context = context or EvaluationContext()
flag = self._get_flag(flag_key)
if flag is None or not flag.variants:
return None
# Consistent hashing to assign user to variant
bucket = self._get_bucket(flag_key, context.user_id)
# Distribute buckets across variants
cumulative = 0
for variant_name, percentage in flag.variants.items():
cumulative += percentage
if bucket < cumulative:
return variant_name
return list(flag.variants.keys())[0] # Default to first variant
def _evaluate(self, flag: FeatureFlag, context: EvaluationContext) -> bool:
"""Evaluate flag based on its strategy."""
if flag.strategy == RolloutStrategy.ALL:
return True
if flag.strategy == RolloutStrategy.NONE:
return False
if flag.strategy == RolloutStrategy.PERCENTAGE:
return self._evaluate_percentage(flag, context)
if flag.strategy == RolloutStrategy.USER_LIST:
return context.user_id in flag.user_list
if flag.strategy == RolloutStrategy.ATTRIBUTE:
return self._evaluate_attributes(flag, context)
return flag.default_value
def _evaluate_percentage(
self,
flag: FeatureFlag,
context: EvaluationContext
) -> bool:
"""
Percentage-based evaluation with consistent hashing.
Same user always gets same result for same flag.
"""
bucket = self._get_bucket(flag.key, context.user_id)
return bucket < flag.percentage
def _get_bucket(self, flag_key: str, user_id: Optional[str]) -> int:
"""
Get consistent bucket (0-99) for user and flag.
Same user + flag = same bucket, always.
"""
if not user_id:
# Random for anonymous users
import random
return random.randint(0, 99)
# Hash user_id + flag_key for consistent bucketing
hash_input = f"{flag_key}:{user_id}"
hash_value = hashlib.md5(hash_input.encode()).hexdigest()
return int(hash_value[:8], 16) % 100
def _evaluate_attributes(
self,
flag: FeatureFlag,
context: EvaluationContext
) -> bool:
"""Evaluate attribute-based rules."""
for rule in flag.attribute_rules:
attribute = rule.get("attribute")
operator = rule.get("operator")
value = rule.get("value")
user_value = context.attributes.get(attribute)
if user_value is None:
    continue  # Missing attribute: this rule cannot match
if operator == "equals" and user_value == value:
return True
if operator == "in" and user_value in value:
return True
if operator == "greater_than" and user_value > value:
return True
return flag.default_value
def _get_flag(self, flag_key: str) -> Optional[FeatureFlag]:
"""Get flag from store with caching."""
if flag_key in self._cache:
return self._cache[flag_key]
flag = self.store.get(flag_key)
if flag:
self._cache[flag_key] = flag
return flag
# =============================================================================
# USAGE EXAMPLES
# =============================================================================
# Initialize
feature_flags = FeatureFlagService(flag_store, metrics)
# Simple boolean check
if feature_flags.is_enabled("new_checkout"):
process_with_new_checkout()
else:
process_with_old_checkout()
# With user context
context = EvaluationContext(
user_id="user-123",
tenant_id="tenant-456",
attributes={
"plan": "enterprise",
"country": "US"
}
)
if feature_flags.is_enabled("advanced_analytics", context):
show_advanced_analytics()
# A/B test
button_color = feature_flags.get_variant("button_color_test", context)
render_button(color=button_color)
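Both the percentage and variant strategies rest on the consistent hashing in _get_bucket. A standalone version of that function makes the key property easy to see: the same (flag, user) pair always lands in the same bucket, so a user never flickers in and out of a rollout, and widening the percentage only ever adds users.

```python
import hashlib

def bucket(flag_key: str, user_id: str) -> int:
    """Deterministic bucket in [0, 100) for a (flag, user) pair."""
    digest = hashlib.md5(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100

# Same inputs, same bucket -- evaluation is stable across processes.
assert bucket("new_checkout", "user-123") == bucket("new_checkout", "user-123")

# A user enabled at 10% stays enabled when the rollout widens to 50%,
# because bucket < 10 implies bucket < 50.
```

Including the flag key in the hash also means different flags bucket the same user independently, so the same 10% of users aren't the guinea pigs for every experiment.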
4.3 Feature Flag Best Practices
FEATURE FLAG BEST PRACTICES
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. HAVE A LIFECYCLE │
│ ─────────────────── │
│ Flags should be temporary! │
│ │
│ Lifecycle: │
│ ├── Create flag (with owner and expiration) │
│ ├── Roll out gradually │
│ ├── Enable for 100% │
│ ├── REMOVE FLAG from code │
│ └── Archive flag │
│ │
│ Stale flags are tech debt. │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 2. KEEP FLAGS SIMPLE │
│ ──────────────────── │
│ Flags should gate features, not business logic. │
│ │
│ ✅ GOOD: feature_flags.is_enabled("new_search") │
│ ❌ BAD: feature_flags.get_value("search_algorithm_config") │
│ │
│ Complex configuration should be in config files, not flags. │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 3. TEST BOTH PATHS │
│ ────────────────── │
│ Every flag creates a code branch. Test both. │
│ │
│ def test_feature_enabled(): │
│ with flag_enabled("new_checkout"): │
│ assert checkout() == expected_new │
│ │
│ def test_feature_disabled(): │
│ with flag_disabled("new_checkout"): │
│ assert checkout() == expected_old │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 4. HAVE KILL SWITCHES │
│ ───────────────────── │
│ Some flags should always exist for operational control. │
│ │
│ Examples: │
│ ├── disable_expensive_queries │
│ ├── enable_read_only_mode │
│ ├── disable_external_integrations │
│ └── enable_maintenance_mode │
│ │
│ These are not temporary — they're operational controls. │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 5. MONITOR FLAG USAGE │
│ ───────────────────── │
│ Track: │
│ ├── How often each flag is evaluated │
│ ├── Which flags are always true/false (candidates for removal) │
│ ├── Which flags are never evaluated (dead code) │
│ └── Errors in flag evaluation │
│ │
└────────────────────────────────────────────────────────────────────────┘
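The "test both paths" practice is simplest with a small override helper. The sketch below shows one way the flag_enabled / flag_disabled helpers from the box above could work; the names and the module-level override store are assumptions for illustration, and a real test suite would patch FeatureFlagService instead.

```python
from contextlib import contextmanager

# Hypothetical test-only override store.
_overrides = {}

@contextmanager
def flag_override(key, value):
    """Force a flag on or off for the duration of a test."""
    _overrides[key] = value
    try:
        yield
    finally:
        _overrides.pop(key, None)

def flag_enabled(key):
    return flag_override(key, True)

def flag_disabled(key):
    return flag_override(key, False)

def is_enabled(key, default=False):
    """Evaluation checks test overrides before normal flag logic."""
    if key in _overrides:
        return _overrides[key]
    return default
```

The try/finally guarantees the override is removed even if the test body raises, so one failing test cannot leak flag state into the next.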
Part II: Advanced Topics
Chapter 5: Database Migrations
5.1 The Database Migration Problem
THE DATABASE MIGRATION CHALLENGE
Code deployments are (relatively) easy:
├── Deploy new code
├── If broken, rollback to old code
└── Done
Database changes are HARD:
├── Can't just "rollback" a migration
├── Schema changes affect running code
├── Data migrations can take hours
├── Both old and new code must work with DB
└── Mistakes can cause data loss
The key insight:
During deployment, BOTH versions of code are running.
Your database must support BOTH versions.
5.2 Expand-Contract Pattern
EXPAND-CONTRACT (PARALLEL CHANGE) PATTERN
The safe way to make breaking schema changes.
Example: Rename column "name" to "full_name"
❌ WRONG: Direct rename
1. Deploy migration: ALTER TABLE users RENAME COLUMN name TO full_name;
2. Deploy code that uses "full_name"
Problem: During deployment, old code (using "name") and new code
(using "full_name") are both running. One of them will break.
✅ RIGHT: Expand-Contract (3-phase migration)
PHASE 1: EXPAND
- Add new column
- Write to both columns
- Read from old column
Migration:
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
Code:
# Write to both
UPDATE users SET name = $1, full_name = $1 WHERE id = $2
# Read from old
SELECT name FROM users WHERE id = $1
Deploy code first, then run migration.
Both old and new code work.
PHASE 2: MIGRATE
- Backfill data
- Switch reads to new column
- Continue writing to both
Migration:
UPDATE users SET full_name = name WHERE full_name IS NULL;
Code:
# Write to both (for safety)
UPDATE users SET name = $1, full_name = $1 WHERE id = $2
# Read from new
SELECT full_name FROM users WHERE id = $1
Deploy code first, then run migration.
PHASE 3: CONTRACT
- Stop writing to old column
- Remove old column
Code:
# Write to new only
UPDATE users SET full_name = $1 WHERE id = $2
# Read from new
SELECT full_name FROM users WHERE id = $1
Migration (after all old code is gone):
ALTER TABLE users DROP COLUMN name;
Timeline:
═══════════════════════════════════════════════════════════════════►
│ │ │ │ │
│ Original│ Expand │ Migrate │ Contract │
│ │ │ │ │
│ name │ name + │ name + │ full_name │
│ only │ full_name │ full_name │ only │
│ │ (write both) │ (read new) │ │
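The application side of these phases can be sketched as a small data-access layer whose reads and writes are driven by the current phase. (The `UserRepository` class and `db` interface below are illustrative, not part of the migration text above.)

```python
from enum import Enum

class Phase(Enum):
    EXPAND = 1    # write both columns, read the old one
    MIGRATE = 2   # write both columns, read the new one
    CONTRACT = 3  # write and read the new column only

class UserRepository:
    """Data-access layer whose behavior tracks the migration phase."""

    def __init__(self, db, phase: Phase):
        self.db = db
        self.phase = phase

    def set_name(self, user_id, value):
        if self.phase in (Phase.EXPAND, Phase.MIGRATE):
            # Dual-write: old and new code both see a consistent value
            self.db.execute(
                "UPDATE users SET name = $1, full_name = $1 WHERE id = $2",
                (value, user_id))
        else:
            self.db.execute(
                "UPDATE users SET full_name = $1 WHERE id = $2",
                (value, user_id))

    def get_name(self, user_id):
        column = "name" if self.phase is Phase.EXPAND else "full_name"
        row = self.db.fetch_one(
            f"SELECT {column} FROM users WHERE id = $1", (user_id,))
        return row[0]
```

Each deployment changes only the phase, so at every step the rows in the table satisfy both the old and the new code.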
5.3 Migration Implementation
# deployment/migrations.py
"""
Safe database migration patterns.
Implements expand-contract for breaking changes.
"""
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional
import logging

logger = logging.getLogger(__name__)
class MigrationPhase(Enum):
    EXPAND = "expand"
    MIGRATE = "migrate"
    CONTRACT = "contract"
@dataclass
class Migration:
    """A database migration."""
    version: str
    name: str
    phase: MigrationPhase
    # SQL to run
    up_sql: str
    down_sql: str
    # Validation
    pre_check_sql: Optional[str] = None   # Must pass before running
    post_check_sql: Optional[str] = None  # Must pass after running
    # Safety
    requires_lock: bool = False           # Requires table lock
    estimated_duration: str = ""          # e.g., "5 minutes", "2 hours"
    affects_writes: bool = False          # Will block writes

class MigrationError(Exception):
    """Raised when a migration or one of its checks fails."""

@dataclass
class MigrationResult:
    """Outcome of a single migration run."""
    migration: Migration
    started_at: datetime
    completed_at: Optional[datetime] = None
    success: bool = False
    error: Optional[str] = None
class SafeMigrationRunner:
    """
    Runs migrations safely with validation and rollback support.
    """

    def __init__(self, db_connection, lock_manager):
        self.db = db_connection
        self.locks = lock_manager

    async def run_migration(
        self,
        migration: Migration,
        dry_run: bool = False
    ) -> MigrationResult:
        """
        Run a migration with safety checks.
        """
        logger.info(f"Starting migration: {migration.name}")
        result = MigrationResult(
            migration=migration,
            started_at=datetime.utcnow()
        )
        try:
            # Pre-check
            if migration.pre_check_sql:
                check_result = await self.db.fetch_one(migration.pre_check_sql)
                if not check_result or not check_result[0]:
                    raise MigrationError("Pre-check failed")

            # Acquire lock if needed
            if migration.requires_lock:
                await self.locks.acquire("migration_lock")

            # Run migration
            if not dry_run:
                await self.db.execute(migration.up_sql)
            else:
                logger.info(f"DRY RUN: Would execute: {migration.up_sql}")

            # Post-check
            if migration.post_check_sql:
                check_result = await self.db.fetch_one(migration.post_check_sql)
                if not check_result or not check_result[0]:
                    raise MigrationError("Post-check failed, rolling back")

            # Record success
            await self._record_migration(migration)

            result.success = True
            result.completed_at = datetime.utcnow()
            logger.info(f"Migration complete: {migration.name}")
            return result

        except Exception as e:
            logger.error(f"Migration failed: {e}")

            # Attempt rollback
            if not dry_run:
                try:
                    await self.db.execute(migration.down_sql)
                    logger.info("Rollback successful")
                except Exception as rollback_error:
                    logger.error(f"Rollback failed: {rollback_error}")

            result.success = False
            result.error = str(e)
            return result

        finally:
            if migration.requires_lock:
                await self.locks.release("migration_lock")

    async def _record_migration(self, migration: Migration) -> None:
        # Record the applied version in a bookkeeping table
        # (table name and parameter style are assumptions)
        await self.db.execute(
            "INSERT INTO schema_migrations (version, name) VALUES ($1, $2)",
            migration.version,
            migration.name,
        )
# =============================================================================
# EXAMPLE: RENAME COLUMN MIGRATION SET
# =============================================================================
RENAME_COLUMN_MIGRATIONS = [
    # Phase 1: Expand - Add new column
    Migration(
        version="2024_01_15_001",
        name="add_full_name_column",
        phase=MigrationPhase.EXPAND,
        up_sql="""
            ALTER TABLE users
            ADD COLUMN full_name VARCHAR(255);

            -- Trigger to keep columns in sync during transition
            -- (covers INSERTs and UPDATEs from old code that only sets "name")
            CREATE OR REPLACE FUNCTION sync_name_columns()
            RETURNS TRIGGER AS $$
            BEGIN
                IF TG_OP = 'INSERT' THEN
                    NEW.full_name := COALESCE(NEW.full_name, NEW.name);
                ELSIF NEW.name IS DISTINCT FROM OLD.name THEN
                    NEW.full_name := NEW.name;
                ELSIF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
                    NEW.name := NEW.full_name;
                END IF;
                RETURN NEW;
            END;
            $$ LANGUAGE plpgsql;

            CREATE TRIGGER sync_names
            BEFORE INSERT OR UPDATE ON users
            FOR EACH ROW EXECUTE FUNCTION sync_name_columns();
        """,
        down_sql="""
            DROP TRIGGER IF EXISTS sync_names ON users;
            DROP FUNCTION IF EXISTS sync_name_columns();
            ALTER TABLE users DROP COLUMN IF EXISTS full_name;
        """,
        post_check_sql="""
            SELECT EXISTS (
                SELECT 1 FROM information_schema.columns
                WHERE table_name = 'users' AND column_name = 'full_name'
            );
        """
    ),

    # Phase 2: Migrate - Backfill data
    Migration(
        version="2024_01_16_001",
        name="backfill_full_name",
        phase=MigrationPhase.MIGRATE,
        up_sql="""
            -- Backfill in batches to avoid long-held locks
            DO $$
            DECLARE
                batch_size INT := 10000;
                updated INT;
            BEGIN
                LOOP
                    UPDATE users
                    SET full_name = name
                    WHERE full_name IS NULL
                    AND id IN (
                        SELECT id FROM users
                        WHERE full_name IS NULL
                          AND name IS NOT NULL  -- skip NULL names to avoid an endless loop
                        LIMIT batch_size
                    );
                    GET DIAGNOSTICS updated = ROW_COUNT;
                    EXIT WHEN updated = 0;
                    -- Small pause between batches
                    PERFORM pg_sleep(0.1);
                END LOOP;
            END $$;
        """,
        down_sql="-- No rollback needed for backfill",
        post_check_sql="""
            SELECT COUNT(*) = 0
            FROM users
            WHERE full_name IS NULL AND name IS NOT NULL;
        """,
        estimated_duration="30 minutes for 1M rows"
    ),

    # Phase 3: Contract - Remove old column (after code fully migrated)
    Migration(
        version="2024_01_20_001",
        name="remove_name_column",
        phase=MigrationPhase.CONTRACT,
        up_sql="""
            DROP TRIGGER IF EXISTS sync_names ON users;
            DROP FUNCTION IF EXISTS sync_name_columns();
            ALTER TABLE users DROP COLUMN name;
        """,
        down_sql="""
            ALTER TABLE users ADD COLUMN name VARCHAR(255);
            UPDATE users SET name = full_name;
        """,
        pre_check_sql="""
            -- Ensure no code is reading from 'name' column
            -- This should be checked via application logs/metrics
            SELECT true;
        """,
        requires_lock=True
    ),
]
Chapter 6: Rollback Strategies
6.1 Types of Rollback
ROLLBACK STRATEGIES
┌────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. CODE ROLLBACK │
│ ───────────────── │
│ Roll back to previous code version. │
│ │
│ Methods: │
│ ├── kubectl rollout undo deployment/myapp │
│ ├── Deploy previous version tag │
│ ├── Blue-green: switch back to previous environment │
│ └── Revert git commit, redeploy │
│ │
│ Speed: Fast (seconds to minutes) │
│ Risk: Low if code is backward compatible │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 2. FEATURE ROLLBACK │
│ ──────────────────── │
│ Turn off feature flag without code change. │
│ │
│ Methods: │
│ ├── Flip flag in feature flag service │
│ ├── No deployment needed │
│ └── Instant effect │
│ │
│ Speed: Instant (seconds) │
│ Risk: Very low │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 3. CONFIG ROLLBACK │
│ ────────────────── │
│ Revert configuration change. │
│ │
│ Methods: │
│ ├── Update ConfigMap/environment variables │
│ ├── Rolling restart to pick up new config │
│ └── Some systems support hot reload │
│ │
│ Speed: Minutes │
│ Risk: Low to medium │
│ │
│ ──────────────────────────────────────────────────────────────────── │
│ │
│ 4. DATA ROLLBACK │
│ ───────────────── │
│ Revert data changes. THE HARDEST TYPE. │
│ │
│ Methods: │
│ ├── Restore from backup (slow, nuclear option) │
│ ├── Run compensating transactions │
│ ├── Point-in-time recovery │
│ └── Manual data fixes │
│ │
│ Speed: Slow (minutes to hours) │
│ Risk: High (potential data loss) │
│ │
│ BEST APPROACH: Avoid needing data rollback │
│ ├── Test migrations thoroughly │
│ ├── Use expand-contract pattern │
│ ├── Make changes reversible │
│ └── Take backups before risky changes │
│ │
└────────────────────────────────────────────────────────────────────────┘
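Because feature rollback is pure configuration, even a minimal flag store makes it concrete. A sketch of percentage-based flags (in-memory and hypothetical; production systems use a service such as LaunchDarkly or Split):

```python
import hashlib

class FeatureFlags:
    """Minimal in-memory flag store with percentage rollout."""

    def __init__(self):
        self._flags = {}  # name -> (enabled, rollout percentage)

    def set_flag(self, name, enabled, percentage=100):
        self._flags[name] = (enabled, percentage)

    def is_enabled(self, name, user_id) -> bool:
        enabled, pct = self._flags.get(name, (False, 0))
        if not enabled:
            return False  # the "kill switch": disabling is instant
        # Deterministic bucket so each user gets a stable experience
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < pct

flags = FeatureFlags()
flags.set_flag("new_checkout", True, percentage=10)  # 10% rollout
flags.set_flag("new_checkout", False)                # instant rollback
```

No deployment is involved in either line at the bottom, which is exactly why feature rollback is the fastest option in the table above.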
6.2 Automated Rollback
# deployment/auto_rollback.py
"""
Automated rollback based on metrics.
If deployment causes problems, automatically roll back.
"""
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
import asyncio
import logging

logger = logging.getLogger(__name__)
@dataclass
class RollbackConfig:
    """Configuration for automated rollback."""
    # Metric thresholds
    max_error_rate: float = 0.05      # 5%
    max_latency_p99_ms: int = 1000    # 1 second
    min_success_rate: float = 0.95    # 95%
    # Timing
    observation_window_seconds: int = 300  # 5 minutes
    check_interval_seconds: int = 30
    # Behavior
    auto_rollback_enabled: bool = True
    require_confirmation: bool = False  # For critical services

@dataclass
class RollbackResult:
    """Outcome of a deployment monitoring window."""
    rolled_back: bool
    metrics: dict
    reason: Optional[str] = None
class AutomatedRollbackMonitor:
    """
    Monitors deployments and triggers rollback if metrics degrade.
    """

    def __init__(
        self,
        metrics_client,
        deployer,
        alerter,
        config: Optional[RollbackConfig] = None
    ):
        self.metrics = metrics_client
        self.deployer = deployer
        self.alerter = alerter
        self.config = config or RollbackConfig()

    async def monitor_deployment(
        self,
        deployment_id: str,
        baseline_metrics: dict
    ) -> RollbackResult:
        """
        Monitor a deployment and roll back if needed.
        """
        logger.info(f"Starting rollback monitor for deployment {deployment_id}")
        start_time = datetime.utcnow()
        end_time = start_time + timedelta(
            seconds=self.config.observation_window_seconds
        )
        current_metrics: dict = {}

        while datetime.utcnow() < end_time:
            await asyncio.sleep(self.config.check_interval_seconds)

            # Check metrics
            current_metrics = await self._get_current_metrics()

            # Compare to baseline and thresholds
            should_rollback, reason = self._should_rollback(
                baseline_metrics,
                current_metrics
            )

            if should_rollback:
                logger.warning(f"Rollback triggered: {reason}")

                if self.config.require_confirmation:
                    # Alert and wait for confirmation
                    await self.alerter.send_critical(
                        f"Deployment {deployment_id} may need rollback: {reason}. "
                        f"Reply 'ROLLBACK' to confirm."
                    )
                    # In a real system, this would wait for a response
                elif self.config.auto_rollback_enabled:
                    await self._execute_rollback(deployment_id, reason)
                    return RollbackResult(
                        rolled_back=True,
                        reason=reason,
                        metrics=current_metrics
                    )

        logger.info(f"Deployment {deployment_id} passed monitoring period")
        return RollbackResult(rolled_back=False, metrics=current_metrics)

    async def _get_current_metrics(self) -> dict:
        # Fetch the deployment's key metrics
        # (the metrics client interface here is an assumption)
        return await self.metrics.get_metrics(
            ["error_rate", "latency_p99_ms", "success_rate"]
        )

    def _should_rollback(
        self,
        baseline: dict,
        current: dict
    ) -> tuple[bool, Optional[str]]:
        """
        Determine if rollback is needed.
        """
        # Check absolute thresholds
        if current['error_rate'] > self.config.max_error_rate:
            return True, f"Error rate {current['error_rate']:.2%} exceeds threshold"

        if current['latency_p99_ms'] > self.config.max_latency_p99_ms:
            return True, f"Latency {current['latency_p99_ms']}ms exceeds threshold"

        if current['success_rate'] < self.config.min_success_rate:
            return True, f"Success rate {current['success_rate']:.2%} below threshold"

        # Check relative degradation vs baseline
        if baseline:
            error_increase = current['error_rate'] - baseline['error_rate']
            if error_increase > 0.02:  # 2 percentage points
                return True, f"Error rate increased by {error_increase:.2%}"

            latency_increase_pct = (
                (current['latency_p99_ms'] - baseline['latency_p99_ms']) /
                baseline['latency_p99_ms']
            ) if baseline['latency_p99_ms'] > 0 else 0
            if latency_increase_pct > 0.5:  # 50% increase
                return True, f"Latency increased by {latency_increase_pct:.0%}"

        return False, None

    async def _execute_rollback(self, deployment_id: str, reason: str):
        """Execute the rollback."""
        logger.warning(f"Executing rollback for {deployment_id}")

        # Send alert
        await self.alerter.send_critical(
            f"AUTO-ROLLBACK executed for deployment {deployment_id}. "
            f"Reason: {reason}"
        )

        # Execute rollback
        await self.deployer.rollback(deployment_id)

        logger.info(f"Rollback complete for {deployment_id}")
Part III: Real-World Application
Chapter 7: Case Studies
7.1 How Amazon Deploys
AMAZON'S DEPLOYMENT PRACTICES
Scale:
├── Thousands of deployments per day
├── Thousands of services
├── Millions of servers
└── Zero-downtime expectation
Key Practices:
1. ONE-BOX DEPLOYMENT
├── Deploy to single instance first
├── Run tests and monitor
├── If healthy, proceed to wider rollout
└── If not, automatic rollback
2. WAVE DEPLOYMENTS
├── Deploy in waves: 1% → 5% → 10% → 25% → 50% → 100%
├── Bake time between waves
├── Automated health checks between waves
└── Automatic rollback if metrics degrade
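The wave pattern reduces to a small control loop: widen traffic, bake, check health, and stop at the first sign of trouble. (The callback names and the default bake time below are illustrative, not Amazon's internals.)

```python
import time

WAVES = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]  # fraction of fleet per wave

def wave_deploy(set_traffic, is_healthy, rollback, bake_seconds=600):
    """Widen the rollout wave by wave; abort on the first failed check."""
    for wave in WAVES:
        set_traffic(wave)          # shift this fraction to the new version
        time.sleep(bake_seconds)   # bake: let metrics accumulate
        if not is_healthy():       # automated health check between waves
            rollback()             # degraded metrics -> roll everything back
            return False
    return True                    # all waves passed; deployment complete
```

A one-box deployment is just this loop with a first "wave" of a single instance.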
3. REGION-BY-REGION
├── Deploy to one region first
├── Verify in production
├── Then roll out to other regions
└── Can quickly isolate issues
4. FEATURE FLAGS
├── Deploy code dark
├── Enable features gradually
├── A/B testing for new features
└── Kill switches for all major features
5. APOLLO (Config Management)
├── Separate config from code
├── Config changes without deploy
├── Instant config propagation
└── Versioned config with rollback
Lessons:
├── Small changes are safer than big changes
├── Automate everything
├── Monitor everything
├── Roll back fast
└── Deploy often to reduce batch size
7.2 How Google Does It
GOOGLE'S DEPLOYMENT APPROACH
Philosophy:
"Make rollbacks easy and deployments boring."
Key Systems:
1. BORG/KUBERNETES
├── Declarative deployment
├── Automated rollout/rollback
├── Health checks built-in
└── Self-healing
2. STAGED ROLLOUTS
├── Canary first
├── Automated analysis
├── Progressive percentage increase
└── Cross-cluster, cross-region
3. RELEASE TRAINS
├── Regular release cadence
├── Features either make the train or wait
├── Predictable release schedule
└── Reduces urgency and risk
4. BINARY VS CONFIG
├── Binary releases: Weekly or less frequent
├── Config changes: Anytime
├── Most "releases" are config changes
└── Reduces code deployment risk
5. CANARYING EVERYTHING
├── Code changes: Canary
├── Config changes: Canary
├── Capacity changes: Canary
└── "Everything that can break should canary"
SRE Integration:
├── SRE team owns deployment tools
├── Error budget gates deployments
├── If SLO violated, deployment paused
└── Reliability is a gating criterion
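The error-budget gate can be expressed in a few lines. (The SLO numbers and the 10% remaining-budget floor below are illustrative choices, not prescribed values.)

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the window's error budget still unspent (0.0 to 1.0)."""
    allowed_failures = (1 - slo_target) * total_events  # the budget
    actual_failures = total_events - good_events
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1 - actual_failures / allowed_failures)

def deployment_allowed(slo_target, good_events, total_events, min_budget=0.10):
    """Gate: pause feature deployments once the budget is nearly spent."""
    return error_budget_remaining(slo_target, good_events, total_events) >= min_budget

# 99.9% SLO over 1M requests: the budget is 1,000 failed requests.
# 500 failures leaves half the budget (deploys allowed);
# 1,100 failures blows the budget (deploys paused).
```

When the gate closes, only reliability fixes ship until the budget recovers.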
Chapter 8: Common Mistakes
DEPLOYMENT ANTI-PATTERNS
❌ MISTAKE 1: Big Bang Deployments
Wrong:
├── Save up changes for months
├── Deploy everything at once
└── Hope for the best
Problems:
├── Huge blast radius if something breaks
├── Hard to identify which change caused issues
└── Rollback affects all changes
Right:
├── Deploy frequently (daily or more)
├── Small, incremental changes
└── Easy to identify and rollback specific changes
❌ MISTAKE 2: No Rollback Plan
Wrong:
"We'll figure out rollback if we need it"
Problems:
├── Panic when things break
├── Unclear process under pressure
└── Longer outages
Right:
├── Test rollback BEFORE you need it
├── Document rollback procedure
├── Automate if possible
└── Practice regularly
❌ MISTAKE 3: Database Changes with Code
Wrong:
├── Code and schema change in same deployment
├── Migration runs during deployment
└── If either fails, both rollback needed
Problems:
├── Complex rollback
├── Data migration might not be reversible
└── Extended downtime
Right:
├── Separate database changes from code changes
├── Use expand-contract pattern
├── Database changes first, then code
└── Each can roll back independently
❌ MISTAKE 4: Deploy on Friday
Wrong:
"Let's ship this before the weekend!"
Problems:
├── Reduced staffing if issues
├── Tired people making decisions
└── Customer impact over weekend
Right:
├── Deploy early in the week
├── Deploy early in the day
├── Have full team available for monitoring
└── Friday deployments only for emergencies
❌ MISTAKE 5: No Monitoring During Deploy
Wrong:
├── Deploy and walk away
├── Assume success
└── Find out about problems from customers
Right:
├── Watch dashboards during deploy
├── Have alerts set for deployment metrics
├── Bake time with active monitoring
└── Only declare success after observation period
Part IV: Interview Preparation
Chapter 9: Interview Tips
9.1 Deployment Discussion Framework
DISCUSSING DEPLOYMENTS IN INTERVIEWS
When asked "How would you deploy changes to this system?":
1. ASSESS RISK
"First, I'd assess the risk of this change.
Is it a small bug fix or a major feature?
Does it touch critical paths like payments?
Does it require database changes?"
2. CHOOSE STRATEGY
"Based on risk, I'd choose a deployment strategy:
- Low risk: Rolling deployment with monitoring
- Medium risk: Canary with automated analysis
- High risk: Blue-green with feature flag"
3. EXPLAIN SAFETY MECHANISMS
"I'd ensure safety through:
- Health checks before declaring success
- Metrics comparison (canary vs stable)
- Automated rollback if thresholds exceeded"
4. ADDRESS DATABASE CHANGES
"For database changes, I'd use expand-contract:
- Phase 1: Add new column/table
- Phase 2: Backfill data
- Phase 3: Switch code to use new schema
- Phase 4: Remove old schema"
5. DISCUSS ROLLBACK
"If something goes wrong:
- Blue-green: Instant switch back
- Canary: Remove canary traffic
- Feature flag: Turn off flag
- Code: Rollback to previous version"
9.2 Key Phrases
DEPLOYMENT KEY PHRASES
On Strategy Selection:
"I match deployment strategy to risk.
For a new payment feature, I'd use canary with
aggressive monitoring. For a UI tweak,
a standard rolling deployment is fine."
On Canary:
"Canary lets me compare the new version against
the stable version in production with real traffic.
If error rate increases or latency degrades,
I roll back before most users are affected."
On Feature Flags:
"Feature flags decouple deployment from release.
I can deploy code anytime, then enable the feature
gradually. If something's wrong, turning off
the flag is instant — no deployment needed."
On Database Migrations:
"I never make breaking schema changes directly.
I use expand-contract: add the new structure,
migrate data, switch the code, then remove
the old structure. This way, both old and new
code work at every step."
On Rollback:
"The first thing I consider with any change is:
how do I undo this? If I can't answer that,
I need to rethink the approach. Rollback
should be fast and safe."
Chapter 10: Practice Problems
Problem 1: E-commerce Checkout Deployment
Scenario: You're deploying a new checkout flow. How would you do it safely?
Questions:
- What deployment strategy would you use?
- What metrics would you monitor?
- What's your rollback plan?
Strategy: Canary with feature flag
- Deploy code with feature flag OFF
- Enable for 1% of users, monitor
- Gradually increase to 100%
Metrics:
- Checkout completion rate
- Error rate
- Payment success rate
- Latency p99
- Cart abandonment rate
Rollback:
- Turn off feature flag (instant)
- If persistent issue, roll back code deployment
Problem 2: Database Column Rename
Scenario:
You need to rename a column from email to primary_email. The table has 10 million rows.
Questions:
- What's the safest approach?
- What are the migration phases?
- How do you handle the transition period?
Approach: Expand-Contract
- Never directly rename a column
- Three-phase migration
Phases:
- Expand: Add primary_email column
- Migrate: Backfill data, switch code to use new column
- Contract: Remove email column
Transition:
- Code writes to BOTH columns
- Code reads from new column (with fallback to old)
- Once all code updated, remove old column
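The "read from new column with fallback to old" step of the transition can be as small as this helper (the function name and dict-row shape are illustrative):

```python
def get_primary_email(row: dict):
    # During the transition, prefer the new column; fall back to the
    # old one for rows that have not been backfilled yet.
    value = row.get("primary_email")
    return value if value is not None else row.get("email")
```

Once the backfill is verified complete, the fallback branch becomes dead code and can be removed along with the old column.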
Chapter 11: Sample Interview Dialogue
Interviewer: "You have a new feature that changes how we calculate shipping costs. How would you deploy it?"
You: "Let me think through this systematically. First, let me ask a few questions:
- How critical is shipping cost accuracy? Are mistakes costly?
- Is there a database component to this change?
- How much traffic does this code path see?"
Interviewer: "It's a high-volume checkout flow. Mistakes in shipping costs directly impact revenue. There's a new table for shipping zones."
You: "Given the high risk, I'd use a careful approach:
Phase 1: Database changes first"
├── Create new shipping_zones table
├── Populate with data
├── No code changes yet
└── Verify data is correct
"Phase 2: Deploy code behind feature flag"
├── Deploy new calculation code with flag OFF
├── Both old and new code paths exist
└── No user impact yet
"Phase 3: Shadow testing"
├── Run new calculation alongside old
├── Log both results, compare
├── Look for discrepancies
└── Fix any issues found
"Phase 4: Gradual rollout"
├── Enable for internal users first
├── Then 1% of customers
├── Monitor:
│ ├── Shipping cost accuracy
│ ├── Checkout completion rate
│ ├── Customer complaints
│ └── Revenue metrics
└── Gradually increase to 100%
Interviewer: "What if you find a bug at 10% rollout?"
You: "I'd turn off the feature flag immediately. Users would get the old calculation. Since the old code is still there, this is instant.
Then I'd analyze the bug, fix it, and start the rollout again from shadow testing.
The beauty of feature flags is that rollback is just a configuration change — no deployment needed."
Interviewer: "Good. How would you handle a bug in the database changes?"
You: "That's trickier. Database changes are harder to roll back.
For the shipping_zones table:
- It's additive (new table), so it doesn't break existing code
- If the data is wrong, I can update it without schema changes
- If the schema itself is wrong, I'd create a new table with the correct schema, migrate data, then drop the old one
The key is making database changes backward compatible. I never put code that depends on new schema in the same release as the schema change."
Summary
┌────────────────────────────────────────────────────────────────────────┐
│ DAY 3 KEY TAKEAWAYS │
│ │
│ DEPLOYMENT STRATEGIES: │
│ ├── Rolling: Gradual replacement, no downtime │
│ ├── Blue-Green: Instant switch, instant rollback │
│ ├── Canary: Gradual traffic shift with comparison │
│ └── Feature Flags: Decouple deployment from release │
│ │
│ SAFETY MECHANISMS: │
│ ├── Health checks before proceeding │
│ ├── Metric comparison (canary vs stable) │
│ ├── Automated rollback on threshold breach │
│ └── Observation periods between stages │
│ │
│ DATABASE MIGRATIONS: │
│ ├── Use expand-contract pattern │
│ ├── Never make breaking changes directly │
│ ├── Separate schema changes from code changes │
│ └── Both old and new code must work during transition │
│ │
│ ROLLBACK: │
│ ├── Code: Deploy previous version (fast) │
│ ├── Feature: Turn off flag (instant) │
│ ├── Config: Revert configuration (fast) │
│ └── Data: Restore from backup (slow, avoid) │
│ │
│ BEST PRACTICES: │
│ ├── Deploy frequently (reduces batch size) │
│ ├── Deploy early in the week (more support) │
│ ├── Monitor during deployment │
│ ├── Have a rollback plan before deploying │
│ └── Automate everything possible │
│ │
│ KEY INSIGHT: │
│ The safest deployment is a small, incremental one │
│ with automated monitoring and fast rollback. │
│ │
└────────────────────────────────────────────────────────────────────────┘
Further Reading
Books:
- "Continuous Delivery" by Jez Humble & David Farley
- "Accelerate" by Nicole Forsgren, Jez Humble, Gene Kim
Articles:
- Martin Fowler: "Blue Green Deployment"
- Google SRE Book: Chapter on Release Engineering
- AWS: "Blue/Green Deployments on AWS"
Tools:
- Kubernetes: Rolling updates, readiness probes
- Istio/Linkerd: Traffic shifting for canary
- LaunchDarkly/Split: Feature flag management
- Argo Rollouts: Progressive delivery for Kubernetes
End of Day 3: Deployment Strategies
Tomorrow: Day 4 — Capacity Planning. You can deploy safely. But how do you know you have enough capacity? How do you prepare for traffic spikes? How do you scale before you need to?
You now have the trifecta: Define health (SLOs). See health (observability). Maintain health through change (deployments). Tomorrow, we ensure health under load.