Week 6 — Day 5: Operations, Monitoring & Interview Mastery
System Design Mastery Series — Practical Application Week
Introduction
We've built a complete notification platform. Today we make it operable — the difference between a system that works and one that works at 3 AM when you're on-call.
Today's Theme: "A system you can't observe is a system you can't trust"
We'll cover:
- Observability (metrics, logs, traces)
- Alerting that doesn't cry wolf
- Runbooks for common incidents
- Deployment without downtime
- Complete interview walkthrough
Part I: Observability
Chapter 1: The Three Pillars
1.1 Metrics — What's Happening Now
```python
# observability/metrics.py
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import time
class MetricType(Enum):
COUNTER = "counter" # Only goes up (requests, errors)
GAUGE = "gauge" # Can go up/down (queue depth, connections)
HISTOGRAM = "histogram" # Distribution (latency percentiles)
@dataclass
class MetricDefinition:
name: str
type: MetricType
description: str
labels: list[str]
# Core notification metrics
NOTIFICATION_METRICS = [
# Ingestion
MetricDefinition(
"notifications_received_total",
MetricType.COUNTER,
"Total notifications received by API",
["priority", "type"]
),
MetricDefinition(
"notifications_validated_total",
MetricType.COUNTER,
"Notifications that passed validation",
["priority", "type"]
),
MetricDefinition(
"notifications_rejected_total",
MetricType.COUNTER,
"Notifications rejected by validation",
["reason"]
),
# Queue
MetricDefinition(
"kafka_messages_produced_total",
MetricType.COUNTER,
"Messages published to Kafka",
["topic"]
),
MetricDefinition(
"kafka_consumer_lag",
MetricType.GAUGE,
"Consumer lag per partition",
["topic", "partition", "consumer_group"]
),
# Routing
MetricDefinition(
"notifications_routed_total",
MetricType.COUNTER,
"Notifications routed to channels",
["channel", "priority"]
),
MetricDefinition(
"routing_duration_seconds",
MetricType.HISTOGRAM,
"Time to route a notification",
["priority"]
),
# Delivery
MetricDefinition(
"notifications_sent_total",
MetricType.COUNTER,
"Notifications sent to providers",
["channel", "provider", "status"]
),
MetricDefinition(
"delivery_duration_seconds",
MetricType.HISTOGRAM,
"Time to deliver notification",
["channel", "provider"]
),
MetricDefinition(
"delivery_retries_total",
MetricType.COUNTER,
"Delivery retry attempts",
["channel", "provider", "attempt"]
),
# Provider health
MetricDefinition(
"provider_requests_total",
MetricType.COUNTER,
"Requests to external providers",
["provider", "status_code"]
),
MetricDefinition(
"provider_latency_seconds",
MetricType.HISTOGRAM,
"Provider response latency",
["provider"]
),
MetricDefinition(
"circuit_breaker_state",
MetricType.GAUGE,
"Circuit breaker state (0=closed, 1=open, 2=half-open)",
["provider"]
),
# Business metrics
MetricDefinition(
"notifications_delivered_total",
MetricType.COUNTER,
"Confirmed delivered notifications",
["channel", "type"]
),
MetricDefinition(
"notification_end_to_end_seconds",
MetricType.HISTOGRAM,
"Time from receipt to delivery confirmation",
["channel", "priority"]
),
]
class MetricsService:
"""
Centralized metrics collection.
Uses Prometheus client library conventions.
"""
def __init__(self, prometheus_client):
self.prom = prometheus_client
self._counters = {}
self._gauges = {}
self._histograms = {}
self._register_metrics()
def _register_metrics(self):
"""Register all metric definitions."""
for metric in NOTIFICATION_METRICS:
if metric.type == MetricType.COUNTER:
self._counters[metric.name] = self.prom.Counter(
metric.name,
metric.description,
metric.labels
)
elif metric.type == MetricType.GAUGE:
self._gauges[metric.name] = self.prom.Gauge(
metric.name,
metric.description,
metric.labels
)
elif metric.type == MetricType.HISTOGRAM:
self._histograms[metric.name] = self.prom.Histogram(
metric.name,
metric.description,
metric.labels,
buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
def increment(self, name: str, labels: dict, value: int = 1):
"""Increment a counter."""
if name in self._counters:
self._counters[name].labels(**labels).inc(value)
def set_gauge(self, name: str, labels: dict, value: float):
"""Set a gauge value."""
if name in self._gauges:
self._gauges[name].labels(**labels).set(value)
def observe(self, name: str, labels: dict, value: float):
"""Observe a histogram value."""
if name in self._histograms:
self._histograms[name].labels(**labels).observe(value)
def time(self, name: str, labels: dict):
"""Context manager to time operations."""
return Timer(self, name, labels)
class Timer:
"""Context manager for timing operations."""
def __init__(self, metrics: MetricsService, name: str, labels: dict):
self.metrics = metrics
self.name = name
self.labels = labels
self.start = None
def __enter__(self):
self.start = time.perf_counter()
return self
def __exit__(self, *args):
duration = time.perf_counter() - self.start
        self.metrics.observe(self.name, self.labels, duration)
```
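To see how services use this, here is a minimal usage sketch (the worker function and label values are illustrative; it assumes the standard `prometheus_client` package):

```python
# Hypothetical usage sketch: wiring MetricsService into a delivery worker.
import prometheus_client

metrics = MetricsService(prometheus_client)

def record_push_delivery(success: bool, duration_s: float):
    # Count the attempt with its outcome
    metrics.increment(
        "notifications_sent_total",
        {"channel": "push", "provider": "fcm", "status": "success" if success else "failure"},
    )
    # Record how long the provider call took
    metrics.observe(
        "delivery_duration_seconds",
        {"channel": "push", "provider": "fcm"},
        duration_s,
    )

# Timing a block of work with the context manager:
with metrics.time("routing_duration_seconds", {"priority": "high"}):
    pass  # route_notification(...) would go here
```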
1.2 Key Dashboards
```python
# observability/dashboards.py
"""
Dashboard Definitions for Grafana
Organized by persona:
1. On-call engineer - Is the system healthy?
2. Product manager - Are users getting notifications?
3. Finance - How much are we spending?
"""
DASHBOARDS = {
"notification_overview": {
"title": "Notification System Overview",
"refresh": "10s",
"panels": [
{
"title": "Notifications/sec by Priority",
"type": "graph",
"query": 'sum(rate(notifications_received_total[1m])) by (priority)'
},
{
"title": "Delivery Success Rate",
"type": "stat",
"query": '''
sum(rate(notifications_sent_total{status="success"}[5m])) /
sum(rate(notifications_sent_total[5m])) * 100
'''
},
{
"title": "End-to-End Latency (p99)",
"type": "graph",
"query": 'histogram_quantile(0.99, sum(rate(notification_end_to_end_seconds_bucket[5m])) by (le, channel))'
},
{
"title": "Queue Depth",
"type": "graph",
"query": 'sum(kafka_consumer_lag) by (topic)'
},
]
},
"provider_health": {
"title": "Provider Health",
"panels": [
{
"title": "Provider Success Rate",
"type": "graph",
"query": '''
sum(rate(provider_requests_total{status_code=~"2.."}[5m])) by (provider) /
sum(rate(provider_requests_total[5m])) by (provider) * 100
'''
},
{
"title": "Provider Latency (p95)",
"type": "graph",
"query": 'histogram_quantile(0.95, sum(rate(provider_latency_seconds_bucket[5m])) by (le, provider))'
},
{
"title": "Circuit Breaker Status",
"type": "stat",
"query": 'circuit_breaker_state',
"mappings": {"0": "Closed", "1": "Open", "2": "Half-Open"}
},
]
},
"cost_tracking": {
"title": "Notification Costs",
"panels": [
{
"title": "Daily SMS Cost",
"type": "stat",
"query": 'sum(increase(notifications_sent_total{channel="sms", status="success"}[24h])) * 0.01'
},
{
"title": "Daily Email Cost",
"type": "stat",
"query": 'sum(increase(notifications_sent_total{channel="email", status="success"}[24h])) * 0.0001'
},
]
},
}
```
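These panel expressions can also be sanity-checked outside Grafana by running them against the Prometheus HTTP API. A small sketch, assuming the `requests` package and a placeholder Prometheus URL:

```python
# Sketch: evaluate a dashboard panel's query directly against Prometheus.
# PROMETHEUS_URL is a placeholder for your environment.
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def run_panel_query(query: str) -> list[dict]:
    """Execute an instant PromQL query and return the raw result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query},
        timeout=5,
    )
    resp.raise_for_status()
    body = resp.json()
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]

# Example: the "Delivery Success Rate" panel from the overview dashboard
success_rate = run_panel_query(
    DASHBOARDS["notification_overview"]["panels"][1]["query"]
)
```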
1.3 Structured Logging
```python
# observability/logging.py
import json
import logging
from datetime import datetime
from typing import Optional
class StructuredLogger:
"""
Structured logging for notification events.
All logs are JSON for easy parsing by log aggregators.
"""
def __init__(self, service_name: str):
self.service = service_name
self.logger = logging.getLogger(service_name)
def _log(
self,
level: str,
message: str,
notification_id: Optional[str] = None,
user_id: Optional[str] = None,
channel: Optional[str] = None,
provider: Optional[str] = None,
trace_id: Optional[str] = None,
**extra
):
"""Create structured log entry."""
entry = {
"timestamp": datetime.utcnow().isoformat(),
"level": level,
"service": self.service,
"message": message,
}
if notification_id:
entry["notification_id"] = notification_id
if user_id:
entry["user_id"] = user_id
if channel:
entry["channel"] = channel
if provider:
entry["provider"] = provider
if trace_id:
entry["trace_id"] = trace_id
entry.update(extra)
log_func = getattr(self.logger, level.lower())
log_func(json.dumps(entry))
def notification_received(
self,
notification_id: str,
user_id: str,
notification_type: str,
priority: str,
trace_id: str
):
self._log(
"INFO",
"Notification received",
notification_id=notification_id,
user_id=user_id,
trace_id=trace_id,
notification_type=notification_type,
priority=priority
)
def delivery_attempted(
self,
notification_id: str,
channel: str,
provider: str,
success: bool,
duration_ms: float,
trace_id: str,
error: Optional[str] = None
):
level = "INFO" if success else "WARNING"
self._log(
level,
"Delivery attempted",
notification_id=notification_id,
channel=channel,
provider=provider,
trace_id=trace_id,
success=success,
duration_ms=duration_ms,
error=error
)
def circuit_breaker_state_change(
self,
provider: str,
old_state: str,
new_state: str,
failure_count: int
):
self._log(
"WARNING",
"Circuit breaker state changed",
provider=provider,
old_state=old_state,
new_state=new_state,
failure_count=failure_count
        )
```
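A quick usage sketch showing the JSON this produces for a failed push attempt (the identifiers are illustrative, and `basicConfig` stands in for real logging setup):

```python
# Usage sketch with illustrative IDs.
import logging

logging.basicConfig(level=logging.INFO)
log = StructuredLogger("push-worker")

log.delivery_attempted(
    notification_id="notif-123",
    channel="push",
    provider="fcm",
    success=False,
    duration_ms=842.0,
    trace_id="abc-123",
    error="UNREGISTERED",
)
# Emits a single JSON line along the lines of:
# {"timestamp": "...", "level": "WARNING", "service": "push-worker",
#  "message": "Delivery attempted", "notification_id": "notif-123",
#  "channel": "push", "provider": "fcm", "trace_id": "abc-123",
#  "success": false, "duration_ms": 842.0, "error": "UNREGISTERED"}
```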
1.4 Distributed Tracing
```python
# observability/tracing.py
"""
Distributed tracing for notification flow.
Tracks a notification through:
API → Router → Worker → Provider → Callback
Example trace:
trace_id: abc-123
├─ [notification-api] receive_notification (5ms)
│ └─ tags: notification_id=xyz, user_id=123, priority=high
│
├─ [notification-api] validate (10ms)
│
├─ [notification-api] write_to_outbox (15ms)
│
├─ [notification-router] route (25ms)
│ ├─ [notification-router] fetch_preferences (8ms)
│ └─ [notification-router] publish_to_channels (15ms)
│
├─ [push-worker] deliver (150ms)
│ ├─ [push-worker] fetch_device_tokens (20ms)
│ └─ [push-worker] send_to_fcm (125ms)
│ └─ tags: provider=fcm, status=success
│
└─ [email-worker] deliver (500ms)
└─ [email-worker] send_to_sendgrid (490ms)
└─ tags: provider=sendgrid, status=success
"""
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
import uuid
@dataclass
class Span:
"""A single span in a trace."""
trace_id: str
span_id: str
parent_span_id: Optional[str]
operation_name: str
service_name: str
start_time: datetime
end_time: Optional[datetime] = None
    tags: dict = field(default_factory=dict)
@property
def duration_ms(self) -> float:
if not self.end_time:
return 0
return (self.end_time - self.start_time).total_seconds() * 1000
class TracingService:
"""Distributed tracing service."""
def __init__(self, exporter):
self.exporter = exporter
def start_trace(self, operation: str, service: str) -> Span:
"""Start a new trace."""
return Span(
trace_id=str(uuid.uuid4()),
span_id=str(uuid.uuid4()),
parent_span_id=None,
operation_name=operation,
service_name=service,
start_time=datetime.utcnow(),
tags={}
)
def start_span(
self,
trace_id: str,
parent_span_id: str,
operation: str,
service: str
) -> Span:
"""Start a child span."""
return Span(
trace_id=trace_id,
span_id=str(uuid.uuid4()),
parent_span_id=parent_span_id,
operation_name=operation,
service_name=service,
start_time=datetime.utcnow(),
tags={}
)
def finish_span(self, span: Span):
"""Finish and export span."""
span.end_time = datetime.utcnow()
        self.exporter.export(span)
```
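The piece not shown above is how trace context crosses the Kafka hop. One common approach is to carry `trace_id` and `parent_span_id` in the message (or its headers) so the worker's spans join the API's trace. A sketch, with an illustrative exporter and message shape:

```python
# Illustrative sketch: propagating trace context through a Kafka message.
# PrintExporter and the message fields are assumptions for the example.
class PrintExporter:
    def export(self, span: Span):
        print(f"{span.service_name} {span.operation_name} {span.duration_ms:.1f}ms")

tracer = TracingService(PrintExporter())

# API side: start the trace and attach its IDs to the outgoing message
root = tracer.start_trace("receive_notification", "notification-api")
message = {
    "notification_id": "notif-123",
    "trace_id": root.trace_id,
    "parent_span_id": root.span_id,
}
tracer.finish_span(root)

# Worker side: continue the same trace using the IDs from the message
span = tracer.start_span(
    trace_id=message["trace_id"],
    parent_span_id=message["parent_span_id"],
    operation="deliver",
    service="push-worker",
)
# ... call the provider here ...
tracer.finish_span(span)
```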
Part II: Alerting
Chapter 2: Alerts That Matter
2.1 Alert Definitions
```python
# observability/alerts.py
"""
Alert Definitions
Philosophy:
- Only alert on actionable issues
- Prefer symptoms over causes
- Include runbook links
- Avoid alert fatigue
"""
ALERTS = {
"critical": [
{
"name": "NotificationDeliveryRateDropped",
"description": "Notification delivery success rate dropped below 95%",
"query": '''
sum(rate(notifications_sent_total{status="success"}[5m])) /
sum(rate(notifications_sent_total[5m])) < 0.95
''',
"for": "5m",
"severity": "critical",
"runbook": "https://runbooks.internal/notifications/low-delivery-rate",
},
{
"name": "AllProvidersDown",
"description": "All providers for a channel are unavailable",
"query": '''
sum(circuit_breaker_state{channel="push"}) == count(circuit_breaker_state{channel="push"})
''',
"for": "2m",
"severity": "critical",
"runbook": "https://runbooks.internal/notifications/all-providers-down",
},
{
"name": "QueueBacklogCritical",
"description": "Kafka consumer lag is critically high",
"query": 'sum(kafka_consumer_lag{topic=~"notifications.*"}) > 1000000',
"for": "10m",
"severity": "critical",
"runbook": "https://runbooks.internal/notifications/queue-backlog",
},
],
"warning": [
{
"name": "ProviderCircuitOpen",
"description": "A provider circuit breaker is open",
"query": 'circuit_breaker_state > 0',
"for": "5m",
"severity": "warning",
"runbook": "https://runbooks.internal/notifications/circuit-open",
},
{
"name": "HighProviderLatency",
"description": "Provider latency is elevated",
"query": '''
histogram_quantile(0.95, sum(rate(provider_latency_seconds_bucket[5m])) by (le, provider)) > 2
''',
"for": "10m",
"severity": "warning",
},
{
"name": "DLQGrowing",
"description": "Dead letter queue is accumulating messages",
"query": 'increase(kafka_consumer_lag{topic=~"dlq.*"}[1h]) > 1000',
"for": "30m",
"severity": "warning",
"runbook": "https://runbooks.internal/notifications/dlq-growing",
},
],
}
```
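These definitions are Python dicts, not Prometheus rule files. A small sketch of how they might be rendered into Prometheus alerting-rules YAML (using PyYAML here is an assumption about tooling):

```python
# Sketch: render the ALERTS dict into a Prometheus alerting-rules document.
import yaml

def to_prometheus_rules(alerts: dict) -> str:
    groups = []
    for severity, rules in alerts.items():
        group = {"name": f"notifications-{severity}", "rules": []}
        for alert in rules:
            group["rules"].append({
                "alert": alert["name"],
                "expr": " ".join(alert["query"].split()),  # collapse whitespace
                "for": alert["for"],
                "labels": {"severity": alert["severity"]},
                "annotations": {
                    "description": alert["description"],
                    # Not every alert defines a runbook link
                    "runbook": alert.get("runbook", ""),
                },
            })
        groups.append(group)
    return yaml.safe_dump({"groups": groups}, sort_keys=False)

print(to_prometheus_rules(ALERTS))
```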
Part III: Runbooks
Chapter 3: Operational Runbooks
### 3.1 Runbook: Low Delivery Rate
```markdown
# Runbook: Notification Delivery Rate Dropped
## Alert
NotificationDeliveryRateDropped - Delivery success rate < 95%
## Impact
Users are not receiving notifications. Critical for transactional
and security notifications.
## Quick Diagnosis (< 2 minutes)
1. Check which channel is affected:
   ```promql
   sum(rate(notifications_sent_total{status="success"}[5m])) by (channel) /
   sum(rate(notifications_sent_total[5m])) by (channel)
   ```
2. Check provider status pages:
   - FCM: https://status.firebase.google.com/
   - SendGrid: https://status.sendgrid.com/
   - Twilio: https://status.twilio.com/
3. Check circuit breaker states:
   ```promql
   circuit_breaker_state
   ```
## Common Causes and Actions

### Provider Outage
Symptoms: Circuit breaker open, provider errors in logs
Action: Fail over to the backup provider
```bash
kubectl set env deployment/push-worker PRIMARY_PROVIDER=apns
```

### Invalid Tokens (Push)
Symptoms: High rate of "UNREGISTERED" errors
Action: Run token cleanup
```bash
kubectl create job --from=cronjob/token-cleanup token-cleanup-manual
```

### Rate Limiting
Symptoms: 429 errors from the provider
Action: Reduce the send rate
```bash
kubectl set env deployment/push-worker SEND_RATE_LIMIT=500
```
## Escalation
If not resolved in 15 minutes, escalate to the secondary on-call.
```
### 3.2 Runbook: Queue Backlog
```markdown
# Runbook: Kafka Queue Backlog
## Alert
QueueBacklogCritical - Consumer lag > 1,000,000 messages
## Impact
Notifications are delayed. User experience degraded.
## Quick Diagnosis
1. Identify affected topics:
   ```promql
   sum(kafka_consumer_lag) by (topic)
   ```
2. Check consumer health:
   ```bash
   kubectl get pods -l app=notification-worker
   ```
## Common Causes and Actions

### Marketing Campaign (Expected)
- High lag on the `notifications.low` topic only
- Verify critical topics are not affected
- Monitor but don't act
### Worker Crash Loop
```bash
# Check logs
kubectl logs -l app=notification-worker --tail=100

# Roll back if there was a recent deployment
kubectl rollout undo deployment/notification-worker
```

### Scale Workers
```bash
kubectl scale deployment/notification-worker --replicas=20
```
### Last Resort: Skip Expired Messages
For non-critical topics only:
```bash
kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BROKERS \
  --group notification-workers \
  --topic notifications.low \
  --reset-offsets --to-latest --execute
```
```
### 3.3 Runbook: Circuit Breaker Open
```markdown
# Runbook: Provider Circuit Breaker Open
## Alert
ProviderCircuitOpen - Circuit breaker in OPEN state
## Impact
Notifications for the affected channel are failing or falling back to another provider.
## Quick Diagnosis
1. Identify affected provider:
   ```promql
   circuit_breaker_state == 1
   ```
2. Check the provider status page
3. Check recent failures in the logs
## Actions by Provider

### FCM
- Check Firebase status
- Verify credentials: `kubectl get secret fcm-credentials`
- Test manually with curl

### SendGrid
- Check status page
- Verify API key
- Check IP reputation

### Twilio
- Check status page
- Verify account balance (sending stops at $0!)
- Check number-specific issues
## Force Recovery
If the provider is healthy but the circuit is still open:
```bash
kubectl exec -it redis-0 -- redis-cli DEL circuit:fcm:state
kubectl rollout restart deployment/push-worker
```
```
---
# Part IV: Deployment
## Chapter 4: Safe Deployments
### 4.1 Deployment Strategy
```python
# deployment/strategy.py
"""
Deployment Strategy
Principles:
1. Never deploy during peak hours
2. Always use canary deployment
3. Monitor metrics before full rollout
4. Have rollback plan ready
"""
DEPLOYMENT_CONFIG = {
"api": {
"strategy": "canary",
"canary_percentage": 10,
"canary_duration_minutes": 15,
"success_criteria": {
"error_rate": "< 1%",
"latency_p99": "< 500ms",
},
"rollback_on_failure": True,
},
"workers": {
"strategy": "rolling",
"max_unavailable": "25%",
"max_surge": "25%",
"min_ready_seconds": 30,
},
}
DEPLOYMENT_CHECKLIST = """
## Pre-Deployment Checklist
### Timing
- [ ] Not during peak hours (9-11 AM, 2-4 PM)
- [ ] Not during known campaigns
- [ ] Not Friday afternoon
### Preparation
- [ ] Changelog reviewed
- [ ] Database migrations tested
- [ ] Rollback procedure documented
### Deployment
- [ ] Canary deployed
- [ ] Canary metrics healthy (wait 15 min)
- [ ] Full rollout
- [ ] Verify metrics stable (wait 30 min)
### Post-Deployment
- [ ] Smoke tests passing
- [ ] No increase in error rate
- [ ] DLQ not growing
"""
### 4.2 Database Migration Safety
```python
# deployment/migrations.py
"""
Safe Database Migration Patterns
Rules:
1. Never remove columns immediately
2. Never rename columns directly
3. Always backward compatible
4. Always have rollback script
"""
MIGRATION_PATTERNS = {
"add_column": {
"safe": True,
"pattern": """
-- Safe: Adding nullable column
ALTER TABLE notifications
ADD COLUMN new_field VARCHAR(255);
""",
},
"remove_column": {
"safe": False,
"pattern": """
-- UNSAFE to drop directly; use a three-phase approach instead:
-- Phase 1: Stop writing to column
-- Phase 2: Stop reading from column
-- Phase 3: Drop column
ALTER TABLE notifications DROP COLUMN old_field;
""",
},
"add_index": {
"safe": "depends",
"pattern": """
-- UNSAFE on large tables (locks table)
CREATE INDEX idx_notifications_user ON notifications(user_id);
-- SAFE: Concurrent (PostgreSQL)
CREATE INDEX CONCURRENTLY idx_notifications_user
ON notifications(user_id);
""",
},
}
```
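The docstring also bans direct renames; the safe version follows the same expand-and-contract idea. A sketch in the same style, with illustrative column names:

```python
# Sketch: safe column rename via expand-and-contract (illustrative column names).
RENAME_COLUMN_PATTERN = {
    "safe": False,  # a direct RENAME COLUMN breaks code still using the old name
    "pattern": """
    -- UNSAFE: direct rename breaks readers still using the old name
    -- ALTER TABLE notifications RENAME COLUMN old_name TO new_name;

    -- SAFE: expand-and-contract
    -- Phase 1: Add the new column
    ALTER TABLE notifications ADD COLUMN new_name VARCHAR(255);
    -- Phase 2: Dual-write from the application, backfill existing rows
    UPDATE notifications SET new_name = old_name WHERE new_name IS NULL;
    -- Phase 3: Switch reads to new_name, stop writing old_name
    -- Phase 4: Drop the old column
    ALTER TABLE notifications DROP COLUMN old_name;
    """,
}
```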
Part V: Interview Mastery
Chapter 5: Complete Interview Walkthrough
5.1 The 45-Minute Framework
NOTIFICATION SYSTEM INTERVIEW FRAMEWORK
Time Allocation:
├── 0-5 min: Requirements clarification
├── 5-10 min: Back-of-envelope estimation
├── 10-25 min: High-level design
├── 25-40 min: Deep dives (2-3 topics)
└── 40-45 min: Wrap-up and questions
Key Topics to Cover:
1. Multi-channel delivery
2. Reliability (retries, idempotency, DLQ)
3. Scale (campaigns, hot users)
4. User preferences
5. Observability (mention briefly)
5.2 Sample Interview Dialogue
Interviewer: "Design a notification system for a fintech app."
You: "Before I start, let me clarify requirements.
What channels do we need?"
Interviewer: "Push, email, and SMS."
You: "What's our scale?"
Interviewer: "50 million users, 500 million notifications per day."
You: "That's ~5,800/sec average. Any spike patterns like campaigns?"
Interviewer: "Yes, weekly campaigns to millions."
You: "What are latency requirements?"
Interviewer: "Transaction notifications under 2 seconds. Marketing can be minutes."
You: "Let me do quick estimation.
500M/day = ~5,800/sec average, maybe 17K/sec peak during campaigns.
Storage for 90-day retention: 500M × 90 × 400 bytes = ~18TB
Cost considerations:
- Push is free
- Email ~$15K/month
- SMS is expensive, use sparingly
Here's my high-level design:
Flow: API → Kafka → Router → Channel Workers → Providers
- API Layer - receives requests, validates, returns immediately
- Kafka - buffers, handles backpressure, enables retry
- Router - fetches preferences, determines channels
- Workers - render templates, call providers
- Callbacks - delivery confirmation via webhooks
Key data stores:
- PostgreSQL for preferences and status
- Redis for caching and rate limiting
- Kafka for message queue
The key insight: separate ingestion from delivery."
Interviewer: "How do you handle reliability?"
You: "Several patterns:
Transactional Outbox: Write notification AND outbox entry in same transaction. Separate process publishes to Kafka. Never lose notifications.
Idempotency: Every notification has idempotency key. Check Redis before processing. Handles duplicates and retries.
At-Least-Once + Idempotency: Workers commit the Kafka offset only after success. Combined with idempotency, this gives effectively exactly-once processing.
Dead Letter Queue: After 3 failures, move to DLQ for review.
Circuit Breakers: 5 failures opens circuit. Wait 30 seconds, test one request."
Interviewer: "What about 10 million notifications at once?"
You: "The campaign problem. Key strategies:
Rate-Limited Ingestion: Campaign service limits to 2,000/sec using token bucket.
Priority Queues: Separate Kafka topics:
- notifications.critical (security)
- notifications.high (transactions)
- notifications.low (marketing)
Critical never affected by campaigns.
Worker Isolation: Dedicated pools for critical vs low priority.
Backpressure: If queue depth exceeds threshold, slow campaigns or switch to degraded mode."
Interviewer: "How would you monitor this?"
You: "Three areas:
Business Metrics: Delivery rate (>98%), end-to-end latency
System Metrics: Queue depth, provider latency, circuit states
Alerts:
- Delivery <95%: page immediately
- Circuit open: warning
- Queue lag >1M: investigate
Grafana dashboards + PagerDuty for on-call."
5.3 Key Phrases
STRONG INTERVIEW PHRASES
Requirements:
"Before I design, let me clarify..."
"What's the latency SLA for different types?"
Design:
"I'm choosing X over Y because..."
"The tradeoff is A vs B. Given requirements, I'd pick..."
Scale:
"At 5,000/sec, we need to think about..."
"The bottleneck will likely be..."
Reliability:
"To ensure we never lose a notification..."
"The failure mode here is... we handle it by..."
Wrap-up:
"To summarize the key decisions..."
"If I had more time, I'd also consider..."
Part VI: Real-World Case Studies
Chapter 6: How the Giants Do It
6.1 Uber
Scale: 100M+ notifications/day
Key Insights:
1. Location-aware delivery (partitioned by city)
2. P0/P1/P2 priority classes with separate infrastructure
3. Cassandra for high-write notification status
4. Multi-tenant (Uber, Eats, Freight share platform)
6.2 Slack
Scale: Billions/day (messages, mentions, DMs)
Key Insights:
1. Mention detection at message write time
2. Presence-aware: online=WebSocket, offline=push after delay
3. Very granular preferences (per-channel mute)
4. Client-side filtering for performance
6.3 WhatsApp
Scale: 100B+ messages/day
Key Insights:
1. Erlang backend (millions of concurrent connections)
2. Store-and-forward (30-day retention for offline)
3. Push only signals "check for messages"
4. Content decrypted on device (E2E encryption)
Summary
Week 6 Complete
WEEK 6: NOTIFICATION PLATFORM
DAY 1: Problem Understanding
├── FERNS framework for requirements
├── Domain deep dive (channels, compliance)
├── Estimation (500M/day, costs)
└── High-level architecture
DAY 2: Core Flow
├── Ingestion API with validation
├── Transactional outbox pattern
├── Routing based on preferences
├── Provider integration
└── Delivery workers with retry
DAY 3: Advanced Features
├── A/B testing templates
├── Batching and digests
├── Scheduled delivery
├── Multi-channel saga
└── Real-time in-app notifications
DAY 4: Scale & Reliability
├── Campaign mode handling
├── Hot user fan-out
├── Circuit breakers
├── Provider failover
├── Edge cases (tokens, bounces, DST)
└── Cost optimization
DAY 5: Operations & Interview
├── Observability (metrics, logs, traces)
├── Alerting and runbooks
├── Deployment strategies
├── Complete interview walkthrough
└── Real-world case studies
Concepts Applied from Previous Weeks
| Week | Concept | Application |
|---|---|---|
| Week 1 | Partitioning | Kafka topics by priority |
| Week 2 | Timeouts | Provider call timeouts |
| Week 2 | Idempotency | Deduplication keys |
| Week 2 | Circuit Breakers | Provider health |
| Week 3 | Transactional Outbox | Reliable publishing |
| Week 3 | DLQ | Failed notification handling |
| Week 4 | Caching | Preferences, templates |
| Week 5 | Saga Pattern | Multi-channel delivery |
Key Takeaways
NOTIFICATION SYSTEM KEY TAKEAWAYS
1. SEPARATION OF CONCERNS
Ingestion → Routing → Delivery
2. PRIORITY ISOLATION
Critical never waits for marketing
3. RELIABILITY PATTERNS
Outbox + Idempotency + DLQ
4. PROVIDER ABSTRACTION
Circuit breakers + Failover
5. USER CONTROL
Preferences + Compliance
6. OBSERVABILITY
Metrics, logs, traces
7. COST AWARENESS
SMS is 100x email
End of Week 6
Next Week: Week 7 — Designing a Search System