Himanshu Kukreja

Week 6 — Day 5: Operations, Monitoring & Interview Mastery

System Design Mastery Series — Practical Application Week


Introduction

We've built a complete notification platform. Today we make it operable — the difference between a system that works and one that works at 3 AM when you're on-call.

Today's Theme: "A system you can't observe is a system you can't trust"

We'll cover:

  • Observability (metrics, logs, traces)
  • Alerting that doesn't cry wolf
  • Runbooks for common incidents
  • Deployment without downtime
  • Complete interview walkthrough

Part I: Observability

Chapter 1: The Three Pillars

1.1 Metrics — What's Happening Now

# observability/metrics.py

from dataclasses import dataclass
from enum import Enum
import time


class MetricType(Enum):
    COUNTER = "counter"      # Only goes up (requests, errors)
    GAUGE = "gauge"          # Can go up/down (queue depth, connections)
    HISTOGRAM = "histogram"  # Distribution (latency percentiles)


@dataclass
class MetricDefinition:
    name: str
    type: MetricType
    description: str
    labels: list[str]


# Core notification metrics
NOTIFICATION_METRICS = [
    # Ingestion
    MetricDefinition(
        "notifications_received_total",
        MetricType.COUNTER,
        "Total notifications received by API",
        ["priority", "type"]
    ),
    MetricDefinition(
        "notifications_validated_total",
        MetricType.COUNTER,
        "Notifications that passed validation",
        ["priority", "type"]
    ),
    MetricDefinition(
        "notifications_rejected_total",
        MetricType.COUNTER,
        "Notifications rejected by validation",
        ["reason"]
    ),
    
    # Queue
    MetricDefinition(
        "kafka_messages_produced_total",
        MetricType.COUNTER,
        "Messages published to Kafka",
        ["topic"]
    ),
    MetricDefinition(
        "kafka_consumer_lag",
        MetricType.GAUGE,
        "Consumer lag per partition",
        ["topic", "partition", "consumer_group"]
    ),
    
    # Routing
    MetricDefinition(
        "notifications_routed_total",
        MetricType.COUNTER,
        "Notifications routed to channels",
        ["channel", "priority"]
    ),
    MetricDefinition(
        "routing_duration_seconds",
        MetricType.HISTOGRAM,
        "Time to route a notification",
        ["priority"]
    ),
    
    # Delivery
    MetricDefinition(
        "notifications_sent_total",
        MetricType.COUNTER,
        "Notifications sent to providers",
        ["channel", "provider", "status"]
    ),
    MetricDefinition(
        "delivery_duration_seconds",
        MetricType.HISTOGRAM,
        "Time to deliver notification",
        ["channel", "provider"]
    ),
    MetricDefinition(
        "delivery_retries_total",
        MetricType.COUNTER,
        "Delivery retry attempts",
        ["channel", "provider", "attempt"]
    ),
    
    # Provider health
    MetricDefinition(
        "provider_requests_total",
        MetricType.COUNTER,
        "Requests to external providers",
        ["provider", "status_code"]
    ),
    MetricDefinition(
        "provider_latency_seconds",
        MetricType.HISTOGRAM,
        "Provider response latency",
        ["provider"]
    ),
    MetricDefinition(
        "circuit_breaker_state",
        MetricType.GAUGE,
        "Circuit breaker state (0=closed, 1=open, 2=half-open)",
        ["provider"]
    ),
    
    # Business metrics
    MetricDefinition(
        "notifications_delivered_total",
        MetricType.COUNTER,
        "Confirmed delivered notifications",
        ["channel", "type"]
    ),
    MetricDefinition(
        "notification_end_to_end_seconds",
        MetricType.HISTOGRAM,
        "Time from receipt to delivery confirmation",
        ["channel", "priority"]
    ),
]


class MetricsService:
    """
    Centralized metrics collection.
    
    Uses Prometheus client library conventions.
    """
    
    def __init__(self, prometheus_client):
        self.prom = prometheus_client
        self._counters = {}
        self._gauges = {}
        self._histograms = {}
        
        self._register_metrics()
    
    def _register_metrics(self):
        """Register all metric definitions."""
        for metric in NOTIFICATION_METRICS:
            if metric.type == MetricType.COUNTER:
                self._counters[metric.name] = self.prom.Counter(
                    metric.name,
                    metric.description,
                    metric.labels
                )
            elif metric.type == MetricType.GAUGE:
                self._gauges[metric.name] = self.prom.Gauge(
                    metric.name,
                    metric.description,
                    metric.labels
                )
            elif metric.type == MetricType.HISTOGRAM:
                self._histograms[metric.name] = self.prom.Histogram(
                    metric.name,
                    metric.description,
                    metric.labels,
                    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
                )
    
    def increment(self, name: str, labels: dict, value: int = 1):
        """Increment a counter."""
        if name in self._counters:
            self._counters[name].labels(**labels).inc(value)
    
    def set_gauge(self, name: str, labels: dict, value: float):
        """Set a gauge value."""
        if name in self._gauges:
            self._gauges[name].labels(**labels).set(value)
    
    def observe(self, name: str, labels: dict, value: float):
        """Observe a histogram value."""
        if name in self._histograms:
            self._histograms[name].labels(**labels).observe(value)
    
    def time(self, name: str, labels: dict):
        """Context manager to time operations."""
        return Timer(self, name, labels)


class Timer:
    """Context manager for timing operations."""
    
    def __init__(self, metrics: MetricsService, name: str, labels: dict):
        self.metrics = metrics
        self.name = name
        self.labels = labels
        self.start = None
    
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    
    def __exit__(self, *args):
        duration = time.perf_counter() - self.start
        self.metrics.observe(self.name, self.labels, duration)

1.2 Key Dashboards

# observability/dashboards.py

"""
Dashboard Definitions for Grafana

Organized by persona:
1. On-call engineer - Is the system healthy?
2. Product manager - Are users getting notifications?
3. Finance - How much are we spending?
"""

DASHBOARDS = {
    "notification_overview": {
        "title": "Notification System Overview",
        "refresh": "10s",
        "panels": [
            {
                "title": "Notifications/sec by Priority",
                "type": "graph",
                "query": 'sum(rate(notifications_received_total[1m])) by (priority)'
            },
            {
                "title": "Delivery Success Rate",
                "type": "stat",
                "query": '''
                    sum(rate(notifications_sent_total{status="success"}[5m])) /
                    sum(rate(notifications_sent_total[5m])) * 100
                '''
            },
            {
                "title": "End-to-End Latency (p99)",
                "type": "graph",
                "query": 'histogram_quantile(0.99, sum(rate(notification_end_to_end_seconds_bucket[5m])) by (le, channel))'
            },
            {
                "title": "Queue Depth",
                "type": "graph",
                "query": 'sum(kafka_consumer_lag) by (topic)'
            },
        ]
    },
    
    "provider_health": {
        "title": "Provider Health",
        "panels": [
            {
                "title": "Provider Success Rate",
                "type": "graph",
                "query": '''
                    sum(rate(provider_requests_total{status_code=~"2.."}[5m])) by (provider) /
                    sum(rate(provider_requests_total[5m])) by (provider) * 100
                '''
            },
            {
                "title": "Provider Latency (p95)",
                "type": "graph",
                "query": 'histogram_quantile(0.95, sum(rate(provider_latency_seconds_bucket[5m])) by (le, provider))'
            },
            {
                "title": "Circuit Breaker Status",
                "type": "stat",
                "query": 'circuit_breaker_state',
                "mappings": {"0": "Closed", "1": "Open", "2": "Half-Open"}
            },
        ]
    },
    
    "cost_tracking": {
        "title": "Notification Costs",
        "panels": [
            {
                "title": "Daily SMS Cost",
                "type": "stat",
                "query": 'sum(increase(notifications_sent_total{channel="sms", status="success"}[24h])) * 0.01'
            },
            {
                "title": "Daily Email Cost",
                "type": "stat", 
                "query": 'sum(increase(notifications_sent_total{channel="email", status="success"}[24h])) * 0.0001'
            },
        ]
    },
}
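The p99 and p95 panels above lean on Prometheus's `histogram_quantile`, which interpolates within cumulative buckets. A stdlib sketch of the same calculation helps demystify those queries (the bucket data below is illustrative, not from a real system):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    the last entry playing the role of Prometheus's +Inf bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket, as Prometheus does
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


# 100 requests: 90 finished under 0.1s, 99 under 0.5s, all under 1s
buckets = [(0.1, 90), (0.5, 99), (1.0, 100)]
p99 = histogram_quantile(0.99, buckets)  # interpolates to 0.5s
```

This is why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.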

1.3 Structured Logging

# observability/logging.py

import json
import logging
from datetime import datetime
from typing import Optional


class StructuredLogger:
    """
    Structured logging for notification events.
    
    All logs are JSON for easy parsing by log aggregators.
    """
    
    def __init__(self, service_name: str):
        self.service = service_name
        self.logger = logging.getLogger(service_name)
    
    def _log(
        self,
        level: str,
        message: str,
        notification_id: Optional[str] = None,
        user_id: Optional[str] = None,
        channel: Optional[str] = None,
        provider: Optional[str] = None,
        trace_id: Optional[str] = None,
        **extra
    ):
        """Create structured log entry."""
        
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": level,
            "service": self.service,
            "message": message,
        }
        
        if notification_id:
            entry["notification_id"] = notification_id
        if user_id:
            entry["user_id"] = user_id
        if channel:
            entry["channel"] = channel
        if provider:
            entry["provider"] = provider
        if trace_id:
            entry["trace_id"] = trace_id
        
        entry.update(extra)
        
        log_func = getattr(self.logger, level.lower())
        log_func(json.dumps(entry))
    
    def notification_received(
        self,
        notification_id: str,
        user_id: str,
        notification_type: str,
        priority: str,
        trace_id: str
    ):
        self._log(
            "INFO",
            "Notification received",
            notification_id=notification_id,
            user_id=user_id,
            trace_id=trace_id,
            notification_type=notification_type,
            priority=priority
        )
    
    def delivery_attempted(
        self,
        notification_id: str,
        channel: str,
        provider: str,
        success: bool,
        duration_ms: float,
        trace_id: str,
        error: Optional[str] = None
    ):
        level = "INFO" if success else "WARNING"
        self._log(
            level,
            "Delivery attempted",
            notification_id=notification_id,
            channel=channel,
            provider=provider,
            trace_id=trace_id,
            success=success,
            duration_ms=duration_ms,
            error=error
        )
    
    def circuit_breaker_state_change(
        self,
        provider: str,
        old_state: str,
        new_state: str,
        failure_count: int
    ):
        self._log(
            "WARNING",
            "Circuit breaker state changed",
            provider=provider,
            old_state=old_state,
            new_state=new_state,
            failure_count=failure_count
        )

1.4 Distributed Tracing

# observability/tracing.py

"""
Distributed tracing for notification flow.

Tracks a notification through:
API → Router → Worker → Provider → Callback

Example trace:

trace_id: abc-123

├─ [notification-api] receive_notification (5ms)
│  └─ tags: notification_id=xyz, user_id=123, priority=high
│
├─ [notification-api] validate (10ms)
│
├─ [notification-api] write_to_outbox (15ms)
│
├─ [notification-router] route (25ms)
│  ├─ [notification-router] fetch_preferences (8ms)
│  └─ [notification-router] publish_to_channels (15ms)
│
├─ [push-worker] deliver (150ms)
│  ├─ [push-worker] fetch_device_tokens (20ms)
│  └─ [push-worker] send_to_fcm (125ms)
│      └─ tags: provider=fcm, status=success
│
└─ [email-worker] deliver (500ms)
   └─ [email-worker] send_to_sendgrid (490ms)
       └─ tags: provider=sendgrid, status=success
"""

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
import uuid


@dataclass
class Span:
    """A single span in a trace."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation_name: str
    service_name: str
    start_time: datetime
    end_time: Optional[datetime] = None
    tags: dict = field(default_factory=dict)
    
    @property
    def duration_ms(self) -> float:
        if not self.end_time:
            return 0
        return (self.end_time - self.start_time).total_seconds() * 1000


class TracingService:
    """Distributed tracing service."""
    
    def __init__(self, exporter):
        self.exporter = exporter
    
    def start_trace(self, operation: str, service: str) -> Span:
        """Start a new trace."""
        return Span(
            trace_id=str(uuid.uuid4()),
            span_id=str(uuid.uuid4()),
            parent_span_id=None,
            operation_name=operation,
            service_name=service,
            start_time=datetime.utcnow(),
            tags={}
        )
    
    def start_span(
        self,
        trace_id: str,
        parent_span_id: str,
        operation: str,
        service: str
    ) -> Span:
        """Start a child span."""
        return Span(
            trace_id=trace_id,
            span_id=str(uuid.uuid4()),
            parent_span_id=parent_span_id,
            operation_name=operation,
            service_name=service,
            start_time=datetime.utcnow(),
            tags={}
        )
    
    def finish_span(self, span: Span):
        """Finish and export span."""
        span.end_time = datetime.utcnow()
        self.exporter.export(span)

Part II: Alerting

Chapter 2: Alerts That Matter

2.1 Alert Definitions

# observability/alerts.py

"""
Alert Definitions

Philosophy:
- Only alert on actionable issues
- Prefer symptoms over causes
- Include runbook links
- Avoid alert fatigue
"""

ALERTS = {
    "critical": [
        {
            "name": "NotificationDeliveryRateDropped",
            "description": "Notification delivery success rate dropped below 95%",
            "query": '''
                sum(rate(notifications_sent_total{status="success"}[5m])) /
                sum(rate(notifications_sent_total[5m])) < 0.95
            ''',
            "for": "5m",
            "severity": "critical",
            "runbook": "https://runbooks.internal/notifications/low-delivery-rate",
        },
        {
            "name": "AllProvidersDown",
            "description": "All providers for a channel are unavailable",
            "query": '''
                sum(circuit_breaker_state{channel="push"}) == count(circuit_breaker_state{channel="push"})
            ''',
            "for": "2m",
            "severity": "critical",
            "runbook": "https://runbooks.internal/notifications/all-providers-down",
        },
        {
            "name": "QueueBacklogCritical",
            "description": "Kafka consumer lag is critically high",
            "query": 'sum(kafka_consumer_lag{topic=~"notifications.*"}) > 1000000',
            "for": "10m",
            "severity": "critical",
            "runbook": "https://runbooks.internal/notifications/queue-backlog",
        },
    ],
    
    "warning": [
        {
            "name": "ProviderCircuitOpen",
            "description": "A provider circuit breaker is open",
            "query": 'circuit_breaker_state > 0',
            "for": "5m",
            "severity": "warning",
            "runbook": "https://runbooks.internal/notifications/circuit-open",
        },
        {
            "name": "HighProviderLatency",
            "description": "Provider latency is elevated",
            "query": '''
                histogram_quantile(0.95, sum(rate(provider_latency_seconds_bucket[5m])) by (le, provider)) > 2
            ''',
            "for": "10m",
            "severity": "warning",
        },
        {
            "name": "DLQGrowing",
            "description": "Dead letter queue is accumulating messages",
            "query": 'increase(kafka_consumer_lag{topic=~"dlq.*"}[1h]) > 1000',
            "for": "30m",
            "severity": "warning",
            "runbook": "https://runbooks.internal/notifications/dlq-growing",
        },
    ],
}
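The `for` clause is what keeps these alerts from crying wolf: the condition must hold continuously before anything pages. A sketch of that pending-to-firing state machine (the class and time values are illustrative, not Prometheus internals):

```python
class Alert:
    """Fires only after the condition has been true for `for_seconds` straight."""

    def __init__(self, for_seconds: float):
        self.for_seconds = for_seconds
        self.pending_since = None

    def evaluate(self, condition_true: bool, now: float) -> str:
        if not condition_true:
            self.pending_since = None  # any healthy sample resets the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


alert = Alert(for_seconds=300)  # "for: 5m"

assert alert.evaluate(True, now=0) == "pending"
assert alert.evaluate(True, now=200) == "pending"
assert alert.evaluate(False, now=250) == "inactive"  # blip cleared: timer resets
assert alert.evaluate(True, now=260) == "pending"
assert alert.evaluate(True, now=560) == "firing"     # held for a full 5 minutes
```

A transient 30-second dip in delivery rate never pages anyone; a sustained one does.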

Part III: Runbooks

Chapter 3: Operational Runbooks

3.1 Runbook: Low Delivery Rate

# Runbook: Notification Delivery Rate Dropped

## Alert
NotificationDeliveryRateDropped - Delivery success rate < 95%

## Impact
Users are not receiving notifications. Critical for transactional
and security notifications.

## Quick Diagnosis (< 2 minutes)

1. Check which channel is affected:
   ```promql
   sum(rate(notifications_sent_total{status="success"}[5m])) by (channel) /
   sum(rate(notifications_sent_total[5m])) by (channel)
   ```

2. Check provider status pages.

3. Check circuit breaker states:
   ```promql
   circuit_breaker_state
   ```

Common Causes and Actions

Provider Outage

Symptoms: Circuit breaker open, provider errors in logs

Action: Failover to backup provider

kubectl set env deployment/push-worker PRIMARY_PROVIDER=apns

Invalid Tokens (Push)

Symptoms: High "UNREGISTERED" errors

Action: Run token cleanup

kubectl create job --from=cronjob/token-cleanup token-cleanup-manual

Rate Limiting

Symptoms: 429 errors from provider

Action: Reduce send rate

kubectl set env deployment/push-worker SEND_RATE_LIMIT=500

Escalation

If not resolved in 15 minutes, escalate to secondary on-call.


3.2 Runbook: Queue Backlog
# Runbook: Kafka Queue Backlog

## Alert
QueueBacklogCritical - Consumer lag > 1,000,000 messages

## Impact
Notifications are delayed. User experience degraded.

## Quick Diagnosis

1. Identify affected topics:
   ```promql
   sum(kafka_consumer_lag) by (topic)
   ```

2. Check consumer health:

   kubectl get pods -l app=notification-worker

Common Causes and Actions

Marketing Campaign (Expected)

  • High lag on notifications.low topic only
  • Verify critical topics are not affected
  • Monitor but don't act

Worker Crash Loop

# Check logs
kubectl logs -l app=notification-worker --tail=100

# Roll back if recent deployment
kubectl rollout undo deployment/notification-worker

Scale Workers

kubectl scale deployment/notification-worker --replicas=20

Last Resort: Skip Expired Messages

For non-critical topics only:

kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BROKERS \
  --group notification-workers \
  --topic notifications.low \
  --reset-offsets --to-latest --execute

3.3 Runbook: Circuit Breaker Open
# Runbook: Provider Circuit Breaker Open

## Alert
ProviderCircuitOpen - Circuit breaker in OPEN state

## Impact
Notifications for affected channel failing or using fallback.

## Quick Diagnosis

1. Identify affected provider:
   ```promql
   circuit_breaker_state == 1
   ```

2. Check provider status page

3. Check recent failures in logs

Actions by Provider

FCM

  • Check Firebase status
  • Verify credentials: kubectl get secret fcm-credentials
  • Test manually with curl

SendGrid

  • Check status page
  • Verify API key
  • Check IP reputation

Twilio

  • Check status page
  • Verify account balance (stops if $0!)
  • Check number-specific issues

Force Recovery

If provider is healthy but circuit still open:

kubectl exec -it redis-0 -- redis-cli DEL circuit:fcm:state
kubectl rollout restart deployment/push-worker

Part IV: Deployment

Chapter 4: Safe Deployments

4.1 Deployment Strategy
# deployment/strategy.py

"""
Deployment Strategy

Principles:
1. Never deploy during peak hours
2. Always use canary deployment
3. Monitor metrics before full rollout
4. Have rollback plan ready
"""

DEPLOYMENT_CONFIG = {
    "api": {
        "strategy": "canary",
        "canary_percentage": 10,
        "canary_duration_minutes": 15,
        "success_criteria": {
            "error_rate": "< 1%",
            "latency_p99": "< 500ms",
        },
        "rollback_on_failure": True,
    },
    
    "workers": {
        "strategy": "rolling",
        "max_unavailable": "25%",
        "max_surge": "25%",
        "min_ready_seconds": 30,
    },
}


DEPLOYMENT_CHECKLIST = """
## Pre-Deployment Checklist

### Timing
- [ ] Not during peak hours (9-11 AM, 2-4 PM)
- [ ] Not during known campaigns
- [ ] Not Friday afternoon

### Preparation
- [ ] Changelog reviewed
- [ ] Database migrations tested
- [ ] Rollback procedure documented

### Deployment
- [ ] Canary deployed
- [ ] Canary metrics healthy (wait 15 min)
- [ ] Full rollout
- [ ] Verify metrics stable (wait 30 min)

### Post-Deployment
- [ ] Smoke tests passing
- [ ] No increase in error rate
- [ ] DLQ not growing
"""

4.2 Database Migration Safety

# deployment/migrations.py

"""
Safe Database Migration Patterns

Rules:
1. Never remove columns immediately
2. Never rename columns directly
3. Always backward compatible
4. Always have rollback script
"""

MIGRATION_PATTERNS = {
    "add_column": {
        "safe": True,
        "pattern": """
            -- Safe: Adding nullable column
            ALTER TABLE notifications 
            ADD COLUMN new_field VARCHAR(255);
        """,
    },
    
    "remove_column": {
        "safe": False,
        "pattern": """
            -- SAFE: Three-phase approach
            -- Phase 1: Stop writing to column
            -- Phase 2: Stop reading from column  
            -- Phase 3: Drop column
            ALTER TABLE notifications DROP COLUMN old_field;
        """,
    },
    
    "add_index": {
        "safe": "depends",
        "pattern": """
            -- UNSAFE on large tables (locks table)
            CREATE INDEX idx_notifications_user ON notifications(user_id);
            
            -- SAFE: Concurrent (PostgreSQL)
            CREATE INDEX CONCURRENTLY idx_notifications_user 
            ON notifications(user_id);
        """,
    },
}

Part V: Interview Mastery

Chapter 5: Complete Interview Walkthrough

5.1 The 45-Minute Framework

NOTIFICATION SYSTEM INTERVIEW FRAMEWORK

Time Allocation:
├── 0-5 min:   Requirements clarification
├── 5-10 min:  Back-of-envelope estimation
├── 10-25 min: High-level design
├── 25-40 min: Deep dives (2-3 topics)
└── 40-45 min: Wrap-up and questions

Key Topics to Cover:
1. Multi-channel delivery
2. Reliability (retries, idempotency, DLQ)
3. Scale (campaigns, hot users)
4. User preferences
5. Observability (mention briefly)

5.2 Sample Interview Dialogue

Interviewer: "Design a notification system for a fintech app."

You: "Before I start, let me clarify requirements.

What channels do we need?"

Interviewer: "Push, email, and SMS."

You: "What's our scale?"

Interviewer: "50 million users, 500 million notifications per day."

You: "That's ~5,800/sec average. Any spike patterns like campaigns?"

Interviewer: "Yes, weekly campaigns to millions."

You: "What are latency requirements?"

Interviewer: "Transaction notifications under 2 seconds. Marketing can be minutes."


You: "Let me do quick estimation.

500M/day = ~5,800/sec average, maybe 17K/sec peak during campaigns.

Storage for 90-day retention: 500M × 90 × 400 bytes = ~18TB

Cost considerations:

  • Push is free
  • Email ~$15K/month
  • SMS is expensive, use sparingly

Here's my high-level design:

Flow: API → Kafka → Router → Channel Workers → Providers

  1. API Layer - receives requests, validates, returns immediately
  2. Kafka - buffers, handles backpressure, enables retry
  3. Router - fetches preferences, determines channels
  4. Workers - render templates, call providers
  5. Callbacks - delivery confirmation via webhooks

Key data stores:

  • PostgreSQL for preferences and status
  • Redis for caching and rate limiting
  • Kafka for message queue

The key insight: separate ingestion from delivery."


Interviewer: "How do you handle reliability?"

You: "Several patterns:

Transactional Outbox: Write notification AND outbox entry in same transaction. Separate process publishes to Kafka. Never lose notifications.

Idempotency: Every notification has idempotency key. Check Redis before processing. Handles duplicates and retries.

At-Least-Once + Idempotency: Workers commit Kafka offset only after success. Combined with idempotency = effective exactly-once.

Dead Letter Queue: After 3 failures, move to DLQ for review.

Circuit Breakers: 5 failures opens circuit. Wait 30 seconds, test one request."
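The circuit-breaker policy in this answer (5 failures open the circuit, probe after 30 seconds) can be sketched as a small state machine. The thresholds come from the dialogue; the class itself is an illustrative sketch, with an injectable clock so the cooldown is testable:

```python
import time


class CircuitBreaker:
    """CLOSED -> OPEN after N failures; OPEN -> HALF_OPEN after a cooldown;
    a successful probe in HALF_OPEN closes the circuit again."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half_open"  # let exactly one probe through
                return True
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


# Simulated clock so the cooldown is testable without sleeping
now = [0.0]
cb = CircuitBreaker(clock=lambda: now[0])

for _ in range(5):
    cb.record_failure()
assert cb.state == "open" and not cb.allow_request()  # fail fast, no provider calls

now[0] += 30  # cooldown elapses: one probe is allowed
assert cb.allow_request() and cb.state == "half_open"
cb.record_success()
assert cb.state == "closed"
```

This is also the state the `circuit_breaker_state` gauge exports (0=closed, 1=open, 2=half-open), tying the pattern back to the monitoring section.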


Interviewer: "What about 10 million notifications at once?"

You: "The campaign problem. Key strategies:

Rate-Limited Ingestion: Campaign service limits to 2,000/sec using token bucket.

Priority Queues: Separate Kafka topics:

  • notifications.critical (security)
  • notifications.high (transactions)
  • notifications.low (marketing)

Critical never affected by campaigns.

Worker Isolation: Dedicated pools for critical vs low priority.

Backpressure: If queue depth exceeds threshold, slow campaigns or switch to degraded mode."
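The token-bucket limiter behind rate-limited ingestion fits in a few lines. The 2,000/sec rate comes from the dialogue; the class is an illustrative sketch with an injectable clock:

```python
class TokenBucket:
    """Refills at `rate` tokens/sec up to `capacity`; each send spends one token."""

    def __init__(self, rate: float, capacity: float, clock):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the campaign sender backs off and retries


now = [0.0]
bucket = TokenBucket(rate=2000, capacity=2000, clock=lambda: now[0])

# A full burst of 2,000 sends passes immediately...
assert all(bucket.try_acquire() for _ in range(2000))
# ...then the 2,001st is throttled until tokens refill
assert not bucket.try_acquire()
now[0] += 0.001  # 1ms later: 2 tokens have refilled
assert bucket.try_acquire()
```

The capacity controls burst size and the rate controls sustained throughput, which is exactly the knob a campaign service needs: absorb a spike, then smooth it out.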


Interviewer: "How would you monitor this?"

You: "Three areas:

Business Metrics: Delivery rate (>98%), end-to-end latency

System Metrics: Queue depth, provider latency, circuit states

Alerts:

  • Delivery <95%: page immediately
  • Circuit open: warning
  • Queue lag >1M: investigate

Grafana dashboards + PagerDuty for on-call."

5.3 Key Phrases

STRONG INTERVIEW PHRASES

Requirements:
"Before I design, let me clarify..."
"What's the latency SLA for different types?"

Design:
"I'm choosing X over Y because..."
"The tradeoff is A vs B. Given requirements, I'd pick..."

Scale:
"At 5,000/sec, we need to think about..."
"The bottleneck will likely be..."

Reliability:
"To ensure we never lose a notification..."
"The failure mode here is... we handle it by..."

Wrap-up:
"To summarize the key decisions..."
"If I had more time, I'd also consider..."

Part VI: Real-World Case Studies

Chapter 6: How the Giants Do It

6.1 Uber

Scale: 100M+ notifications/day

Key Insights:
1. Location-aware delivery (partitioned by city)
2. P0/P1/P2 priority classes with separate infrastructure
3. Cassandra for high-write notification status
4. Multi-tenant (Uber, Eats, Freight share platform)

6.2 Slack

Scale: Billions/day (messages, mentions, DMs)

Key Insights:
1. Mention detection at message write time
2. Presence-aware: online=WebSocket, offline=push after delay
3. Very granular preferences (per-channel mute)
4. Client-side filtering for performance

6.3 WhatsApp

Scale: 100B+ messages/day

Key Insights:
1. Erlang backend (millions of concurrent connections)
2. Store-and-forward (30-day retention for offline)
3. Push only signals "check for messages"
4. Content decrypted on device (E2E encryption)

Summary

Week 6 Complete

WEEK 6: NOTIFICATION PLATFORM

DAY 1: Problem Understanding
├── FERNS framework for requirements
├── Domain deep dive (channels, compliance)
├── Estimation (500M/day, costs)
└── High-level architecture

DAY 2: Core Flow
├── Ingestion API with validation
├── Transactional outbox pattern
├── Routing based on preferences
├── Provider integration
└── Delivery workers with retry

DAY 3: Advanced Features
├── A/B testing templates
├── Batching and digests
├── Scheduled delivery
├── Multi-channel saga
└── Real-time in-app notifications

DAY 4: Scale & Reliability
├── Campaign mode handling
├── Hot user fan-out
├── Circuit breakers
├── Provider failover
├── Edge cases (tokens, bounces, DST)
└── Cost optimization

DAY 5: Operations & Interview
├── Observability (metrics, logs, traces)
├── Alerting and runbooks
├── Deployment strategies
├── Complete interview walkthrough
└── Real-world case studies

Concepts Applied from Previous Weeks

| Week | Concept | Application |
|------|---------|-------------|
| Week 1 | Partitioning | Kafka topics by priority |
| Week 2 | Timeouts | Provider call timeouts |
| Week 2 | Idempotency | Deduplication keys |
| Week 2 | Circuit Breakers | Provider health |
| Week 3 | Transactional Outbox | Reliable publishing |
| Week 3 | DLQ | Failed notification handling |
| Week 4 | Caching | Preferences, templates |
| Week 5 | Saga Pattern | Multi-channel delivery |

Key Takeaways

NOTIFICATION SYSTEM KEY TAKEAWAYS

1. SEPARATION OF CONCERNS
   Ingestion → Routing → Delivery

2. PRIORITY ISOLATION
   Critical never waits for marketing

3. RELIABILITY PATTERNS
   Outbox + Idempotency + DLQ

4. PROVIDER ABSTRACTION
   Circuit breakers + Failover

5. USER CONTROL
   Preferences + Compliance

6. OBSERVABILITY
   Metrics, logs, traces

7. COST AWARENESS
   SMS is 100x email

End of Week 6

Next Week: Week 7 — Designing a Search System