Himanshu Kukreja

Week 10 — Day 2: Observability

System Design Mastery Series — Production Readiness and Operational Excellence


Preface

Yesterday, you learned to define what "healthy" means with SLOs.

But how do you actually KNOW if you're meeting them?

THE OBSERVABILITY PROBLEM

3:47 AM. Your phone buzzes.

ALERT: SLO violation - API availability dropped to 98.2%

You open your laptop, bleary-eyed. Questions flood your mind:

├── Which endpoint is failing?
├── Which users are affected?
├── When did it start?
├── What changed?
├── Is it getting worse or recovering?
├── What's the root cause?
└── How do I fix it?

Without observability, you're blind.
You're guessing.
You're hoping.

WITH observability:
├── Dashboard shows: /api/payments endpoint, 5xx errors
├── Logs show: "Connection refused to payment-service-3"
├── Traces show: Requests to payment-service-3 timing out
├── Metrics show: payment-service-3 CPU at 100% since 3:42 AM
└── Recent deploys show: Config change at 3:40 AM

Root cause found in 5 minutes. Rollback config. Done.

This is the power of observability.

Today, we learn to see inside our systems.


Part I: Foundations

Chapter 1: What Is Observability?

1.1 Observability vs Monitoring

These terms are often confused. Let's clarify:

MONITORING VS OBSERVABILITY

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  MONITORING                                                            │
│  ──────────                                                            │
│  "Watching for known problems"                                         │
│                                                                        │
│  ├── Pre-defined dashboards                                            │
│  ├── Pre-defined alerts                                                │
│  ├── Checks: "Is CPU > 80%? Is error rate > 1%?"                       │
│  ├── Good for: Known failure modes                                     │
│  └── Limitation: Can't find unknown problems                           │
│                                                                        │
│  Example: Alert when database connections > 90% of pool                │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  OBSERVABILITY                                                         │
│  ─────────────                                                         │
│  "Understanding system behavior from its outputs"                      │
│                                                                        │
│  ├── Ask arbitrary questions                                           │
│  ├── Explore unknown problems                                          │
│  ├── Questions: "Why is this user's request slow?"                     │
│  ├── Good for: Novel failure modes, debugging                          │
│  └── Enables: Answering questions you didn't anticipate                │
│                                                                        │
│  Example: "Show me all requests from user X in the last hour,          │
│            grouped by endpoint, colored by latency"                    │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

KEY INSIGHT:
├── Monitoring tells you WHEN something is wrong
├── Observability helps you understand WHY
└── You need both

1.2 The Three Pillars

THE THREE PILLARS OF OBSERVABILITY

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│                         OBSERVABILITY                                  │
│                              │                                         │
│              ┌───────────────┼───────────────┐                         │
│              │               │               │                         │
│              ▼               ▼               ▼                         │
│        ┌──────────┐   ┌──────────┐   ┌──────────┐                      │
│        │ METRICS  │   │   LOGS   │   │  TRACES  │                      │
│        └────┬─────┘   └────┬─────┘   └────┬─────┘                      │
│             │              │              │                            │
│        Aggregated     Discrete      Request                            │
│        Numbers        Events        Journeys                           │
│             │              │              │                            │
│        "What is      "What         "What path                          │
│         happening     happened       did this                          │
│         overall?"     exactly?"      request take?"                    │
│             │              │              │                            │
│        Examples:     Examples:      Examples:                          │
│        - CPU: 72%    - Error at     - API → Auth                       │
│        - Req/s: 450    3:42:17       → DB → Cache                      │
│        - p99: 180ms  - User X       - 47ms → 12ms                      │
│        - Errors: 12    logged in      → 3ms → 2ms                      │
│                                                                        │
│        AGGREGATED    INDIVIDUAL     CORRELATED                         │
│        EFFICIENT     DETAILED       CONTEXTUAL                         │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

How they work together:

1. METRICS alert you: "Error rate spiked to 5%"
2. LOGS tell you what: "NullPointerException in PaymentService"
3. TRACES show you where: "Failure happens after auth, before DB"

Each pillar has strengths and weaknesses:

│ Pillar  │ Cardinality │ Cost    │ Best For                        │
│─────────│─────────────│─────────│─────────────────────────────────│
│ Metrics │ Low         │ Cheap   │ Alerting, trends, dashboards    │
│ Logs    │ High        │ Medium  │ Debugging, audit, compliance    │
│ Traces  │ Medium      │ High    │ Request flow, latency breakdown │
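
To make the three pillars concrete, here is a minimal sketch of one request handler emitting all three from the same code path, using prometheus_client, the standard logging module, and the OpenTelemetry API. The service and metric names are illustrative, not taken from a specific system.

# pillars/one_request.py

import logging
from prometheus_client import Counter, Histogram
from opentelemetry import trace

REQUESTS = Counter('checkout_requests_total', 'Checkout requests', ['status'])
LATENCY = Histogram('checkout_duration_seconds', 'Checkout latency')

logger = logging.getLogger("checkout")
tracer = trace.get_tracer("checkout-service")

def checkout(order_id: str) -> None:
    with LATENCY.time():                                        # metric: duration
        with tracer.start_as_current_span("checkout") as span:  # trace: one span
            span.set_attribute("order.id", order_id)
            try:
                # ... business logic ...
                REQUESTS.labels(status="ok").inc()               # metric: count
                logger.info("checkout completed",
                            extra={"order_id": order_id})        # log: event
            except Exception:
                REQUESTS.labels(status="error").inc()
                logger.exception("checkout failed")              # log: error + traceback
                raise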

1.3 Correlation: The Glue

The real power comes from connecting the three pillars:

CORRELATION EXAMPLE

User reports: "My payment failed"

Step 1: Find the request
├── User ID: user-12345
├── Time: ~10 minutes ago
└── Search logs: user_id="user-12345" AND action="payment"

Step 2: Get trace ID from log
├── Log entry found: {"trace_id": "abc-123", "error": "timeout"}
└── Open trace abc-123 in tracing UI

Step 3: Trace shows the journey
├── api-gateway: 2ms ✓
├── payment-service: 45ms ✓
├── fraud-check: 3ms ✓
├── bank-api: TIMEOUT after 30s ✗
└── Root cause: Bank API slow

Step 4: Check metrics
├── bank_api_latency_p99 shows spike at that time
├── Other users affected too
└── 47 timeouts in last 10 minutes

Step 5: Resolution
├── Bank API had an incident
├── Affected 47 payments
├── Auto-retry recovered 38
└── 9 need manual intervention

THIS IS OBSERVABILITY IN ACTION.
Without correlation, this takes hours.
With correlation, it takes minutes.
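
Step 2 above is pure mechanics once logs carry a trace ID. A minimal sketch: pull the trace_id out of a structured log line and turn it into a tracing-UI link. The log entry and the JAEGER_URL base are hypothetical placeholders for whatever backend you run.

# correlation/log_to_trace.py

import json

JAEGER_URL = "https://tracing.example.internal/trace"  # hypothetical tracing UI

def trace_link_from_log(raw_log_line: str) -> str:
    """Extract trace_id from a JSON log line and build a tracing-UI link."""
    entry = json.loads(raw_log_line)
    return f"{JAEGER_URL}/{entry['trace_id']}"

# The log entry found in step 2:
line = '{"trace_id": "abc-123", "error": "timeout", "user_id": "user-12345"}'
print(trace_link_from_log(line))  # -> https://tracing.example.internal/trace/abc-123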

Chapter 2: Metrics Deep Dive

2.1 Metric Types

THE FOUR METRIC TYPES

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  1. COUNTER                                                            │
│     ─────────                                                          │
│     A value that only goes UP (or resets to zero)                      │
│                                                                        │
│     Examples:                                                          │
│     ├── requests_total                                                 │
│     ├── errors_total                                                   │
│     ├── bytes_sent_total                                               │
│     └── orders_processed_total                                         │
│                                                                        │
│     Use for: Counting events                                           │
│     Query: rate(requests_total[5m]) → requests per second              │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  2. GAUGE                                                              │
│     ───────                                                            │
│     A value that can go UP or DOWN                                     │
│                                                                        │
│     Examples:                                                          │
│     ├── temperature_celsius                                            │
│     ├── queue_depth                                                    │
│     ├── active_connections                                             │
│     └── memory_used_bytes                                              │
│                                                                        │
│     Use for: Current state                                             │
│     Query: avg_over_time(queue_depth[5m]) → average queue depth        │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  3. HISTOGRAM                                                          │
│     ──────────                                                         │
│     Samples observations and counts them in buckets                    │
│                                                                        │
│     Examples:                                                          │
│     ├── request_duration_seconds                                       │
│     ├── response_size_bytes                                            │
│     └── batch_job_duration_seconds                                     │
│                                                                        │
│     Buckets: [0.01, 0.05, 0.1, 0.5, 1, 5, 10] seconds                  │
│                                                                        │
│     Use for: Latencies, sizes, durations                               │
│     Query: histogram_quantile(0.99, rate(duration_bucket[5m]))         │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  4. SUMMARY                                                            │
│     ────────                                                           │
│     Pre-calculated quantiles (less flexible than histogram)            │
│                                                                        │
│     Examples:                                                          │
│     ├── request_duration{quantile="0.99"}                              │
│     └── response_size{quantile="0.5"}                                  │
│                                                                        │
│     Use for: When you know exactly which quantiles you need            │
│     Note: Can't aggregate across instances (prefer histograms)         │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
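
A quick sketch of how each type is updated with prometheus_client. The metric names and bucket boundaries here are illustrative; only the update methods (inc, set/dec, observe) matter.

# metrics/types_demo.py

from prometheus_client import Counter, Gauge, Histogram, Summary

requests_total = Counter('requests_total', 'Total requests handled')
queue_depth = Gauge('queue_depth', 'Items currently queued')
request_duration_seconds = Histogram(
    'request_duration_seconds', 'Request latency in seconds',
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]
)
payload_bytes = Summary('payload_bytes', 'Observed payload sizes')

requests_total.inc()                      # counter: only ever increments
queue_depth.set(42)                       # gauge: set to current value
queue_depth.dec(3)                        # ...or move it up and down
request_duration_seconds.observe(0.137)   # histogram: record one observation
payload_bytes.observe(2048)               # summary: record one observation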

2.2 Metric Naming and Labels

# metrics/naming.py

"""
Metric naming conventions and best practices.

Good metric names are:
- Descriptive: What does it measure?
- Consistent: Same format across services
- Unit-aware: Include the unit in the name
"""

from prometheus_client import Counter, Gauge, Histogram

# =============================================================================
# NAMING CONVENTION
# =============================================================================
#
# Format: <namespace>_<subsystem>_<name>_<unit>
#
# Examples:
#   http_requests_total          (counter, no unit needed for counts)
#   http_request_duration_seconds (histogram with unit)
#   process_memory_bytes         (gauge with unit)
#   api_active_connections       (gauge, no unit for count)
#
# Labels should be:
#   - Low cardinality (not user_id, not request_id)
#   - Meaningful dimensions (method, status, endpoint)
#   - Consistent across metrics

# =============================================================================
# GOOD METRIC DEFINITIONS
# =============================================================================

# Counter for requests
http_requests_total = Counter(
    name='http_requests_total',
    documentation='Total number of HTTP requests',
    labelnames=['method', 'endpoint', 'status']
)

# Histogram for latency
http_request_duration_seconds = Histogram(
    name='http_request_duration_seconds',
    documentation='HTTP request latency in seconds',
    labelnames=['method', 'endpoint'],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

# Gauge for current state
db_connections_active = Gauge(
    name='db_connections_active',
    documentation='Number of active database connections',
    labelnames=['pool_name']
)

# Business metrics
orders_processed_total = Counter(
    name='orders_processed_total',
    documentation='Total orders processed',
    labelnames=['status', 'payment_method']
)

order_value_dollars = Histogram(
    name='order_value_dollars',
    documentation='Order value in dollars',
    labelnames=['currency'],
    buckets=[10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
)


# =============================================================================
# BAD METRIC DEFINITIONS (DON'T DO THIS)
# =============================================================================

# ❌ BAD: High cardinality labels
# http_requests_total with labels: user_id, request_id, session_id
# This creates millions of time series!

# ❌ BAD: Inconsistent naming
# requestCount (camelCase)
# request-latency (dashes)
# Request_Duration_Ms (mixed case, abbreviated unit)

# ❌ BAD: Missing unit
# http_request_duration (seconds? milliseconds? who knows?)

# ❌ BAD: Too generic
# count (count of what?)
# duration (duration of what?)


# =============================================================================
# LABEL CARDINALITY GUIDELINES
# =============================================================================

"""
SAFE LABELS (bounded cardinality):
├── method: GET, POST, PUT, DELETE (~10 values)
├── status: 200, 201, 400, 404, 500 (~20 values)
├── endpoint: /api/users, /api/orders (~100 values)
├── service: user-service, payment-service (~50 values)
├── region: us-east, eu-west (~10 values)
└── tier: free, pro, enterprise (3 values)

DANGEROUS LABELS (unbounded cardinality):
├── user_id: Millions of users = millions of time series
├── request_id: Unique per request = infinite time series
├── email: Unbounded
├── ip_address: Could be millions
└── timestamp: Never use as a label!

RULE OF THUMB:
Total time series = product of all label cardinalities
http_requests{method, endpoint, status} = 5 × 100 × 20 = 10,000 series
This is manageable.

http_requests{method, endpoint, status, user_id} = 5 × 100 × 20 × 1,000,000
= 10 BILLION series. This will crash your metrics system.
"""

2.3 RED and USE Methods

TWO FRAMEWORKS FOR WHAT TO MEASURE

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  RED METHOD (for services)                                             │
│  ─────────────────────────                                             │
│                                                                        │
│  R - Rate:     Requests per second                                     │
│  E - Errors:   Failed requests per second                              │
│  D - Duration: Time per request (usually as histogram)                 │
│                                                                        │
│  Use for: APIs, web services, microservices                            │
│                                                                        │
│  Example metrics:                                                      │
│  ├── http_requests_total (rate)                                        │
│  ├── http_requests_total{status="5xx"} (errors)                        │
│  └── http_request_duration_seconds (duration)                          │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  USE METHOD (for resources)                                            │
│  ──────────────────────────                                            │
│                                                                        │
│  U - Utilization: % of resource busy                                   │
│  S - Saturation:  Queue depth / waiting work                           │
│  E - Errors:      Error count                                          │
│                                                                        │
│  Use for: Databases, queues, caches, CPUs, disks                       │
│                                                                        │
│  Example for database:                                                 │
│  ├── Utilization: Active connections / max connections                 │
│  ├── Saturation: Queries waiting for connection                        │
│  └── Errors: Connection failures, query errors                         │
│                                                                        │
│  Example for queue:                                                    │
│  ├── Utilization: Consumer throughput / max throughput                 │
│  ├── Saturation: Queue depth, consumer lag                             │
│  └── Errors: Processing failures, dead letters                        │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
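
The USE side often needs a small exporter loop of your own, because pools and queues don't expose Prometheus metrics by default. A minimal sketch for a database connection pool follows; the pool object and its attributes (size, in_use, waiting) are hypothetical, so adapt them to your driver.

# metrics/use_db_pool.py

from prometheus_client import Counter, Gauge

POOL_UTILIZATION = Gauge('db_pool_utilization_ratio', 'Active connections / pool size')
POOL_WAITING = Gauge('db_pool_waiting_requests', 'Requests waiting for a connection')
POOL_ERRORS = Counter('db_pool_errors_total', 'Connection acquisition failures')

def record_pool_metrics(pool) -> None:
    """Call periodically (e.g. on each scrape) to export pool state."""
    POOL_UTILIZATION.set(pool.in_use / pool.size)   # U - Utilization
    POOL_WAITING.set(pool.waiting)                  # S - Saturation

def on_connection_error() -> None:
    POOL_ERRORS.inc()                               # E - Errors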

Chapter 3: Logs Deep Dive

3.1 Structured Logging

# logging/structured.py

"""
Structured logging for observability.

Key principles:
1. JSON format for machine parsing
2. Consistent field names
3. Include context (trace_id, user_id, etc.)
4. Appropriate log levels
"""

import json
import logging
import sys
from datetime import datetime
from typing import Any, Dict, Optional
from contextvars import ContextVar

# Context variables for request-scoped data
_request_context: ContextVar[Dict[str, Any]] = ContextVar(
    'request_context', 
    default={}
)


class StructuredFormatter(logging.Formatter):
    """
    Formats log records as JSON for structured logging.
    """
    
    STANDARD_FIELDS = [
        'timestamp',
        'level',
        'logger',
        'message',
        'trace_id',
        'span_id',
        'user_id',
        'tenant_id',
        'service',
        'environment',
    ]
    
    def __init__(self, service_name: str, environment: str):
        super().__init__()
        self.service_name = service_name
        self.environment = environment
    
    def format(self, record: logging.LogRecord) -> str:
        # Base fields
        log_entry = {
            'timestamp': datetime.utcnow().isoformat() + 'Z',
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'service': self.service_name,
            'environment': self.environment,
        }
        
        # Add context from request
        ctx = _request_context.get()
        if ctx:
            log_entry['trace_id'] = ctx.get('trace_id')
            log_entry['span_id'] = ctx.get('span_id')
            log_entry['user_id'] = ctx.get('user_id')
            log_entry['tenant_id'] = ctx.get('tenant_id')
            log_entry['request_id'] = ctx.get('request_id')
        
        # Add extra fields from the log call
        if hasattr(record, 'extra_fields'):
            log_entry.update(record.extra_fields)
        
        # Add exception info if present
        if record.exc_info:
            log_entry['exception'] = {
                'type': record.exc_info[0].__name__,
                'message': str(record.exc_info[1]),
                'traceback': self.formatException(record.exc_info)
            }
        
        # Add source location for errors
        if record.levelno >= logging.ERROR:
            log_entry['source'] = {
                'file': record.pathname,
                'line': record.lineno,
                'function': record.funcName
            }
        
        return json.dumps(log_entry, default=str)


class ContextLogger:
    """
    Logger that automatically includes context in all log messages.
    """
    
    def __init__(self, name: str):
        self.logger = logging.getLogger(name)
    
    def _log(self, level: int, message: str, **kwargs):
        """Log with extra fields, respecting the configured log level."""
        # Logger.handle() bypasses level filtering, so check it explicitly
        if not self.logger.isEnabledFor(level):
            return
        record = self.logger.makeRecord(
            self.logger.name,
            level,
            '',  # fn
            0,   # lno
            message,
            (),  # args
            None  # exc_info
        )
        record.extra_fields = kwargs
        self.logger.handle(record)
    
    def debug(self, message: str, **kwargs):
        self._log(logging.DEBUG, message, **kwargs)
    
    def info(self, message: str, **kwargs):
        self._log(logging.INFO, message, **kwargs)
    
    def warning(self, message: str, **kwargs):
        self._log(logging.WARNING, message, **kwargs)
    
    def error(self, message: str, **kwargs):
        self._log(logging.ERROR, message, **kwargs)
    
    def critical(self, message: str, **kwargs):
        self._log(logging.CRITICAL, message, **kwargs)


def set_request_context(
    trace_id: str,
    span_id: Optional[str] = None,
    user_id: Optional[str] = None,
    tenant_id: Optional[str] = None,
    request_id: Optional[str] = None
):
    """Set context for the current request."""
    _request_context.set({
        'trace_id': trace_id,
        'span_id': span_id,
        'user_id': user_id,
        'tenant_id': tenant_id,
        'request_id': request_id
    })


# =============================================================================
# USAGE EXAMPLE
# =============================================================================

def setup_logging(service_name: str, environment: str):
    """Configure structured logging for the application."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(StructuredFormatter(service_name, environment))
    
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)


# Example usage
logger = ContextLogger(__name__)

async def process_payment(payment_id: str, amount: float):
    """Process a payment with structured logging."""
    
    logger.info(
        "Processing payment",
        payment_id=payment_id,
        amount=amount,
        currency="USD"
    )
    
    try:
        result = await payment_gateway.charge(payment_id, amount)
        
        logger.info(
            "Payment successful",
            payment_id=payment_id,
            transaction_id=result.transaction_id,
            processing_time_ms=result.processing_time
        )
        
        return result
        
    except PaymentDeclinedException as e:
        logger.warning(
            "Payment declined",
            payment_id=payment_id,
            decline_reason=e.reason,
            decline_code=e.code
        )
        raise
        
    except Exception as e:
        logger.error(
            "Payment processing failed",
            payment_id=payment_id,
            error_type=type(e).__name__,
            error_message=str(e)
        )
        raise


# Output example (single line, formatted here for readability):
# {
#   "timestamp": "2024-01-15T10:30:00.000Z",
#   "level": "INFO",
#   "logger": "payment_service",
#   "message": "Processing payment",
#   "service": "payment-api",
#   "environment": "production",
#   "trace_id": "abc-123-def-456",
#   "user_id": "user-789",
#   "tenant_id": "tenant-001",
#   "payment_id": "pay-12345",
#   "amount": 99.99,
#   "currency": "USD"
# }

3.2 Log Levels and When to Use Them

LOG LEVEL GUIDELINES

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  DEBUG                                                                 │
│  ─────                                                                 │
│  Detailed information for debugging.                                   │
│  NOT enabled in production by default.                                 │
│                                                                        │
│  Use for:                                                              │
│  ├── Variable values during execution                                  │
│  ├── Entering/exiting functions                                        │
│  ├── Loop iterations                                                   │
│  └── Detailed state dumps                                              │
│                                                                        │
│  Example: "Processing item 3 of 100: product_id=ABC, price=19.99"      │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  INFO                                                                  │
│  ────                                                                  │
│  Normal operation events.                                              │
│  Enabled in production.                                                │
│                                                                        │
│  Use for:                                                              │
│  ├── Request received/completed                                        │
│  ├── Business events (order placed, payment processed)                 │
│  ├── Configuration loaded                                              │
│  └── Service started/stopped                                           │
│                                                                        │
│  Example: "Order created: order_id=12345, total=99.99"                 │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  WARNING                                                               │
│  ───────                                                               │
│  Something unexpected but handled.                                     │
│  System continues to function.                                         │
│                                                                        │
│  Use for:                                                              │
│  ├── Deprecated API usage                                              │
│  ├── Retryable errors (retry succeeded)                                │
│  ├── Resource usage approaching limits                                 │
│  ├── Fallback to default behavior                                      │
│  └── Expected business exceptions                                      │
│                                                                        │
│  Example: "Payment declined, will retry: attempt 2 of 3"               │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  ERROR                                                                 │
│  ─────                                                                 │
│  Something failed. Requires attention.                                 │
│  May affect individual requests.                                       │
│                                                                        │
│  Use for:                                                              │
│  ├── Unhandled exceptions                                              │
│  ├── Failed operations that can't be retried                           │
│  ├── Integration failures                                              │
│  └── Data corruption detected                                          │
│                                                                        │
│  Example: "Payment processing failed after 3 retries: timeout"         │
│                                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│                                                                        │
│  CRITICAL                                                              │
│  ────────                                                              │
│  System cannot continue.                                               │
│  Immediate attention required.                                         │
│                                                                        │
│  Use for:                                                              │
│  ├── Database connection pool exhausted                                │
│  ├── Out of memory                                                     │
│  ├── Critical configuration missing                                    │
│  ├── Security breach detected                                          │
│  └── Data loss imminent                                                │
│                                                                        │
│  Example: "Database connection pool exhausted, rejecting requests"     │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
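
A short sketch of this guidance applied to a retry loop: the same failure is a WARNING while a retry is still possible and an ERROR once the operation gives up. The fetch_rates callable is passed in so the example stays self-contained.

# logging/levels_demo.py

import logging
from typing import Callable

logger = logging.getLogger("fx")

def get_rates_with_retry(fetch_rates: Callable[[], dict], max_attempts: int = 3) -> dict:
    """Fetch exchange rates, logging at the level each outcome deserves."""
    for attempt in range(1, max_attempts + 1):
        try:
            rates = fetch_rates()
            logger.info("Rates fetched", extra={"attempt": attempt})
            return rates
        except TimeoutError:
            if attempt < max_attempts:
                # Unexpected but handled; the system keeps going -> WARNING
                logger.warning("Rate fetch timed out, retrying",
                               extra={"attempt": attempt, "max_attempts": max_attempts})
            else:
                # The operation failed for this request -> ERROR
                logger.error("Rate fetch failed after retries",
                             extra={"attempts": max_attempts})
                raise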

3.3 What to Log and What NOT to Log

LOGGING DO'S AND DON'TS

✅ DO LOG:
├── Request ID / Trace ID (for correlation)
├── User ID (for debugging user issues)
├── Tenant ID (for multi-tenant debugging)
├── Business events (order created, payment processed)
├── Errors with context
├── Performance metrics (processing time)
├── State transitions (order: pending → paid → shipped)
└── External API calls (request and response status)

❌ DON'T LOG:
├── Passwords or credentials (NEVER!)
├── Full credit card numbers (log last 4 only: ****1234)
├── Personal health information
├── Full API keys (log prefix only: sk_live_abc...)
├── Session tokens
├── Personally identifiable information in excess
├── Large payloads (log size, not content)
└── High-frequency events without sampling

SENSITIVE DATA HANDLING:

# ❌ WRONG
logger.info(f"User login: email={email}, password={password}")

# ✅ RIGHT  
logger.info("User login", email=email)  # No password!

# ❌ WRONG
logger.info(f"Payment: card={card_number}")

# ✅ RIGHT
logger.info("Payment", card_last_four=card_number[-4:])

# ❌ WRONG
logger.debug(f"Request body: {request.body}")  # Could be huge

# ✅ RIGHT
logger.debug("Request received", content_length=len(request.body))

Chapter 4: Distributed Tracing

4.1 Tracing Concepts

DISTRIBUTED TRACING CONCEPTS

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  THE PROBLEM                                                           │
│  ───────────                                                           │
│                                                                        │
│  A single user request touches many services:                          │
│                                                                        │
│  User → API Gateway → Auth → Users → Database                          │
│                            → Orders → Database                         │
│                                    → Payments → Bank API               │
│                                    → Notifications → Email Provider    │
│                                                                        │
│  When something is slow, WHERE is it slow?                             │
│  When something fails, WHERE did it fail?                              │
│                                                                        │
│  THE SOLUTION: DISTRIBUTED TRACING                                     │
│  ─────────────────────────────────                                     │
│                                                                        │
│  A TRACE represents the entire journey of a request.                   │
│  A SPAN represents one step in that journey.                           │
│                                                                        │
│  Trace: abc-123                                                        │
│  │                                                                     │
│  ├── Span: api-gateway (12ms)                                          │
│  │   └── Span: auth-service (5ms)                                      │
│  │       └── Span: token-validation (2ms)                              │
│  │                                                                     │
│  ├── Span: order-service (150ms)                                       │
│  │   ├── Span: db-query-orders (8ms)                                   │
│  │   ├── Span: payment-service (120ms)                                 │
│  │   │   └── Span: bank-api-call (100ms) ← SLOW!                       │
│  │   └── Span: notification-service (15ms)                             │
│  │       └── Span: email-send (10ms)                                   │
│  │                                                                     │
│  └── Total: 162ms                                                      │
│                                                                        │
│  Now we can see: bank-api-call took 100ms of our 162ms total.          │
│  That's where to focus optimization efforts.                           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

4.2 OpenTelemetry Implementation

# tracing/opentelemetry_setup.py

"""
OpenTelemetry tracing setup and utilities.

OpenTelemetry is the industry standard for tracing.
It provides consistent APIs across languages.
"""

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.trace import Status, StatusCode
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.b3 import B3MultiFormat
from contextlib import contextmanager
from typing import Optional, Dict, Any
import asyncio
import functools


def setup_tracing(
    service_name: str,
    otlp_endpoint: str = "localhost:4317",
    environment: str = "development"
):
    """
    Configure OpenTelemetry tracing for the application.
    """
    # Create tracer provider
    provider = TracerProvider(
        resource=Resource.create({
            "service.name": service_name,
            "deployment.environment": environment,
        })
    )
    
    # Export spans to collector (Jaeger, Zipkin, etc.)
    exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    
    # Set as global provider
    trace.set_tracer_provider(provider)
    
    # Set up context propagation (B3 format for compatibility)
    set_global_textmap(B3MultiFormat())
    
    # Auto-instrument common libraries
    FastAPIInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
    RedisInstrumentor().instrument()
    
    return trace.get_tracer(service_name)


class Tracing:
    """
    Tracing utilities for manual instrumentation.
    """
    
    def __init__(self, service_name: str):
        self.tracer = trace.get_tracer(service_name)
    
    @contextmanager
    def span(
        self,
        name: str,
        attributes: Optional[Dict[str, Any]] = None
    ):
        """
        Create a span within the current trace context.
        
        Usage:
            with tracing.span("process_payment", {"payment_id": "123"}):
                # ... do work ...
        """
        with self.tracer.start_as_current_span(name) as span:
            if attributes:
                for key, value in attributes.items():
                    span.set_attribute(key, value)
            
            try:
                yield span
            except Exception as e:
                span.set_status(Status(StatusCode.ERROR, str(e)))
                span.record_exception(e)
                raise
    
    def traced(
        self,
        name: Optional[str] = None,
        attributes: Optional[Dict[str, Any]] = None
    ):
        """
        Decorator to trace a function.
        
        Usage:
            @tracing.traced("process_order")
            async def process_order(order_id: str):
                ...
        """
        def decorator(func):
            span_name = name or func.__name__
            
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                with self.span(span_name, attributes):
                    return await func(*args, **kwargs)
            
            @functools.wraps(func)
            def sync_wrapper(*args, **kwargs):
                with self.span(span_name, attributes):
                    return func(*args, **kwargs)
            
            if asyncio.iscoroutinefunction(func):
                return async_wrapper
            return sync_wrapper
        
        return decorator
    
    def add_event(self, name: str, attributes: Dict[str, Any] = None):
        """Add an event to the current span."""
        span = trace.get_current_span()
        span.add_event(name, attributes or {})
    
    def set_attribute(self, key: str, value: Any):
        """Set an attribute on the current span."""
        span = trace.get_current_span()
        span.set_attribute(key, value)
    
    def get_trace_id(self) -> Optional[str]:
        """Get the current trace ID, if a span is active."""
        span_context = trace.get_current_span().get_span_context()
        if span_context.is_valid:
            return format(span_context.trace_id, '032x')
        return None


# =============================================================================
# USAGE EXAMPLE
# =============================================================================

tracing = Tracing("payment-service")


@tracing.traced("process_payment")
async def process_payment(payment_id: str, amount: float):
    """Process a payment with full tracing."""
    
    # Add attributes to span
    tracing.set_attribute("payment.id", payment_id)
    tracing.set_attribute("payment.amount", amount)
    
    # Validate payment
    with tracing.span("validate_payment"):
        validation_result = await validate_payment(payment_id, amount)
        if not validation_result.valid:
            tracing.add_event("validation_failed", {
                "reason": validation_result.reason
            })
            raise ValidationError(validation_result.reason)
    
    # Check fraud
    with tracing.span("fraud_check", {"payment_id": payment_id}):
        fraud_score = await fraud_service.check(payment_id)
        tracing.set_attribute("fraud.score", fraud_score)
    
    # Process with payment gateway
    with tracing.span("payment_gateway_call"):
        tracing.add_event("calling_stripe")
        result = await stripe_client.charge(payment_id, amount)
        tracing.add_event("stripe_response", {
            "transaction_id": result.transaction_id
        })
    
    return result


# The resulting trace:
#
# process_payment (total: 250ms)
# ├── validate_payment (15ms)
# ├── fraud_check (45ms)
# └── payment_gateway_call (190ms)
#     ├── Event: calling_stripe
#     └── Event: stripe_response {transaction_id: "txn_123"}

4.3 Trace Context Propagation

# tracing/propagation.py

"""
Trace context propagation across service boundaries.

When Service A calls Service B, how does Service B know
to continue the same trace?

Answer: Propagate trace context in HTTP headers.
"""

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
import httpx


class TracedHTTPClient:
    """
    HTTP client that propagates trace context.
    """
    
    def __init__(self):
        self.client = httpx.AsyncClient()
    
    async def request(
        self,
        method: str,
        url: str,
        **kwargs
    ) -> httpx.Response:
        """Make request with trace context."""
        
        # Get current trace context
        headers = kwargs.get('headers', {})
        
        # Inject trace context into headers
        inject(headers)
        
        kwargs['headers'] = headers
        
        # Make request (trace context now in headers)
        return await self.client.request(method, url, **kwargs)
    
    async def get(self, url: str, **kwargs) -> httpx.Response:
        return await self.request("GET", url, **kwargs)
    
    async def post(self, url: str, **kwargs) -> httpx.Response:
        return await self.request("POST", url, **kwargs)


# The headers will look like:
# {
#   "traceparent": "00-abc123def456-0123456789-01",
#   "tracestate": "congo=t61rcWkgMzE"
# }
#
# Or with B3 format:
# {
#   "X-B3-TraceId": "abc123def456",
#   "X-B3-SpanId": "0123456789",
#   "X-B3-Sampled": "1"
# }


class TracedFastAPIMiddleware:
    """
    Middleware that extracts trace context from incoming requests.
    """
    
    async def __call__(self, request, call_next):
        # Extract trace context from headers
        context = extract(request.headers)
        
        # Continue trace or start new one
        tracer = trace.get_tracer(__name__)
        
        with tracer.start_as_current_span(
            f"{request.method} {request.url.path}",
            context=context,
            kind=trace.SpanKind.SERVER
        ) as span:
            # Add request attributes
            span.set_attribute("http.method", request.method)
            span.set_attribute("http.url", str(request.url))
            span.set_attribute("http.route", request.url.path)
            
            response = await call_next(request)
            
            # Add response attributes
            span.set_attribute("http.status_code", response.status_code)
            
            return response

Part II: Implementation

Chapter 5: Building an Observability Stack

5.1 The Observability Pipeline

OBSERVABILITY DATA FLOW

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  APPLICATION                                                           │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  Your Code                                                      │   │
│  │  ├── Metrics: prometheus_client                                 │   │
│  │  ├── Logs: structlog/logging → JSON                             │   │
│  │  └── Traces: opentelemetry-sdk                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                         │
│                              ▼                                         │
│  COLLECTION                                                            │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  Collectors/Agents                                              │   │
│  │  ├── Prometheus (scrapes /metrics endpoint)                     │   │
│  │  ├── Fluentd/Vector/Filebeat (ships logs)                       │   │
│  │  └── OpenTelemetry Collector (receives traces)                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                         │
│                              ▼                                         │
│  STORAGE                                                               │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  ├── Prometheus/Mimir/Thanos (metrics)                          │   │
│  │  ├── Elasticsearch/Loki (logs)                                  │   │
│  │  └── Jaeger/Tempo (traces)                                      │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                         │
│                              ▼                                         │
│  VISUALIZATION                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  Grafana                                                        │   │
│  │  ├── Dashboards (metrics)                                       │   │
│  │  ├── Explore (logs, traces)                                     │   │
│  │  └── Alerting                                                   │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
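
On the application side, the whole pipeline boils down to three setup calls at startup. A minimal sketch that reuses setup_logging (Chapter 3) and setup_tracing (Chapter 4); the module paths, collector endpoint, and standalone metrics port are assumptions for this example.

# observability/bootstrap.py

from prometheus_client import start_http_server

from logging_setup import setup_logging   # Chapter 3 helper (module path assumed)
from tracing_setup import setup_tracing   # Chapter 4 helper (module path assumed)

def bootstrap_observability(service_name: str, environment: str) -> None:
    # Logs: JSON to stdout, shipped by Fluentd/Vector/Filebeat
    setup_logging(service_name, environment)

    # Traces: exported over OTLP to the OpenTelemetry Collector (endpoint assumed)
    setup_tracing(service_name, otlp_endpoint="otel-collector:4317",
                  environment=environment)

    # Metrics: expose /metrics on a dedicated port for Prometheus to scrape
    start_http_server(9100)

if __name__ == "__main__":
    bootstrap_observability("payment-api", "production")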

5.2 Instrumenting a FastAPI Application

# observability/fastapi_setup.py

"""
Complete observability setup for a FastAPI application.
"""

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
import time
import logging
import json
import uuid
from contextvars import ContextVar

# =============================================================================
# METRICS SETUP
# =============================================================================

# Request metrics (RED method)
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

REQUESTS_IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently being processed',
    ['method', 'endpoint']
)

# Business metrics
ORDERS_CREATED = Counter(
    'orders_created_total',
    'Total orders created',
    ['status', 'payment_method']
)

ORDER_VALUE = Histogram(
    'order_value_dollars',
    'Order value in dollars',
    buckets=[10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]
)


# =============================================================================
# LOGGING SETUP
# =============================================================================

_request_id: ContextVar[str] = ContextVar('request_id', default='')
_trace_id: ContextVar[str] = ContextVar('trace_id', default='')


class ObservabilityMiddleware:
    """
    Middleware that instruments requests with metrics, logs, and traces.
    """
    
    def __init__(self, app: FastAPI):
        self.app = app
    
    async def __call__(self, scope, receive, send):
        if scope['type'] != 'http':
            await self.app(scope, receive, send)
            return
        
        request = Request(scope, receive)
        method = request.method
        path = self._get_path_template(request)
        
        # Generate request ID
        request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
        _request_id.set(request_id)
        
        # Get trace ID from OpenTelemetry (blank if no valid span is active)
        span_context = trace.get_current_span().get_span_context()
        trace_id = format(span_context.trace_id, '032x') if span_context.is_valid else ''
        _trace_id.set(trace_id)
        
        # Track in-progress requests
        REQUESTS_IN_PROGRESS.labels(method=method, endpoint=path).inc()
        
        start_time = time.perf_counter()
        status_code = 500  # Default if exception occurs
        
        async def send_wrapper(message):
            nonlocal status_code
            if message['type'] == 'http.response.start':
                status_code = message['status']
            await send(message)
        
        try:
            # Log request
            logging.info(json.dumps({
                'event': 'request_started',
                'method': method,
                'path': str(request.url),
                'request_id': request_id,
                'trace_id': trace_id,
                'user_agent': request.headers.get('User-Agent', ''),
            }))
            
            await self.app(scope, receive, send_wrapper)
            
        finally:
            # Calculate duration
            duration = time.perf_counter() - start_time
            
            # Record metrics
            REQUEST_COUNT.labels(
                method=method,
                endpoint=path,
                status=str(status_code)
            ).inc()
            
            REQUEST_LATENCY.labels(
                method=method,
                endpoint=path
            ).observe(duration)
            
            REQUESTS_IN_PROGRESS.labels(method=method, endpoint=path).dec()
            
            # Log response
            logging.info(json.dumps({
                'event': 'request_completed',
                'method': method,
                'path': path,
                'status_code': status_code,
                'duration_ms': round(duration * 1000, 2),
                'request_id': request_id,
                'trace_id': trace_id,
            }))
    
    def _get_path_template(self, request: Request) -> str:
        """Get path template for metric labels (avoid high cardinality)."""
        # Use the route path template if available
        if hasattr(request, 'scope') and 'route' in request.scope:
            route = request.scope['route']
            if hasattr(route, 'path'):
                return route.path
        
        # Fallback: normalize the path
        path = request.url.path
        
        # Replace UUIDs with placeholder
        import re
        path = re.sub(
            r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
            '{id}',
            path
        )
        
        # Replace numeric IDs
        path = re.sub(r'/\d+', '/{id}', path)
        
        return path


def create_app() -> FastAPI:
    """Create FastAPI app with full observability."""
    app = FastAPI()
    
    # Add observability middleware
    app.add_middleware(ObservabilityMiddleware)
    
    # Instrument with OpenTelemetry
    FastAPIInstrumentor.instrument_app(app)
    
    # Metrics endpoint
    @app.get('/metrics')
    async def metrics():
        return Response(
            content=generate_latest(),
            media_type=CONTENT_TYPE_LATEST
        )
    
    # Health check (not instrumented)
    @app.get('/health')
    async def health():
        return {'status': 'healthy'}
    
    return app

Chapter 6: Dashboards and Alerting

6.1 Dashboard Design Principles

DASHBOARD DESIGN PRINCIPLES

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  PRINCIPLE 1: LAYERED DASHBOARDS                                       │
│  ───────────────────────────────                                       │
│                                                                        │
│  Layer 1: Executive (high-level)                                       │
│  ├── Overall availability                                              │
│  ├── Error budget status                                               │
│  └── Key business metrics                                              │
│                                                                        │
│  Layer 2: Service (per-service)                                        │
│  ├── RED metrics for this service                                      │
│  ├── Dependencies status                                               │
│  └── Recent deployments                                                │
│                                                                        │
│  Layer 3: Debug (detailed)                                             │
│  ├── Per-endpoint latency                                              │
│  ├── Resource utilization                                              │
│  └── Linked to logs and traces                                         │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  PRINCIPLE 2: THE 4 GOLDEN SIGNALS FIRST                               │
│  ────────────────────────────────────────                              │
│                                                                        │
│  Every service dashboard should show:                                  │
│  1. Latency - p50, p90, p99 over time                                  │
│  2. Traffic - Requests per second                                      │
│  3. Errors - Error rate, by type                                       │
│  4. Saturation - Resource utilization                                  │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  PRINCIPLE 3: TIME COMPARISONS                                         │
│  ─────────────────────────────                                         │
│                                                                        │
│  Show current vs:                                                      │
│  ├── Last hour                                                         │
│  ├── Same time yesterday                                               │
│  ├── Same time last week                                               │
│  └── Baseline (normal behavior)                                        │
│                                                                        │
│  This makes anomalies obvious.                                         │
│                                                                        │
│  ═══════════════════════════════════════════════════════════════════   │
│                                                                        │
│  PRINCIPLE 4: LINK EVERYTHING                                          │
│  ────────────────────────────                                          │
│                                                                        │
│  From any metric:                                                      │
│  ├── Click to see related logs                                         │
│  ├── Click to see traces for that time window                          │
│  ├── Click to see deployments                                          │
│  └── Click to run relevant queries                                     │
│                                                                        │
│  Reduce time to root cause.                                            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
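
Principles 3 and 4 are the easiest to skip, so here is a minimal sketch of a time-comparison panel expressed as plain Python data. Assumptions: Prometheus-style queries against the `http_request_duration_seconds` histogram used later in this chapter, a purely illustrative panel dict (not a real Grafana schema), and placeholder log/trace URLs.

```python
# dashboards/latency_panel.py
# Sketch only: Principle 3 (time comparisons) as PromQL queries,
# Principle 4 (link everything) as placeholder deep links.
# The dict layout is illustrative, not a real Grafana panel schema.


def p99_latency(offset: str = "") -> str:
    """Build a p99 latency query, optionally shifted back with PromQL `offset`."""
    shift = f" offset {offset}" if offset else ""
    return (
        "histogram_quantile(0.99, "
        f"sum(rate(http_request_duration_seconds_bucket[5m]{shift})) by (le))"
    )


latency_panel = {
    "title": "p99 latency: now vs yesterday vs last week",
    "queries": [
        {"legend": "now", "expr": p99_latency()},
        {"legend": "yesterday", "expr": p99_latency("1d")},
        {"legend": "last week", "expr": p99_latency("1w")},
    ],
    # Placeholder links from the panel to logs and traces for the same window.
    "links": {
        "logs": "https://logs.example.com/?query=service:api",
        "traces": "https://traces.example.com/?service=api&minDuration=1s",
    },
}
```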

6.2 Alert Design

```python
# alerting/rules.py

"""
Alert rule definitions following best practices.

Good alerts are:
- Actionable: Someone should DO something
- Urgent: It matters NOW
- Clear: The problem is obvious
- Linked: Points to runbook
"""

from dataclasses import dataclass
from typing import List, Optional
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # Page immediately, any time
    WARNING = "warning"    # Page during business hours, or ticket
    INFO = "info"          # Log, review later


@dataclass
class AlertRule:
    name: str
    expression: str
    duration: str  # How long condition must be true
    severity: Severity
    summary: str
    description: str
    runbook_url: str
    labels: Optional[dict] = None


# =============================================================================
# GOOD ALERT EXAMPLES
# =============================================================================

ALERT_RULES: List[AlertRule] = [
    
    # SLO-based alert: Error budget burning fast
    AlertRule(
        name="ErrorBudgetBurnRateCritical",
        expression="""
            (
              sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
            ) > 14.4 * 0.001
        """,
        duration="5m",
        severity=Severity.CRITICAL,
        summary="Error budget burning at critical rate",
        description="Error rate is 14.4x the SLO threshold. "
                   "At this rate, monthly budget exhausts in 1 hour.",
        runbook_url="https://wiki/runbooks/error-budget-burn",
    ),
    
    # Symptom-based alert: Users experiencing errors
    AlertRule(
        name="HighErrorRate",
        expression="""
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
            > 0.05
        """,
        duration="5m",
        severity=Severity.CRITICAL,
        summary="High error rate: {{ $value | humanizePercentage }}",
        description="More than 5% of requests are failing. "
                   "Users are being impacted.",
        runbook_url="https://wiki/runbooks/high-error-rate",
    ),
    
    # Latency alert
    AlertRule(
        name="HighLatency",
        expression="""
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
            ) > 1
        """,
        duration="5m",
        severity=Severity.WARNING,
        summary="p99 latency above 1 second",
        description="99th percentile latency is {{ $value }}s. "
                   "Users may experience slow responses.",
        runbook_url="https://wiki/runbooks/high-latency",
    ),
    
    # Saturation alert: Resource exhaustion warning
    AlertRule(
        name="DatabaseConnectionPoolNearExhaustion",
        expression="""
            db_connections_active / db_connections_max > 0.8
        """,
        duration="10m",
        severity=Severity.WARNING,
        summary="Database connection pool at {{ $value | humanizePercentage }}",
        description="Connection pool is nearly exhausted. "
                   "Requests may start failing if this continues.",
        runbook_url="https://wiki/runbooks/db-connection-pool",
    ),
    
    # Dependency alert
    AlertRule(
        name="PaymentProviderErrors",
        expression="""
            sum(rate(external_api_requests_total{
              service="payment-provider",
              status="error"
            }[5m]))
            /
            sum(rate(external_api_requests_total{
              service="payment-provider"
            }[5m]))
            > 0.1
        """,
        duration="5m",
        severity=Severity.CRITICAL,
        summary="Payment provider error rate high",
        description="10%+ of payment provider requests failing. "
                   "Payments may be impacted.",
        runbook_url="https://wiki/runbooks/payment-provider-errors",
    ),
]


# =============================================================================
# BAD ALERT EXAMPLES (DON'T DO THIS)
# =============================================================================

BAD_ALERTS = """

❌ BAD: CPU alert
   expression: cpu_usage > 80
   
   Why bad:
   - High CPU might be fine if latency is good
   - Not actionable (what do you do?)
   - Doesn't indicate user impact
   
   Better: Alert on latency or error rate

❌ BAD: Any error alert
   expression: error_count > 0
   
   Why bad:
   - Some errors are normal
   - Extremely noisy
   - Desensitizes on-call
   
   Better: Alert on error RATE exceeding threshold

❌ BAD: Missing runbook
   summary: "Something went wrong"
   runbook: (none)
   
   Why bad:
   - On-call can't help if they don't know what to do
   - Increases mean time to recovery
   
   Better: Always include runbook link

❌ BAD: Unclear scope
   summary: "Errors detected"
   
   Why bad:
   - Which service?
   - Which endpoint?
   - How bad?
   
   Better: "Payment API: 5% error rate affecting checkout"

"""

6.3 Runbook Template

# Runbook: High Error Rate

## Alert Details
- **Alert name**: HighErrorRate
- **Severity**: Critical
- **Summary**: HTTP error rate exceeds 5%

## Quick Assessment (< 2 minutes)

1. **Check scope**: Is this one endpoint or all endpoints?

Query: sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))


2. **Check timeline**: When did it start?
- Look at dashboard for change point
- Check recent deployments

3. **Check error types**: What's failing?

Query: sum by (status, error_type) (rate(http_requests_total{status=~"5.."}[5m]))


## Common Causes and Fixes

### Cause 1: Recent deployment broke something
**Symptoms**:
- Errors started at deployment time
- Specific endpoint failing

**Fix**:
```bash
# Rollback to previous version
kubectl rollout undo deployment/api-server
```

### Cause 2: Downstream service is down
**Symptoms**:
- Errors are timeouts or connection refused
- Logs show connection errors to specific service

**Fix**:
1. Check downstream service status
2. If down, check THEIR runbooks
3. Consider enabling fallback/circuit breaker

### Cause 3: Database is overloaded
**Symptoms**:
- Errors are timeouts
- Database CPU/connections high

**Fix**:
1. Check slow query log
2. Kill long-running queries if safe
3. Consider scaling read replicas

## Escalation
- If not resolved in 15 minutes: Page senior on-call
- If customer-reported: Loop in customer support
- If data loss suspected: Page database team

## Post-Incident
- Incident documented
- Timeline recorded
- Root cause identified
- Follow-up action items created
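
The Quick Assessment queries above can also be scripted so on-call doesn't have to paste PromQL by hand at 3 AM. A minimal sketch, assuming the `requests` library and a Prometheus server at a placeholder URL:

```python
# runbooks/quick_assessment.py
# Sketch: run the runbook's "Quick Assessment" PromQL checks against
# Prometheus' HTTP API (GET /api/v1/query). PROM_URL is a placeholder.

import requests

PROM_URL = "http://prometheus.example.com:9090"

QUERIES = {
    "errors_by_endpoint": 'sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))',
    "errors_by_type": 'sum by (status, error_type) (rate(http_requests_total{status=~"5.."}[5m]))',
}


def instant_query(expr: str) -> list:
    """Run an instant query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    for name, expr in QUERIES.items():
        print(f"\n== {name} ==")
        for series in instant_query(expr):
            labels = series["metric"]          # e.g. {"endpoint": "/api/payments"}
            value = float(series["value"][1])  # [timestamp, "value"]
            print(f"{labels}: {value:.4f} errors/sec")
```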

---

# Part III: Real-World Application

## Chapter 7: Case Studies

### 7.1 How Netflix Debugs Issues

NETFLIX'S OBSERVABILITY APPROACH

Context:
├── 200+ microservices
├── Millions of requests per second
├── Global infrastructure
└── Any slowdown = users see buffering

Key Tools:

  1. ATLAS (Metrics)
     ├── In-house time series database
     ├── Handles millions of metrics
     ├── Real-time streaming aggregation
     └── Used for dashboards and alerts

  2. EDGAR (Real-time Traces)
     ├── Samples 1% of requests
     ├── Full distributed traces
     ├── Automatic anomaly detection
     └── Links to playback issues

  3. LUMEN (Log Analysis)
     ├── Centralized logging
     ├── Machine learning for anomaly detection
     ├── Correlation with metrics and traces
     └── Root cause suggestions

Debug Workflow:
├── 1. Alert fires (Atlas)
├── 2. Check which service (Atlas dashboards)
├── 3. Find affected traces (Edgar)
├── 4. Examine logs for that trace (Lumen)
├── 5. Identify root cause
└── 6. Fix and verify

Key Insight: They built custom tools because scale demanded it. Most companies should use off-the-shelf tools.


### 7.2 How Google Does Observability

GOOGLE'S OBSERVABILITY PHILOSOPHY

From the SRE book:

"Observability is about surfacing emergent behavior, not just known failure modes."

Key Principles:

  1. STRUCTURED DATA EVERYWHERE
     ├── All logs are structured (protobuf)
     ├── All metrics have consistent labels
     └── All traces follow consistent format

  2. CORRELATION BY DEFAULT
     ├── Every request has a trace ID
     ├── Trace ID propagates across all services
     └── Logs, metrics, traces linked by trace ID

  3. GOLDEN SIGNALS FOCUS
     ├── Latency
     ├── Traffic
     ├── Errors
     └── Saturation

  4. PROGRESSIVE ROLLOUTS + OBSERVABILITY
     ├── Canary deployments
     ├── Automatic rollback if metrics degrade
     └── Observability enables safe deployments

Tools:

├── Monarch (metrics) - Handles Google scale
├── Dapper (traces) - Invented distributed tracing
├── Cloud Logging (logs) - Structured log storage
└── Cloud Monitoring - Unified dashboards

Lesson: Observability is not optional at scale. It's infrastructure, not an afterthought.


---

## Chapter 8: Common Mistakes

OBSERVABILITY ANTI-PATTERNS

❌ MISTAKE 1: Too Many Metrics

Wrong:

  • 10,000 metric names
  • High-cardinality labels (user_id, request_id)
  • Everything has a metric

Problems:

  • Metrics storage explodes
  • Dashboards are overwhelming
  • Can't find signal in noise

Right:

  • Focused metrics (RED, USE)
  • Low-cardinality labels
  • If you won't alert or dashboard, don't metric

❌ MISTAKE 2: Logs Without Context

Wrong: log.error("Payment failed")

Problems:

  • Which payment?
  • Which user?
  • What was the error?
  • Can't correlate to other events

Right: log.error("Payment failed", { payment_id: "pay-123", user_id: "user-456", error_code: "insufficient_funds", trace_id: "abc-789" })
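
Here is a minimal sketch of that "Right" version in Python, using only the standard library. The field names mirror the example above; in a real service the trace_id would come from the active tracing context rather than a literal.

```python
# logging_example.py
# Sketch: structured (JSON) logging with the standard library only.

import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line, including extra fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` land on the record object.
            # Only the fields used in this example are picked up here.
            **{k: v for k, v in record.__dict__.items()
               if k in ("payment_id", "user_id", "error_code", "trace_id")},
        }
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Context-rich, queryable, correlatable via trace_id:
log.error(
    "Payment failed",
    extra={
        "payment_id": "pay-123",
        "user_id": "user-456",
        "error_code": "insufficient_funds",
        "trace_id": "abc-789",
    },
)
```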

❌ MISTAKE 3: No Correlation

Wrong:

  • Metrics in Prometheus
  • Logs in Elasticsearch
  • Traces in Jaeger
  • No links between them

Problems:

  • Switching between tools is slow
  • Context lost during debugging
  • Longer time to resolution

Right:

  • Include trace_id in all logs
  • Link dashboards to logs/traces
  • Use exemplars (link metrics to traces)

❌ MISTAKE 4: Alert on Causes, Not Symptoms

Wrong: Alert: "CPU usage > 80%"

Problems:

  • CPU can be high while system is healthy
  • Doesn't indicate user impact
  • Not actionable

Right: Alert: "p99 latency > 500ms"

This is a symptom users experience. CPU is just one possible cause.

❌ MISTAKE 5: No Runbooks

Wrong: Alert: "Database errors detected" Action: ???

Problems:

  • On-call doesn't know what to do
  • Different people do different things
  • Longer MTTR

Right: Every alert has a runbook link. Runbook has step-by-step diagnosis. Runbook is tested and updated.


---

# Part IV: Interview Preparation

## Chapter 9: Interview Tips

### 9.1 Observability Discussion Framework

DISCUSSING OBSERVABILITY IN INTERVIEWS

When asked "How would you debug a production issue?":

  1. START WITH DETECTION "First, we need to know there's a problem. I'd set up alerting based on SLOs: if the error-budget burn rate crosses its threshold, we get paged."

  2. DESCRIBE THE THREE PILLARS "For debugging, I rely on three signals:

    • Metrics show me WHEN and HOW BAD
    • Logs show me WHAT happened
    • Traces show me WHERE in the request path"
  3. WALK THROUGH A DEBUG SESSION "If I get an alert for high latency:

    1. Check dashboard - which service is slow?
    2. Look at traces - where in the call chain?
    3. Examine logs - what errors are happening?
    4. Correlate with deployments - did something change?"
  4. MENTION CORRELATION "The key is having trace IDs that link everything. I should be able to go from metric → trace → logs for a single request seamlessly."

  5. TALK ABOUT ACTIONABILITY "Every alert needs a runbook. When I'm paged at 3am, I need to know exactly what steps to take."


### 9.2 Key Phrases

OBSERVABILITY KEY PHRASES

On the Three Pillars: "Metrics tell me WHEN something is wrong and HOW BAD. Logs tell me WHAT happened specifically. Traces tell me WHERE in the system it happened. Together, they answer WHY."

On Correlation: "The first thing I do is find the trace ID. From there, I can see the entire request journey, pull up the logs for each service involved, and correlate with metrics for that time window."

On Alert Design: "I alert on symptoms, not causes. Users don't care if CPU is high — they care if their request is slow. So I alert on latency, not CPU. CPU is something I investigate after the alert."

On Structured Logging: "Every log line includes context: trace_id, user_id, request_id, and relevant business data. Without structure, logs are just text to grep through. With structure, I can query 'show me all errors for user X in the last hour.'"

On Debugging: "My debug workflow is:

  1. Alert tells me something is wrong
  2. Dashboard shows me scope and timeline
  3. Traces show me the slow/failing path
  4. Logs give me the specific error
  5. This usually identifies the root cause in under 10 minutes."

---

## Chapter 10: Practice Problems

### Problem 1: Debug a Slow API

**Scenario:**
Users report the checkout API is slow. How do you investigate?

**Questions:**
1. What metrics would you check first?
2. How would you use traces to narrow down the issue?
3. What logs would you look for?

<details>
<summary>Click for approach</summary>

1. **Metrics first:**
   - Check p99 latency for /checkout endpoint
   - Compare to baseline (yesterday, last week)
   - Check error rate (slowness might be from retries)

2. **Traces to narrow down:**
   - Sample traces for slow requests (> 1s)
   - Look at span durations to find the slow segment
   - Is it database? External API? Computation?

3. **Logs for details:**
   - Search logs for slow trace IDs
   - Look for specific errors or warnings
   - Check for unusual patterns (large payloads, etc.)

</details>

### Problem 2: Design Observability for a New Service

**Scenario:**
You're launching a new notification service. Design the observability.

**Questions:**
1. What metrics would you expose?
2. What should be logged?
3. How would you set up tracing?

<details>
<summary>Click for approach</summary>

1. **Metrics (RED + business):**
   - `notifications_sent_total{channel, status}`
   - `notification_latency_seconds{channel}`
   - `notification_queue_depth`
   - `provider_request_duration_seconds{provider}`

2. **Logging:**
   - INFO: Notification sent (notification_id, user_id, channel)
   - WARNING: Retry attempted, provider fallback
   - ERROR: Send failed, provider error

3. **Tracing:**
   - Trace from API receipt to delivery confirmation
   - Spans for: validation, routing, provider call, callback
   - Include notification_id as span attribute

</details>
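
As a concrete sketch, the metrics from the approach above could be declared with `prometheus_client` roughly as follows. The metric and label names are the hypothetical ones from the answer, not an existing service.

```python
# notifications/metrics.py
# Sketch: RED + business metrics for the hypothetical notification service.

from prometheus_client import Counter, Gauge, Histogram

NOTIFICATIONS_SENT = Counter(
    "notifications_sent_total",
    "Notifications sent, by channel and outcome",
    ["channel", "status"],
)

NOTIFICATION_LATENCY = Histogram(
    "notification_latency_seconds",
    "End-to-end send latency, by channel",
    ["channel"],
)

QUEUE_DEPTH = Gauge(
    "notification_queue_depth",
    "Notifications waiting to be processed",
)

PROVIDER_LATENCY = Histogram(
    "provider_request_duration_seconds",
    "Latency of calls to external providers",
    ["provider"],
)

# Usage inside the send path (send_email is a placeholder):
# NOTIFICATIONS_SENT.labels(channel="email", status="success").inc()
# with NOTIFICATION_LATENCY.labels(channel="email").time():
#     send_email(...)
```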

---

## Chapter 11: Sample Interview Dialogue

**Interviewer**: "You're on call and get paged for high latency. Walk me through how you'd debug it."

**You**: "Sure. Let me walk through my process.

**First, I acknowledge the alert and check the dashboard:**"

Dashboard shows:
├── p99 latency: 2.3s (normally 200ms)
├── Started: 10 minutes ago
├── Affected: /api/orders endpoint
└── Error rate: Normal (so it's slow, not failing)


**"Now I know it's the orders endpoint. Let me look at traces:**"

Sample slow trace (trace_id: abc-123):
├── api-gateway: 5ms ✓
├── orders-service: 2.2s ← SLOW
│   ├── get-user: 50ms ✓
│   ├── get-products: 2s ← HERE'S THE PROBLEM
│   │   └── database-query: 2s
│   └── calculate-total: 10ms ✓
└── Total: 2.3s


**Interviewer**: "So you found it's the products database query. Now what?"

**You**: "I'd check the logs for that trace ID to see the actual query:"

```json
{
  "trace_id": "abc-123",
  "span": "database-query",
  "query": "SELECT * FROM products WHERE id IN (...)",
  "duration_ms": 2000,
  "rows_returned": 500
}
```

"500 rows is a lot. Let me check if this is a new pattern:"

Query metrics show:
├── rows_returned p99: Was 10, now 500
├── Change started: Matches latency spike
└── Correlates with: New customer onboarded

**"Root cause: A new customer placed an order with 500 products, overwhelming our query. The fix could be pagination or a query optimization.

For immediate mitigation, I might add a query timeout or limit. Long-term, we need to optimize for large orders."**

Interviewer: "Good systematic approach. How would you prevent this in the future?"

You: "A few things:

  1. Add an alert for abnormal query row counts
  2. Add a dashboard panel for p99 rows returned
  3. Consider adding a limit in the code
  4. Load test with realistic order sizes

The key is making the unusual case visible before it becomes a problem."
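
The first prevention item could slot straight into the AlertRule list from section 6.2. A hedged sketch: the histogram metric `db_query_rows_returned` is hypothetical, and it assumes the query layer records rows returned per named query.

```python
# Sketch: extend ALERT_RULES (section 6.2) with a row-count anomaly alert.
# Assumes the module path alerting/rules.py and a hypothetical histogram
# metric `db_query_rows_returned` labeled by query_name.

from alerting.rules import ALERT_RULES, AlertRule, Severity

ALERT_RULES.append(
    AlertRule(
        name="AbnormalQueryRowCount",
        expression="""
            histogram_quantile(0.99,
              sum(rate(db_query_rows_returned_bucket[10m])) by (le, query_name)
            ) > 100
        """,
        duration="15m",
        severity=Severity.WARNING,
        summary="p99 rows returned above 100 for {{ $labels.query_name }}",
        description="A query is returning far more rows than usual. "
                    "Large result sets caused a past latency incident.",
        runbook_url="https://wiki/runbooks/abnormal-query-row-count",
    )
)
```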


Summary

┌────────────────────────────────────────────────────────────────────────┐
│                    DAY 2 KEY TAKEAWAYS                                 │
│                                                                        │
│  THE THREE PILLARS:                                                    │
│  ├── Metrics: WHEN and HOW BAD (aggregated numbers)                    │
│  ├── Logs: WHAT happened (detailed events)                             │
│  ├── Traces: WHERE in the system (request journey)                     │
│  └── Together: WHY it happened                                         │
│                                                                        │
│  METRICS BEST PRACTICES:                                               │
│  ├── Use RED method for services (Rate, Errors, Duration)              │
│  ├── Use USE method for resources (Utilization, Saturation, Errors)    │
│  ├── Low-cardinality labels only                                       │
│  └── Include units in metric names                                     │
│                                                                        │
│  LOGGING BEST PRACTICES:                                               │
│  ├── Structured JSON format                                            │
│  ├── Include trace_id for correlation                                  │
│  ├── Include context (user_id, request_id)                             │
│  └── Don't log sensitive data                                          │
│                                                                        │
│  TRACING BEST PRACTICES:                                               │
│  ├── Propagate trace context across services                           │
│  ├── Name spans meaningfully                                           │
│  ├── Add relevant attributes                                           │
│  └── Sample appropriately at scale                                     │
│                                                                        │
│  ALERTING BEST PRACTICES:                                              │
│  ├── Alert on symptoms, not causes                                     │
│  ├── Every alert needs a runbook                                       │
│  ├── Use multi-window burn rate for SLOs                               │
│  └── Severity should be clear and consistent                           │
│                                                                        │
│  THE DEBUG WORKFLOW:                                                   │
│  1. Alert fires → Check dashboard for scope                            │
│  2. Find slow/failing traces                                           │
│  3. Examine spans to locate issue                                      │
│  4. Check logs for details                                             │
│  5. Correlate with changes (deploys, etc.)                             │
│  6. Fix and verify                                                     │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
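
To make the tracing takeaways concrete, here is a minimal OpenTelemetry sketch: a meaningful span name, a child span, and relevant attributes. It assumes the `opentelemetry-api`/`opentelemetry-sdk` packages with an exporter configured elsewhere; the span and attribute names are illustrative, not from any specific service.

```python
# tracing_example.py
# Sketch: manual span creation with the OpenTelemetry Python API.
# Assumes an SDK/exporter is configured elsewhere; names are illustrative.

from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")


def process_checkout(order_id: str, item_count: int) -> None:
    # Meaningful span name: the operation, not a function-name dump.
    with tracer.start_as_current_span("checkout.process") as span:
        # Relevant attributes; high-cardinality IDs are fine on spans
        # (unlike metric labels).
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.item_count", item_count)

        with tracer.start_as_current_span("checkout.reserve_inventory"):
            reserve_inventory(order_id)


def reserve_inventory(order_id: str) -> None:
    # Spans created here automatically attach to the current context, so the
    # trace stays connected across function (and, with propagation configured,
    # service) boundaries.
    pass
```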

Further Reading

Books:

  • "Observability Engineering" by Charity Majors, Liz Fong-Jones, George Miranda
  • "Distributed Systems Observability" by Cindy Sridharan

Tools:

  • Grafana: Dashboards and visualization
  • Prometheus: Metrics collection and alerting
  • Jaeger/Tempo: Distributed tracing
  • Loki: Log aggregation
  • OpenTelemetry: Instrumentation standard

Articles:

  • Google: "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure"
  • Netflix: "Edgar: Solving Mysteries Faster with Observability"
  • Honeycomb Blog: Observability best practices

End of Day 2: Observability

Tomorrow: Day 3 — Deployment Strategies. You can define health (SLOs) and see health (observability). Now you need to ship changes without breaking that health. Blue-green, canary, feature flags — the techniques that let you deploy with confidence.


You're building the complete toolkit of a production engineer. Define health. See health. Now: maintain health through change.