Week 8 Capstone: Real-Time Analytics for Video Streaming Platform
A Complete System Design Interview Integrating All Week 8 Concepts
The Interview Begins
You walk into the interview room at a major streaming company. The interviewer, a Principal Engineer, greets you and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to design the analytics platform for our video streaming service. We have 150 million subscribers watching content across mobile, web, smart TVs, and gaming consoles. We need to track everything - what people watch, when they pause, what they search for - and turn that into actionable insights."
They write on the whiteboard:
────────────────────────────────────────────────────────────────────────────
 Design: Real-Time Analytics for Video Streaming

 Context:
 - 150M subscribers globally
 - 10M peak concurrent viewers
 - Content library: 50K titles (movies, series, documentaries)
 - Devices: Mobile (40%), Smart TV (35%), Web (15%), Gaming (10%)

 Requirements:
 1. Track all viewing events (play, pause, seek, complete)
 2. Real-time "Trending Now" dashboard (updates every minute)
 3. Content performance analytics for creators
 4. A/B test analysis for product experiments
 5. Historical viewing patterns for recommendations
────────────────────────────────────────────────────────────────────────────
Interviewer: "Take a few minutes to think about this, ask clarifying questions, and then walk me through your design. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions.
Your Questions
You: "This is a great problem. Before I start designing, I'd like to clarify a few requirements. First, what's the expected event volume? How many events per second should we plan for?"
Interviewer: "At peak, we see about 10 million concurrent viewers. Each viewer generates roughly 1 event every 10 seconds on average - heartbeats, seeks, quality changes. So peak is around 1 million events per second."
You: "That's significant scale. For the 'Trending Now' dashboard, what's the freshness requirement? You mentioned updates every minute - is that the business requirement or current state?"
Interviewer: "Business wants it under 1 minute. Ideally 30 seconds or less. We compete with social media for attention, so trending content needs to surface fast."
You: "Understood. For historical analytics and recommendations, what's the retention requirement? How far back do we need to query?"
Interviewer: "We need at least 2 years of viewing history for recommendations. For operational analytics, 90 days of detailed data is sufficient. Aggregated data we keep indefinitely."
You: "For the device mix you mentioned - mobile is 40%. Mobile apps often have connectivity issues. What's an acceptable lateness for events from mobile devices?"
Interviewer: "Good catch. We see mobile events arriving anywhere from real-time to several hours late when users were on a plane or subway. We need to handle that gracefully."
You: "One more question - for A/B testing, how many concurrent experiments do you typically run, and what's the analysis latency requirement?"
Interviewer: "We run 50-100 experiments at any time. For analysis, daily aggregates are fine - we don't need real-time experiment results."
You: "Perfect. Let me summarize the requirements as I understand them."
Functional Requirements
1. EVENT TRACKING
- Play, pause, resume, seek, stop, complete events
- Quality change events (buffering, resolution switches)
- Search and browse events
- Heartbeat events (every 10 seconds during playback)
2. REAL-TIME DASHBOARDS
- "Trending Now": updated every 30-60 seconds
- Live viewership counts per title
- Geographic distribution of viewers
- Quality metrics (buffering rate, errors)
3. CONTENT ANALYTICS (for creators/studios)
- Total views, unique viewers, completion rate
- Audience demographics and geography
- Comparison to similar content
- Performance over time (daily/weekly/monthly)
4. A/B TEST ANALYSIS
- Track experiment assignments
- Calculate metrics per variant
- Statistical significance computation
- Daily refresh acceptable
5. HISTORICAL ANALYTICS
- 2 years of viewing data for recommendations
- 90 days detailed, older aggregated
- Ad-hoc queries by analysts
Non-Functional Requirements
1. SCALE
- Peak: 1M events/second (10M concurrent viewers)
- Daily: 50B events
- Storage: ~3TB/day raw (compressed), ~10GB/day aggregated
2. LATENCY
- Event ingestion: < 1 second acknowledgment
- Trending dashboard: < 60 second freshness
- Historical queries: < 30 seconds for 90-day range
3. RELIABILITY
- No event loss (at-least-once delivery)
- Late data handling for mobile (up to 24 hours)
- 99.9% availability for dashboards
4. DATA QUALITY
- Deduplication of replayed events
- Handle out-of-order events correctly
- Clear distinction between real-time and final numbers
Phase 2: Back of the Envelope Estimation (5 minutes)
You: "Let me work through the numbers to validate scale assumptions."
Traffic Estimation
EVENT VOLUME
Peak concurrent viewers: 10,000,000 viewers
Events per viewer: 1 event / 10 seconds
Peak event rate: 1,000,000 events/second
Daily pattern:
├── Peak hours (4 hours): 1M events/sec × 4 hours = 14.4B events
├── High hours (8 hours): 500K events/sec × 8 hours = 14.4B events
├── Normal hours (12 hours): 200K events/sec × 12 hours = 8.6B events
└── Total daily: ~40-50B events
Event breakdown:
├── Heartbeats (70%): 35B/day (high volume, low value)
├── Playback events (20%): 10B/day (play, pause, seek, complete)
├── Browse/Search (8%): 4B/day
└── Quality events (2%): 1B/day
Storage Estimation
RAW EVENT STORAGE
Event size (average):
├── Event metadata: 100 bytes (event_id, type, timestamp)
├── User context: 50 bytes (user_id, device_id, session_id)
├── Content context: 50 bytes (content_id, episode_id)
├── Playback state: 50 bytes (position, duration, quality)
├── Device info: 50 bytes (device_type, os, app_version)
└── Total per event: ~300 bytes
Daily raw storage:
├── 50B events × 300 bytes = 15TB uncompressed
├── With compression (5x): 3TB/day
├── With replication (3x): 9TB/day effective
└── 90-day retention: 810TB
Aggregated storage:
├── Hourly aggregates: ~10GB/day
├── Daily aggregates: ~1GB/day
└── 2-year retention of daily aggregates: ~750GB
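These figures can be sanity-checked mechanically. A quick script using only the assumptions above (event rates, 300-byte events, 5x compression, 3x replication):

```python
# Sanity check for the traffic and storage estimates above.
daily_events = (
    1_000_000 * 4 * 3600    # peak hours
    + 500_000 * 8 * 3600    # high hours
    + 200_000 * 12 * 3600   # normal hours
)
print(f"Daily events: {daily_events / 1e9:.2f}B")  # 37.44B; plan for ~50B headroom

raw_tb = 50e9 * 300 / 1e12         # 50B events x 300 bytes, uncompressed
compressed_tb = raw_tb / 5         # ~5x columnar compression
replicated_tb = compressed_tb * 3  # 3x replication
print(f"Raw: {raw_tb:.0f}TB/day, compressed: {compressed_tb:.0f}TB/day, "
      f"90-day effective: {replicated_tb * 90:.0f}TB")
```

Rounding daily events up from ~37B to 50B buys headroom for growth and uneven peaks, which is why the summary uses the larger figure.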
Key Metrics Summary
──────────────────────────────────────────────────────────────────────────
 ESTIMATION SUMMARY

 TRAFFIC
 ├── Peak events: 1,000,000 /second
 ├── Daily events: 50 billion
 └── Monthly events: 1.5 trillion

 STORAGE
 ├── Daily raw (compressed): 3 TB
 ├── 90-day raw: 270 TB
 ├── Daily aggregates: 10 GB
 └── Total system: ~1 PB

 INFRASTRUCTURE (rough)
 ├── Kafka brokers: 50+ (for 1M events/sec)
 ├── Flink workers: 100+ (for streaming)
 ├── ClickHouse nodes: 20+ (for real-time OLAP)
 └── Spark cluster: 500+ cores (for batch)
──────────────────────────────────────────────────────────────────────────
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture."
System Architecture
HIGH-LEVEL ARCHITECTURE

CLIENT DEVICES
  Mobile (40%) │ Smart TV (35%) │ Web (15%) │ Gaming (10%) │ Other
        │
        ▼
INGESTION LAYER
  CDN/Edge (Regional) ──▶ API Gateway (Regional) ──▶ Collectors (Regional)
        │
        ▼
MESSAGE QUEUE (Kafka Cluster)
  Topics: events.playback │ events.browse │ events.quality │ events.dead_letter
        │
        ├─────────────────────────────────┐
        ▼                                 ▼
STREAMING LAYER                     BATCH LAYER
  Flink Cluster                       Spark Cluster
  ├── Aggregations                    ├── Daily ETL
  ├── Trending calc                   ├── Late data
  ├── Sessionization                  ├── A/B analysis
  └── Late handling                   └── ML features
        │                                 │
        ▼                                 ▼
SERVING LAYER
  ClickHouse (real-time):  trending, live counts, quality metrics
  BigQuery   (historical): ad-hoc analytics, A/B tests
  Redis      (cache):      dashboard results, hot data
  S3         (data lake):  raw events, Parquet archives
        │
        ▼
QUERY LAYER: Analytics API
  ├── /trending       (ClickHouse, 30s cache)
  ├── /content/{id}   (BigQuery, 5min cache)
  ├── /experiments    (BigQuery, 1hr cache)
  └── /explore        (BigQuery, no cache)
Component Breakdown
You: "Let me walk through each component..."
1. Ingestion Layer
Purpose: Receive events from all client devices reliably at scale.
Key responsibilities:
- Accept events from 10M+ concurrent clients
- Validate event schema and reject malformed events
- Buffer events for reliability
- Route to appropriate Kafka topics
Technology choice: Regional API Gateways + Event Collectors because we need low-latency ingestion globally and can validate/enrich events close to users.
2. Message Queue (Kafka)
Purpose: Durable, ordered event storage and distribution.
Key responsibilities:
- Buffer 1M+ events/second
- Retain events for replay (7 days)
- Distribute to streaming and batch consumers
- Handle backpressure gracefully
Topic design:
- events.playback: play, pause, seek, complete (partitioned by user_id)
- events.browse: search, navigation events
- events.quality: buffering, errors, quality changes
- events.dead_letter: failed validations for inspection
3. Streaming Layer (Flink)
Purpose: Real-time processing for dashboards and alerting.
Key responsibilities:
- Calculate trending content (1-minute windows)
- Maintain live viewer counts per title
- Detect quality issues in real-time
- Handle late data with watermarks
4. Batch Layer (Spark)
Purpose: Accurate historical analytics and heavy computations.
Key responsibilities:
- Daily aggregation jobs (finalized numbers)
- Late data reprocessing
- A/B test statistical analysis
- ML feature generation
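Since daily aggregates are acceptable, the significance computation can be a plain batch step rather than a streaming one. A minimal two-proportion z-test sketch using only the stdlib (the conversion numbers are illustrative, not from the source):

```python
import math

def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) comparing conversion rates of two variants."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)    # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# e.g. control: 4,800 completions / 100,000 viewers; variant: 5,200 / 100,000
z, p = two_proportion_z_test(4800, 100_000, 5200, 100_000)
print(f"z={z:.2f}, p={p:.4f}")  # significant at p < 0.05
```

With 50-100 concurrent experiments, a multiple-comparison correction (e.g. Bonferroni) would normally be layered on top; that is omitted here for brevity.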
5. Serving Layer
Purpose: Store processed data for different query patterns.
| Store | Use Case | Query Pattern |
|---|---|---|
| ClickHouse | Real-time dashboards | Sub-second, high concurrency |
| BigQuery | Historical analytics | Complex queries, ad-hoc |
| Redis | Dashboard caching | Millisecond lookups |
| S3 | Raw data lake | Batch processing, ML |
Phase 4: Deep Dives (20 minutes)
Interviewer: "Great high-level design. Let's dive deeper into a few areas. Start with the event schema and ingestion - how would you design the events?"
Deep Dive 1: Event Schema Design (Day 1 Concept)
You: "Event schema design is critical. Bad schemas create debt that compounds over time. Let me show you what I'd propose."
The Problem
SCHEMA DESIGN CHALLENGES
Without careful design:
├── Different teams send different formats
├── Schema changes break downstream consumers
├── Missing fields cause analytics gaps
└── No way to evolve schema safely
We need:
├── Consistent structure across all event types
├── Forward and backward compatibility
├── Rich enough for analytics, lean enough for volume
└── Clear versioning for evolution
The Solution
You: "I'd use an envelope pattern with a common header and type-specific payload."
EVENT SCHEMA DESIGN
EVENT ENVELOPE

{
  // HEADER (common to all events)
  "event_id": "evt_abc123",                     // UUID, idempotency key
  "event_type": "playback.started",             // domain.entity.action
  "event_version": "2.1",                       // schema version
  "event_time": "2024-01-15T14:30:00.123Z",     // when it happened
  "received_time": "2024-01-15T14:30:00.456Z",  // when server got it

  // CONTEXT (who/where)
  "user_id": "usr_12345",
  "device_id": "dev_67890",
  "session_id": "sess_abcdef",
  "device_type": "mobile",                      // mobile, tv, web, gaming
  "app_version": "5.2.1",
  "geo": {
    "country": "US",
    "region": "CA"
  },

  // PAYLOAD (event-specific)
  "payload": {
    "content_id": "mov_789",
    "content_type": "movie",
    "position_seconds": 0,
    "duration_seconds": 7200,
    "quality": "1080p",
    "audio_language": "en",
    "subtitle_language": "es"
  },

  // EXPERIMENTS (A/B test context)
  "experiments": {
    "new_player_ui": "variant_b",
    "recommendation_algo": "control"
  }
}
Event Types
# Event type hierarchy
EVENT_TYPES = {
# Playback events (highest volume)
"playback.started": "User started playing content",
"playback.paused": "User paused playback",
"playback.resumed": "User resumed from pause",
"playback.seeked": "User jumped to different position",
"playback.completed": "User finished content (>90% watched)",
"playback.abandoned": "User stopped before completion",
"playback.heartbeat": "Periodic signal during playback",
# Quality events
"quality.buffering_started": "Playback buffering began",
"quality.buffering_ended": "Playback resumed after buffer",
"quality.resolution_changed": "Video quality changed",
"quality.error": "Playback error occurred",
# Browse events
"browse.search": "User performed search",
"browse.page_viewed": "User viewed a page/row",
"browse.content_clicked": "User clicked on content tile",
"browse.content_added_list": "User added to My List",
# Session events
"session.started": "App opened / session began",
"session.ended": "App closed / session ended"
}
Schema Evolution
# Schema registry for safe evolution
class ViewingEventSchema:
"""
Schema evolution rules:
- SAFE: Add optional fields
- UNSAFE: Remove fields, change types, rename
"""
VERSION = "2.1"
# v1.0: Original schema
# v2.0: Added experiments field
# v2.1: Added subtitle_language to payload
REQUIRED_FIELDS = [
"event_id", "event_type", "event_time",
"user_id", "device_id", "payload"
]
OPTIONAL_FIELDS = [
"session_id", "geo", "experiments",
"received_time", "app_version"
]
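To make those rules concrete, here is a sketch of the validator the ingestion layer might run (EnvelopeValidator and ValidationResult are illustrative names; production systems would typically use Avro or JSON Schema backed by a schema registry). The result shape matches the validation_result.is_valid / .error access pattern used by the collector below.

```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    is_valid: bool
    error: str = ""

class EnvelopeValidator:
    """Checks the envelope's required fields and basic payload type."""
    REQUIRED_FIELDS = [
        "event_id", "event_type", "event_time",
        "user_id", "device_id", "payload",
    ]

    def validate(self, event: dict) -> ValidationResult:
        for name in self.REQUIRED_FIELDS:
            if name not in event:
                return ValidationResult(False, f"missing required field: {name}")
        if not isinstance(event["payload"], dict):
            return ValidationResult(False, "payload must be an object")
        # Unknown optional fields pass through: that is what makes
        # "add optional fields" a SAFE evolution.
        return ValidationResult(True)
```

Tolerating unrecognized fields is the key design choice: an old validator keeps accepting v2.1 events that carry fields it has never seen.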
Interviewer: "How do you handle the 1 million events per second throughput?"
Deep Dive 2: High-Throughput Ingestion (Day 1 Concept)
You: "At 1M events/second, we need a carefully designed ingestion pipeline."
Ingestion Architecture
HIGH-THROUGHPUT INGESTION
HIGH-THROUGHPUT INGESTION

REGIONAL INGESTION (5 regions); each region handles ~200K events/sec.
The US-EAST region shown below is mirrored in every region:

 CDN (CloudFront)
       │
       ▼
 API Gateway (ALB + Lambda or ECS)
   - TLS termination
   - Rate limiting per device
   - Request validation
       │
       ▼
 Event Collectors (20 instances)
   - Schema validation
   - Enrichment (geo, device info)
   - Batching (100 events or 100ms)
   - Async produce to Kafka
       │
       ▼
 Regional Kafka Cluster
   - 50 partitions per topic
   - Replication factor 3
   - 7-day retention

Cross-region replication to central data lake
Implementation
# Ingestion service optimized for throughput
from dataclasses import dataclass
from typing import List, Dict
import asyncio
from datetime import datetime
@dataclass
class IngestionConfig:
batch_size: int = 100
batch_timeout_ms: int = 100
max_concurrent_batches: int = 50
kafka_linger_ms: int = 5
kafka_batch_size: int = 16384
class HighThroughputCollector:
"""
Event collector optimized for 200K+ events/second per instance.
Key optimizations:
1. Async everywhere - no blocking
2. Batch events before Kafka produce
3. Schema validation with compiled validators
4. Connection pooling for Kafka
"""
    def __init__(self, config: IngestionConfig, kafka_producer, validator,
                 collector_id: str = "collector-1", region: str = "us-east"):
        self.config = config
        self.producer = kafka_producer
        self.validator = validator
        # Identity used by _enrich(); injected per instance in production
        self.collector_id = collector_id
        self.region = region
# Event batching
self.batch: List[Dict] = []
self.batch_lock = asyncio.Lock()
self.last_flush = datetime.utcnow()
# Metrics
self.events_received = 0
self.events_validated = 0
self.events_produced = 0
self.validation_errors = 0
async def ingest(self, events: List[Dict]) -> Dict:
"""
Ingest a batch of events from client.
Returns acknowledgment immediately after validation.
Kafka produce happens asynchronously.
"""
results = {"accepted": 0, "rejected": 0, "errors": []}
for event in events:
self.events_received += 1
# Fast validation (compiled schema)
validation_result = self.validator.validate(event)
if not validation_result.is_valid:
results["rejected"] += 1
results["errors"].append({
"event_id": event.get("event_id"),
"error": validation_result.error
})
self.validation_errors += 1
continue
# Enrich event
enriched = self._enrich(event)
# Add to batch (non-blocking)
await self._add_to_batch(enriched)
results["accepted"] += 1
self.events_validated += 1
return results
def _enrich(self, event: Dict) -> Dict:
"""Add server-side enrichments."""
event["received_time"] = datetime.utcnow().isoformat() + "Z"
event["collector_id"] = self.collector_id
event["processing_region"] = self.region
# Add derived fields
if "geo" in event and "country" in event["geo"]:
event["geo"]["continent"] = self._country_to_continent(
event["geo"]["country"]
)
return event
async def _add_to_batch(self, event: Dict):
"""Add event to batch, flush if needed."""
async with self.batch_lock:
self.batch.append(event)
should_flush = (
len(self.batch) >= self.config.batch_size or
self._batch_age_ms() >= self.config.batch_timeout_ms
)
if should_flush:
batch_to_send = self.batch
self.batch = []
self.last_flush = datetime.utcnow()
# Fire and forget - don't await
asyncio.create_task(self._produce_batch(batch_to_send))
async def _produce_batch(self, batch: List[Dict]):
"""Produce batch to Kafka."""
# Group by topic
by_topic = {}
for event in batch:
topic = self._get_topic(event["event_type"])
if topic not in by_topic:
by_topic[topic] = []
by_topic[topic].append(event)
# Produce to each topic
for topic, events in by_topic.items():
for event in events:
# Partition by user_id for ordering
partition_key = event.get("user_id", event["event_id"])
await self.producer.send(
topic=topic,
key=partition_key,
value=event
)
self.events_produced += 1
def _get_topic(self, event_type: str) -> str:
"""Route event to appropriate Kafka topic."""
if event_type.startswith("playback."):
return "events.playback"
elif event_type.startswith("quality."):
return "events.quality"
elif event_type.startswith("browse."):
return "events.browse"
else:
return "events.other"
def _batch_age_ms(self) -> int:
"""Get age of current batch in milliseconds."""
return int((datetime.utcnow() - self.last_flush).total_seconds() * 1000)
def _country_to_continent(self, country: str) -> str:
"""Map country to continent."""
# Simplified mapping
mapping = {
"US": "NA", "CA": "NA", "MX": "NA",
"GB": "EU", "DE": "EU", "FR": "EU",
"JP": "AS", "KR": "AS", "IN": "AS",
# ... more mappings
}
return mapping.get(country, "OTHER")
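One thing the collector leaves implicit is deduplication: clients retry on timeout, so the same event_id (the envelope's idempotency key) can arrive more than once. A minimal in-memory sketch; a real deployment would use a shared store such as Redis SET NX with an expiry so all collector instances agree:

```python
import time

class EventDeduplicator:
    """Remembers recently seen event_ids for a TTL window.

    In-memory sketch only: per-process state is fine for a single
    collector, but cross-instance dedup needs a shared store.
    """
    def __init__(self, ttl_seconds: int = 3600, max_entries: int = 100_000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.seen: dict[str, float] = {}  # event_id -> first-seen time

    def is_duplicate(self, event_id: str) -> bool:
        now = time.monotonic()
        # Opportunistic expiry to bound memory
        if len(self.seen) > self.max_entries:
            self.seen = {k: t for k, t in self.seen.items()
                         if now - t < self.ttl}
        first_seen = self.seen.get(event_id)
        if first_seen is not None and now - first_seen < self.ttl:
            return True
        self.seen[event_id] = now
        return False

dedup = EventDeduplicator()
assert not dedup.is_duplicate("evt_abc123")  # first delivery: accept
assert dedup.is_duplicate("evt_abc123")      # client retry: drop
```

Dropping duplicates here is cheap insurance; downstream consumers still deduplicate by event_id for exactness, since the TTL window is finite.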
Interviewer: "Good. Now tell me about the streaming layer - how do you calculate trending content in real-time?"
Deep Dive 3: Real-Time Trending Calculation (Day 2 Concept)
You: "The trending calculation is a classic streaming problem. We need to balance freshness with accuracy."
The Problem
TRENDING CALCULATION CHALLENGES
Naive approach (count plays in last hour):
├── Favors long content (more heartbeats)
├── Doesn't account for library size
├── New releases always dominate
└── Spikes from bots/fraud not detected
Better approach:
├── Weight by unique viewers, not events
├── Normalize by content age and genre
├── Compare to baseline (is this unusually popular?)
└── Decay older activity (recent matters more)
Trending Algorithm
# Streaming trending calculation with Flink
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional
import math
@dataclass
class TrendingScore:
"""Trending score for a piece of content."""
content_id: str
score: float
unique_viewers_1h: int
unique_viewers_24h: int
velocity: float # Rate of change
percentile_rank: float
updated_at: datetime
class TrendingCalculator:
"""
Calculates trending scores using streaming data.
Algorithm:
1. Count unique viewers in sliding windows
2. Calculate velocity (change in viewership)
3. Normalize by content category baseline
4. Apply time decay for recency
5. Combine into final score
"""
def __init__(self):
# Viewer counts per content per window
self.viewers_1h: Dict[str, set] = {} # Last 1 hour
self.viewers_24h: Dict[str, set] = {} # Last 24 hours
# Historical baselines by category
self.category_baselines: Dict[str, float] = {}
# Previous scores for velocity calculation
self.previous_scores: Dict[str, float] = {}
def process_play_event(self, event: Dict):
"""
Process a playback event and update trending state.
"""
content_id = event["payload"]["content_id"]
user_id = event["user_id"]
# Add to viewer sets (HyperLogLog in production)
if content_id not in self.viewers_1h:
self.viewers_1h[content_id] = set()
self.viewers_24h[content_id] = set()
self.viewers_1h[content_id].add(user_id)
self.viewers_24h[content_id].add(user_id)
def calculate_trending_scores(
self,
content_metadata: Dict[str, Dict]
) -> list[TrendingScore]:
"""
Calculate trending scores for all content.
Called every 30-60 seconds.
"""
scores = []
for content_id, viewers_1h in self.viewers_1h.items():
if len(viewers_1h) < 10: # Minimum threshold
continue
metadata = content_metadata.get(content_id, {})
category = metadata.get("category", "unknown")
# Base score: unique viewers in last hour
base_score = len(viewers_1h)
# Velocity: how fast is viewership growing?
viewers_24h = len(self.viewers_24h.get(content_id, set()))
velocity = self._calculate_velocity(
content_id, base_score, viewers_24h
)
# Normalize by category baseline
baseline = self.category_baselines.get(category, 1000)
normalized_score = base_score / baseline
# Apply velocity boost (trending UP gets bonus)
velocity_multiplier = 1 + (velocity * 0.5) if velocity > 0 else 1
# Final score
final_score = normalized_score * velocity_multiplier
# Store for next velocity calculation
self.previous_scores[content_id] = base_score
scores.append(TrendingScore(
content_id=content_id,
score=final_score,
unique_viewers_1h=base_score,
unique_viewers_24h=viewers_24h,
velocity=velocity,
percentile_rank=0, # Calculated after sorting
updated_at=datetime.utcnow()
))
# Sort and assign percentile ranks
scores.sort(key=lambda x: x.score, reverse=True)
for i, score in enumerate(scores):
score.percentile_rank = 1 - (i / len(scores))
return scores
def _calculate_velocity(
self,
content_id: str,
current_1h: int,
current_24h: int
) -> float:
"""
Calculate velocity (rate of change) of viewership.
Positive = trending up, Negative = trending down
"""
previous = self.previous_scores.get(content_id, current_1h)
if previous == 0:
return 0
# Percentage change from previous calculation
change = (current_1h - previous) / previous
# Also factor in 1h vs 24h ratio
# If 1h viewers >> 24h/24 average, content is spiking
hourly_avg_24h = current_24h / 24
spike_factor = current_1h / hourly_avg_24h if hourly_avg_24h > 0 else 1
return change * spike_factor
def expire_old_data(self, window_cutoff: datetime):
"""
Remove old viewer data from windows.
In production, use Flink's windowing with watermarks.
"""
# Implementation would use event time windows
pass
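The plain set() above is the memory bottleneck at 10M viewers; the "HyperLogLog in production" note can be illustrated with a toy implementation (for exposition only; real systems use Redis PFADD/PFCOUNT or a tested library):

```python
import hashlib
import math

class TinyHyperLogLog:
    """Toy HyperLogLog: ~1KB of registers per counter instead of
    one set entry per unique viewer."""
    def __init__(self, p: int = 10):
        self.p = p                     # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str):
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        est = self.alpha * self.m ** 2 / sum(2 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:              # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = TinyHyperLogLog()
for i in range(50_000):
    hll.add(f"usr_{i}")
print(hll.count())  # close to 50,000 (a few percent error is typical)
```

The trade is exactness for memory: with p=10 the standard error is about 3%, which is acceptable for a trending score but not for billing-grade metrics.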
# Flink job definition (pseudo-code)
FLINK_TRENDING_JOB = """
-- Flink SQL for trending calculation
-- Step 1: Deduplicate events by user per content per window
CREATE VIEW unique_viewers AS
SELECT
content_id,
COUNT(DISTINCT user_id) as viewer_count,
TUMBLE_END(event_time, INTERVAL '1' HOUR) as window_end
FROM playback_events
WHERE event_type = 'playback.started'
GROUP BY
content_id,
TUMBLE(event_time, INTERVAL '1' HOUR);
-- Step 2: Calculate velocity using LAG
CREATE VIEW trending_with_velocity AS
SELECT
content_id,
viewer_count,
window_end,
viewer_count - LAG(viewer_count) OVER (
PARTITION BY content_id
ORDER BY window_end
) as velocity
FROM unique_viewers;
-- Step 3: Join with metadata and calculate final score
CREATE VIEW trending_scores AS
SELECT
t.content_id,
t.viewer_count,
t.velocity,
t.viewer_count / m.category_baseline *
(1 + GREATEST(t.velocity, 0) * 0.5) as trending_score,
t.window_end
FROM trending_with_velocity t
JOIN content_metadata m ON t.content_id = m.content_id;
"""
Output to ClickHouse
# Sink trending scores to ClickHouse for dashboards
class TrendingClickHouseSink:
"""
Writes trending scores to ClickHouse for dashboard queries.
"""
async def write_scores(self, scores: list[TrendingScore]):
"""
Write batch of trending scores.
Uses ReplacingMergeTree to handle updates.
"""
values = [
(
s.content_id,
s.score,
s.unique_viewers_1h,
s.unique_viewers_24h,
s.velocity,
s.percentile_rank,
s.updated_at
)
for s in scores
]
await self.client.execute(
"""
INSERT INTO trending_scores
(content_id, score, viewers_1h, viewers_24h,
velocity, percentile_rank, updated_at)
VALUES
""",
values
)
CLICKHOUSE_SCHEMA = """
CREATE TABLE trending_scores (
content_id String,
score Float64,
viewers_1h UInt32,
viewers_24h UInt32,
velocity Float64,
percentile_rank Float64,
updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY content_id;
-- Materialized view for top 100 trending
CREATE MATERIALIZED VIEW top_trending
ENGINE = Memory
AS
SELECT *
FROM trending_scores
ORDER BY score DESC
LIMIT 100;
"""
Interviewer: "How do you handle late data from mobile devices that were offline?"
Deep Dive 4: Late Data Handling (Day 4 Concept)
You: "Late data is particularly challenging for a streaming service because mobile users often watch offline on planes or subways."
The Problem
LATE DATA SCENARIOS
Scenario 1: Airplane mode viewing
├── User downloads movie at airport (10:00 AM)
├── Watches on 3-hour flight
├── Events buffered on device
├── Lands and syncs at 1:00 PM
└── Events arrive 3 hours late
Scenario 2: Subway commute
├── User starts episode at 8:00 AM
├── Goes underground, loses connectivity
├── Watches 30 minutes, events buffer
├── Emerges at 8:45 AM, events sync
└── Events arrive 45 minutes late
Scenario 3: App in background
├── User pauses and switches apps
├── App is killed by OS
├── Events in memory lost?
└── Need client-side persistence
Impact without handling:
├── Trending misses popular content
├── Completion rates underreported
├── Geographic data skewed to connected areas
└── A/B test results biased toward always-connected users
Solution Architecture
LATE DATA HANDLING ARCHITECTURE
CLIENT-SIDE
  Event Buffer (SQLite on device)
  ├── Persist events immediately on generation
  ├── Retry sending on connectivity
  ├── Include original event_time (not send time)
  └── Delete only after server acknowledgment
        │
        ▼
SERVER-SIDE
  Ingestion Layer
  ├── Accept events regardless of age
  ├── Tag with received_time for lateness tracking
  ├── Route old events (>1hr) to late_events topic
  └── Route recent events to normal topics
        │
        ├──────────────────────────┐
        ▼                          ▼
  Streaming (Flink)          Late Events Queue
  ├── 5-min watermark        ├── Batch process
  ├── On-time results        ├── Daily catchup
  └── "Provisional"          └── "Final" numbers
        │                          │
        ▼                          ▼
  ClickHouse                 BigQuery
  (Real-time view)           (Final view)
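The client-side buffer in the diagram can be sketched with the stdlib sqlite3 module (table and method names are hypothetical):

```python
import json
import sqlite3

class ClientEventBuffer:
    """Persist events at generation time; delete only after server ack."""
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS pending_events (
                event_id   TEXT PRIMARY KEY,
                event_time TEXT NOT NULL,  -- original event time, not send time
                body       TEXT NOT NULL
            )""")

    def record(self, event: dict):
        # INSERT OR IGNORE: re-recording after a crash/restart is harmless
        self.db.execute(
            "INSERT OR IGNORE INTO pending_events VALUES (?, ?, ?)",
            (event["event_id"], event["event_time"], json.dumps(event)))
        self.db.commit()

    def pending(self, limit: int = 100) -> list[dict]:
        rows = self.db.execute(
            "SELECT body FROM pending_events ORDER BY event_time LIMIT ?",
            (limit,))
        return [json.loads(r[0]) for r in rows]

    def acknowledge(self, event_ids: list[str]):
        self.db.executemany(
            "DELETE FROM pending_events WHERE event_id = ?",
            [(eid,) for eid in event_ids])
        self.db.commit()
```

The upload loop would call pending(), POST the batch, and call acknowledge() only for ids the server confirmed, which is what makes delivery at-least-once end to end.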
Implementation
# Late data handling with versioned results
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from enum import Enum
class DataFreshness(Enum):
REALTIME = "realtime" # Streaming result, may change
PRELIMINARY = "preliminary" # T+1, most late data included
FINAL = "final" # T+7, official number
@dataclass
class ViewingMetric:
"""A metric with versioning for late data."""
content_id: str
date: str
metric_name: str
value: float
freshness: DataFreshness
computed_at: datetime
event_count: int
late_event_count: int
class LateDataAwareAggregator:
"""
Aggregates viewing metrics with late data handling.
Maintains multiple versions:
- Real-time: Updated every minute, provisional
- Preliminary: Updated T+1, includes most late data
- Final: Updated T+7, official numbers
"""
def __init__(self, metrics_store):
self.store = metrics_store
# Late data tracking
self.lateness_histogram: Dict[str, int] = {}
async def process_event(self, event: Dict):
"""
Process event and update appropriate metrics version.
"""
        # fromisoformat() accepts the trailing "Z" only on Python 3.11+,
        # so normalize it for older runtimes
        event_time = datetime.fromisoformat(
            event["event_time"].replace("Z", "+00:00"))
        received_time = datetime.fromisoformat(
            event["received_time"].replace("Z", "+00:00"))
lateness = received_time - event_time
lateness_bucket = self._get_lateness_bucket(lateness)
self.lateness_histogram[lateness_bucket] = \
self.lateness_histogram.get(lateness_bucket, 0) + 1
event_date = event_time.date().isoformat()
# Determine which version to update
if lateness < timedelta(minutes=5):
# On-time: update real-time
await self._update_realtime(event, event_date)
elif lateness < timedelta(hours=24):
# Moderately late: queue for preliminary
await self._queue_for_preliminary(event, event_date)
else:
# Very late: queue for final processing
await self._queue_for_final(event, event_date)
async def _update_realtime(self, event: Dict, event_date: str):
"""Update real-time metrics (streaming)."""
content_id = event["payload"]["content_id"]
# Increment view count
await self.store.increment(
content_id=content_id,
date=event_date,
metric="views",
freshness=DataFreshness.REALTIME
)
# Update completion if applicable
if event["event_type"] == "playback.completed":
await self.store.increment(
content_id=content_id,
date=event_date,
metric="completions",
freshness=DataFreshness.REALTIME
)
async def _queue_for_preliminary(self, event: Dict, event_date: str):
"""Queue late event for preliminary processing."""
await self.store.queue_late_event(
event=event,
target_freshness=DataFreshness.PRELIMINARY
)
async def _queue_for_final(self, event: Dict, event_date: str):
"""Queue very late event for final processing."""
await self.store.queue_late_event(
event=event,
target_freshness=DataFreshness.FINAL
)
async def run_preliminary_job(self, target_date: str):
"""
Run T+1 job to produce preliminary numbers.
Scheduled to run at 6 AM for previous day.
"""
# Read all events for target_date from data lake
events = await self.store.read_events_for_date(target_date)
# Also read late events that arrived after streaming cutoff
late_events = await self.store.read_late_events(
target_date=target_date,
max_lateness=timedelta(hours=30)
)
all_events = events + late_events
# Compute aggregates
metrics = self._compute_daily_metrics(all_events, target_date)
# Store as preliminary
for metric in metrics:
metric.freshness = DataFreshness.PRELIMINARY
await self.store.save_metric(metric)
async def run_final_job(self, target_date: str):
"""
Run T+7 job to produce final numbers.
Scheduled to run 7 days after target_date.
"""
# Read ALL events including very late ones
events = await self.store.read_events_for_date(
target_date,
include_very_late=True
)
# Compute final aggregates
metrics = self._compute_daily_metrics(events, target_date)
# Store as final (immutable after this)
for metric in metrics:
metric.freshness = DataFreshness.FINAL
await self.store.save_metric(metric)
def _compute_daily_metrics(
self,
events: List[Dict],
date: str
) -> List[ViewingMetric]:
"""Compute all metrics for a day."""
by_content: Dict[str, Dict] = {}
for event in events:
content_id = event["payload"]["content_id"]
if content_id not in by_content:
by_content[content_id] = {
"views": 0,
"unique_viewers": set(),
"completions": 0,
"total_watch_time": 0,
"event_count": 0,
"late_event_count": 0
}
stats = by_content[content_id]
stats["event_count"] += 1
# Check if late
event_time = datetime.fromisoformat(event["event_time"])
received_time = datetime.fromisoformat(event["received_time"])
if (received_time - event_time) > timedelta(minutes=5):
stats["late_event_count"] += 1
if event["event_type"] == "playback.started":
stats["views"] += 1
stats["unique_viewers"].add(event["user_id"])
if event["event_type"] == "playback.completed":
stats["completions"] += 1
if event["event_type"] == "playback.heartbeat":
stats["total_watch_time"] += 10 # 10 second heartbeats
# Convert to metrics
metrics = []
for content_id, stats in by_content.items():
metrics.extend([
ViewingMetric(
content_id=content_id,
date=date,
metric_name="views",
value=stats["views"],
freshness=DataFreshness.PRELIMINARY,
computed_at=datetime.utcnow(),
event_count=stats["event_count"],
late_event_count=stats["late_event_count"]
),
ViewingMetric(
content_id=content_id,
date=date,
metric_name="unique_viewers",
value=len(stats["unique_viewers"]),
freshness=DataFreshness.PRELIMINARY,
computed_at=datetime.utcnow(),
event_count=stats["event_count"],
late_event_count=stats["late_event_count"]
),
ViewingMetric(
content_id=content_id,
date=date,
metric_name="completion_rate",
value=stats["completions"] / stats["views"] if stats["views"] > 0 else 0,
freshness=DataFreshness.PRELIMINARY,
computed_at=datetime.utcnow(),
event_count=stats["event_count"],
late_event_count=stats["late_event_count"]
)
])
return metrics
def _get_lateness_bucket(self, lateness: timedelta) -> str:
"""Bucket lateness for histogram."""
seconds = lateness.total_seconds()
if seconds < 60:
return "<1min"
elif seconds < 300:
return "1-5min"
elif seconds < 3600:
return "5-60min"
elif seconds < 86400:
return "1-24hr"
else:
return ">24hr"
Interviewer: "How would you design the data model for the content analytics that creators see?"
Deep Dive 5: Star Schema for Content Analytics (Day 3 Concept)
You: "For content analytics, I'd use a classic star schema optimized for the queries creators care about."
Data Model
                CONTENT ANALYTICS STAR SCHEMA

                      ┌─────────────────────┐
                      │     dim_content     │
                      │ ─────────────────── │
                      │ content_key (PK)    │
                      │ content_id          │
                      │ title               │
                      │ type (movie/series) │
                      │ genre               │
                      │ rating              │
                      │ release_date        │
                      │ duration_minutes    │
                      │ studio              │
                      │ creator_id          │
                      └──────────┬──────────┘
                                 │
┌─────────────────┐              │              ┌─────────────────┐
│    dim_date     │              │              │   dim_viewer    │
│ ─────────────── │              │              │ ─────────────── │
│ date_key (PK)   │              │              │ viewer_key (PK) │
│ date            │              ▼              │ age_group       │
│ day_of_week     │      ┌───────┴──────┐      │ gender          │
│ month           │      │              │      │ country         │
│ quarter         ├──────┤  fact_views  ├─────►│ subscription    │
│ year            │      │  ─────────   │      │ tenure_months   │
│ is_weekend      │      │  view_key    │      │ device_pref     │
│ is_holiday      │      │  date_key    │      └─────────────────┘
└─────────────────┘      │  content_key │
                         │  viewer_key  │      ┌─────────────────┐
                         │  device_key  │      │   dim_device    │
                         │  geo_key     │      │ ─────────────── │
                         │              │      │ device_key (PK) │
                         │  views       │      │ device_type     │
                         │  unique_vwrs ├─────►│ os              │
                         │  watch_mins  │      │ app_version     │
                         │  completions │      │ screen_size     │
                         │  avg_pct     │      └─────────────────┘
                         └──────┬───────┘
                                │
                                │      ┌─────────────────┐
                                │      │     dim_geo     │
                                │      │ ─────────────── │
                                └─────►│ geo_key (PK)    │
                                       │ country         │
                                       │ region          │
                                       │ city            │
                                       │ timezone        │
                                       └─────────────────┘
Schema Implementation
# Content analytics data model
BIGQUERY_SCHEMA = """
-- =============================================================================
-- Dimension Tables
-- =============================================================================
-- Content dimension (SCD Type 2 for metadata changes)
CREATE TABLE dim_content (
content_key INT64 NOT NULL,
content_id STRING NOT NULL,
title STRING NOT NULL,
content_type STRING NOT NULL, -- 'movie', 'series', 'documentary'
genre STRING,
maturity_rating STRING,
release_date DATE,
duration_minutes INT64,
studio STRING,
creator_id STRING,
-- SCD Type 2 fields
valid_from DATE NOT NULL,
valid_to DATE,
is_current BOOL NOT NULL,
-- Metadata
created_at TIMESTAMP,
updated_at TIMESTAMP
);
-- Date dimension (generated)
CREATE TABLE dim_date (
date_key INT64 NOT NULL, -- YYYYMMDD
date DATE NOT NULL,
day_of_week INT64,
day_name STRING,
month INT64,
month_name STRING,
quarter INT64,
year INT64,
is_weekend BOOL,
is_holiday BOOL
);
-- Viewer dimension (anonymized/bucketed for privacy)
CREATE TABLE dim_viewer (
viewer_key INT64 NOT NULL,
age_group STRING, -- '18-24', '25-34', etc.
gender STRING,
country STRING,
subscription_type STRING, -- 'basic', 'standard', 'premium'
tenure_months INT64,
preferred_device STRING
);
-- Device dimension
CREATE TABLE dim_device (
device_key INT64 NOT NULL,
device_type STRING, -- 'mobile', 'tv', 'web', 'gaming'
operating_system STRING,
app_version_major INT64,
screen_category STRING -- 'small', 'medium', 'large'
);
-- Geography dimension
CREATE TABLE dim_geo (
geo_key INT64 NOT NULL,
country STRING,
region STRING,
timezone STRING
);
-- =============================================================================
-- Fact Tables
-- =============================================================================
-- Daily viewing facts (grain: content × date × viewer_segment × device × geo)
CREATE TABLE fact_daily_views (
date_key INT64 NOT NULL,
content_key INT64 NOT NULL,
viewer_key INT64 NOT NULL,
device_key INT64 NOT NULL,
geo_key INT64 NOT NULL,
-- Measures
view_count INT64,
unique_viewers INT64, -- Approximate, using HLL
watch_minutes INT64,
completion_count INT64,
-- Derived measures
avg_completion_pct FLOAT64,
avg_watch_minutes FLOAT64,
-- Data quality
freshness STRING, -- 'realtime', 'preliminary', 'final'
late_event_pct FLOAT64,
    -- Partitioning and clustering
    -- (BigQuery reserves _PARTITIONDATE as a pseudo-column, so declare an
    -- explicit partition column instead)
    partition_date DATE NOT NULL
)
PARTITION BY partition_date
CLUSTER BY content_key, geo_key;
-- =============================================================================
-- Pre-Aggregated Rollup Tables
-- =============================================================================
-- Daily content summary (most common query)
CREATE TABLE agg_daily_content (
date DATE NOT NULL,
content_id STRING NOT NULL,
total_views INT64,
unique_viewers INT64,
watch_hours FLOAT64,
completions INT64,
completion_rate FLOAT64,
-- Breakdown by device
mobile_views INT64,
tv_views INT64,
web_views INT64,
-- Breakdown by top geos
us_views INT64,
uk_views INT64,
other_views INT64,
freshness STRING,
updated_at TIMESTAMP
)
PARTITION BY date
CLUSTER BY content_id;
-- Weekly content summary
CREATE TABLE agg_weekly_content (
week_start DATE NOT NULL,
content_id STRING NOT NULL,
total_views INT64,
unique_viewers INT64,
watch_hours FLOAT64,
avg_daily_views FLOAT64,
peak_day_views INT64,
freshness STRING,
updated_at TIMESTAMP
);
"""
# Query examples for creator dashboard
CREATOR_DASHBOARD_QUERIES = """
-- Query 1: Content performance summary (last 30 days)
-- Uses agg_daily_content rollup for speed
SELECT
c.title,
c.content_type,
SUM(a.total_views) as total_views,
    SUM(a.unique_viewers) as unique_viewers,  -- approximate: summing daily uniques double-counts repeat viewers
ROUND(SUM(a.watch_hours), 1) as watch_hours,
ROUND(AVG(a.completion_rate) * 100, 1) as avg_completion_pct,
SUM(a.mobile_views) / SUM(a.total_views) * 100 as mobile_pct
FROM agg_daily_content a
JOIN dim_content c ON a.content_id = c.content_id
WHERE a.date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND c.creator_id = @creator_id
AND c.is_current = TRUE
GROUP BY c.title, c.content_type
ORDER BY total_views DESC;
-- Query 2: Daily trend for specific content
SELECT
a.date,
a.total_views,
a.unique_viewers,
a.completion_rate,
a.freshness
FROM agg_daily_content a
WHERE a.content_id = @content_id
AND a.date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
ORDER BY a.date;
-- Query 3: Audience breakdown (uses fact table for detail)
SELECT
v.age_group,
v.gender,
g.country,
SUM(f.unique_viewers) as viewers,
AVG(f.avg_completion_pct) as avg_completion
FROM fact_daily_views f
JOIN dim_viewer v ON f.viewer_key = v.viewer_key
JOIN dim_geo g ON f.geo_key = g.geo_key
JOIN dim_content c ON f.content_key = c.content_key
WHERE c.content_id = @content_id
AND f.date_key >= @start_date_key
AND f.date_key <= @end_date_key
GROUP BY v.age_group, v.gender, g.country
ORDER BY viewers DESC;
-- Query 4: Comparison to similar content
WITH my_content AS (
SELECT
SUM(total_views) as views,
AVG(completion_rate) as completion_rate
FROM agg_daily_content
WHERE content_id = @content_id
AND date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
),
similar_content AS (
SELECT
AVG(a.total_views) as avg_views,
AVG(a.completion_rate) as avg_completion_rate
FROM agg_daily_content a
JOIN dim_content c ON a.content_id = c.content_id
WHERE c.genre = @genre
AND c.content_type = @content_type
AND a.date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
)
SELECT
m.views,
m.completion_rate,
s.avg_views as category_avg_views,
s.avg_completion_rate as category_avg_completion,
m.views / s.avg_views as views_vs_category,
m.completion_rate / s.avg_completion_rate as completion_vs_category
FROM my_content m, similar_content s;
"""
Interviewer: "Good. How do you keep the queries fast for creators when there's so much data?"
Deep Dive 6: Query Optimization (Day 5 Concept)
You: "Query optimization for creator analytics requires a multi-layered approach."
Optimization Strategy
QUERY OPTIMIZATION LAYERS

Layer 1: PRE-AGGREGATION
────────────────────────
├── agg_daily_content: Most common queries hit this
├── agg_weekly_content: Week-over-week comparisons
├── Refreshed: Daily at 6 AM (after preliminary job)
└── Result: 90% of queries use rollups, not fact tables

Layer 2: CACHING
────────────────
├── Redis: Dashboard results (5-minute TTL)
├── CDN: Static dashboard assets
├── Query cache: Repeated queries (content_id specific)
└── Result: 80% cache hit rate on dashboards

Layer 3: PARTITIONING
─────────────────────
├── Partition by date (most queries filter by date range)
├── Cluster by content_id (most queries filter by content)
└── Result: Queries scan <1% of data

Layer 4: MATERIALIZED VIEWS
───────────────────────────
├── top_100_by_genre: Pre-computed rankings
├── creator_summary: Aggregated metrics per creator
├── Refreshed: Hourly
└── Result: Instant response for common questions
Caching Implementation
# Multi-layer caching for creator dashboard
import asyncio
from datetime import timedelta

class CreatorDashboardCache:
    """
    Caching strategy for creator dashboard queries.

    Assumes an async Redis client whose get/setex handle (de)serialization
    of the cached dicts (e.g. a thin JSON wrapper around redis.asyncio).
    """
def __init__(self, redis_client, bigquery_client):
self.redis = redis_client
self.bq = bigquery_client
# TTL configuration by query type
self.ttl_config = {
"summary_30d": timedelta(minutes=5), # Updates frequently
"daily_trend": timedelta(minutes=15), # Less frequent
"audience_breakdown": timedelta(hours=1), # Slow-changing
"comparison": timedelta(hours=1), # Slow-changing
}
async def get_content_summary(
self,
creator_id: str,
days: int = 30
) -> dict:
"""
Get content performance summary with caching.
"""
cache_key = f"creator:{creator_id}:summary:{days}d"
# Check cache
cached = await self.redis.get(cache_key)
if cached:
return {
"data": cached,
"cache_hit": True,
"freshness": "cached"
}
# Query BigQuery
result = await self._query_content_summary(creator_id, days)
# Cache result
ttl = self.ttl_config["summary_30d"]
await self.redis.setex(cache_key, ttl.total_seconds(), result)
return {
"data": result,
"cache_hit": False,
"freshness": result.get("freshness", "unknown")
}
async def get_daily_trend(
self,
content_id: str,
days: int = 90
) -> dict:
"""
Get daily viewing trend for content.
"""
cache_key = f"content:{content_id}:trend:{days}d"
cached = await self.redis.get(cache_key)
if cached:
return {"data": cached, "cache_hit": True}
result = await self._query_daily_trend(content_id, days)
ttl = self.ttl_config["daily_trend"]
await self.redis.setex(cache_key, ttl.total_seconds(), result)
return {"data": result, "cache_hit": False}
async def invalidate_for_content(self, content_id: str):
"""
Invalidate all caches for a content item.
Called when new data arrives.
"""
pattern = f"content:{content_id}:*"
keys = await self.redis.keys(pattern)
if keys:
await self.redis.delete(*keys)
async def warm_cache_for_creator(self, creator_id: str):
"""
Pre-warm cache for a creator's common queries.
Called when creator logs into dashboard.
"""
# Fire off all common queries in parallel
await asyncio.gather(
self.get_content_summary(creator_id, 30),
self.get_content_summary(creator_id, 7),
# Don't pre-warm content-specific queries
# (too many possible content_ids)
)
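The TTL-per-query-type behavior can be exercised without a live Redis. The sketch below is a minimal in-memory stand-in for SETEX/GET-style caching with lazy expiry on read (the `TTLCache` class and the injectable clock are my assumptions for testability, not part of the dashboard code above):

```python
import time
from typing import Any, Dict, Optional, Tuple

class TTLCache:
    """Minimal in-memory cache with per-entry TTLs, mimicking Redis SETEX/GET."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock                              # injectable for testing
        self._store: Dict[str, Tuple[float, Any]] = {}   # key -> (expires_at, value)

    def setex(self, key: str, ttl_seconds: float, value: Any) -> None:
        """Store value, expiring ttl_seconds from now."""
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        """Return the value if present and unexpired; expire lazily on read."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]
            return None
        return value

# TTLs mirroring the dashboard config: short for fast-moving summaries.
TTLS = {"summary_30d": 300, "daily_trend": 900, "audience_breakdown": 3600}
```

Swapping this for Redis in tests keeps the TTL logic observable without the round-trip, while production keeps the shared cache.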
Phase 5: Scaling and Edge Cases (5 minutes)
Interviewer: "How would this system scale to 10x the current load?"
Scaling Strategy
You: "Let me walk through scaling each layer."
SCALING TO 10X (10M events/second, 100M concurrent viewers)

INGESTION LAYER
───────────────
Current: 20 collectors per region
10x: 200 collectors per region
     OR use Kinesis/Kafka as first hop (auto-scaling)

Strategy:
├── Horizontal scaling (stateless collectors)
├── Regional Kafka clusters (reduce cross-region traffic)
├── Client-side sampling for heartbeats (10% sample at 10x)
└── Dedicated clusters for high-volume event types
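For the client-side heartbeat sampling mentioned above, the decision should be deterministic per session rather than random per event, so a sampled session keeps all of its heartbeats and per-session watch time stays reconstructable. A sketch of one common hash-based approach (the function and SHA-256 bucketing scheme are my assumptions, not a specified part of the design):

```python
import hashlib

def keep_heartbeat(session_id: str, sample_pct: int = 10) -> bool:
    """
    Deterministically keep sample_pct% of sessions.

    Hashing the session id (instead of random.random() per event) means a
    sampled session contributes ALL of its heartbeats, so its watch time is
    complete and population totals can be scaled up by 100 / sample_pct.
    """
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_pct
```

Downstream, totals from sampled sessions are multiplied by 100 / sample_pct to estimate the full population; non-heartbeat events (play, pause, complete) are never sampled.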
KAFKA
─────
Current: 50 partitions per topic
10x: 500 partitions per topic

Strategy:
├── More partitions = more parallelism
├── Tiered storage (hot in Kafka, warm in S3)
├── Consider separating heartbeats to dedicated cluster
└── MirrorMaker 2 for cross-region replication

STREAMING (FLINK)
─────────────────
Current: 100 workers
10x: 1000 workers

Strategy:
├── Scale horizontally (Flink scales well)
├── Increase parallelism per operator
├── State backend: RocksDB with incremental checkpoints
├── Consider sampling for trending (don't need every event)
└── Key-based scaling (partition by user_id or content_id)

SERVING (CLICKHOUSE)
────────────────────
Current: 20 nodes
10x: 100+ nodes OR switch to managed service

Strategy:
├── Add more replicas for read scaling
├── Sharding by content_id for write scaling
├── More aggressive pre-aggregation
└── Consider Druid for highest concurrency

COST CONSIDERATION
──────────────────
At 10x scale:
├── Sampling heartbeats: 90% cost reduction on that event type
├── Tiered storage: 50% storage cost reduction
├── Pre-aggregation: 80% query cost reduction
├── Regional processing: 30% egress cost reduction
└── Estimated 10x traffic = 3-4x cost (with optimizations)
Edge Cases
Interviewer: "What edge cases should we handle?"
You: "Several important ones..."
EDGE CASES

1. VIRAL CONTENT
   Problem: Single title gets 10x normal traffic
   Solution:
   ├── Dedicated Kafka partition for hot content
   ├── Pre-computed trending (don't recalculate on every query)
   ├── Aggressive caching (1-second TTL for hot content)
   └── Circuit breaker on per-content queries

2. NEW RELEASE SPIKE
   Problem: Major release causes traffic spike at specific time
   Solution:
   ├── Pre-scale infrastructure before release
   ├── Warm caches before release (predicted queries)
   ├── Feature flags to disable non-critical analytics
   └── Graceful degradation (show cached, mark as stale)

3. DEVICE CLOCK SKEW
   Problem: Device sends events with wrong timestamps
   Solution:
   ├── Server-side received_time is source of truth
   ├── Reject events with future timestamps > 5 minutes
   ├── Accept late events up to 7 days old
   └── Flag suspicious patterns for investigation
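The clock-skew policy above reduces to a small classification step at ingestion; a sketch (the function name and return labels are illustrative):

```python
from datetime import datetime, timedelta

MAX_FUTURE_SKEW = timedelta(minutes=5)   # tolerate small device clock drift
MAX_LATENESS = timedelta(days=7)         # beyond the final-results window

def classify_event_time(event_time: datetime, received_time: datetime) -> str:
    """
    Apply the clock-skew policy, with server received_time as source of truth.

    Returns 'reject' (timestamp in the future beyond allowed skew),
    'drop' (older than the 7-day late-data window), or 'accept'.
    """
    if event_time > received_time + MAX_FUTURE_SKEW:
        return "reject"
    if received_time - event_time > MAX_LATENESS:
        return "drop"
    return "accept"
```

Rejected and dropped events would still be counted (and sampled into a dead letter queue) so the "flag suspicious patterns" step has data to work with.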
4. BOT/FRAUD TRAFFIC
   Problem: Fake views to inflate metrics
   Solution:
   ├── Device fingerprinting
   ├── Behavioral analysis (impossible view patterns)
   ├── Rate limiting per device/account
   └── Separate raw and validated metrics

5. GDPR DELETION
   Problem: User requests data deletion
   Solution:
   ├── Aggregated data is anonymized (OK to keep)
   ├── Raw events must be deleted
   ├── Daily deletion job processes requests
   └── 30-day SLA for completion
Phase 6: Monitoring and Operations
Interviewer: "How would you monitor this system in production?"
Key Metrics
You: "I'd monitor at multiple levels..."
ANALYTICS PLATFORM MONITORING

INGESTION HEALTH
├── Events/second (by region): Target: 200K/region
├── Validation error rate: Alert if > 1%
├── P99 ingestion latency: Alert if > 500ms
└── Dead letter queue depth: Alert if > 10K

KAFKA HEALTH
├── Consumer lag (by topic): Alert if > 1M events
├── Under-replicated partitions: Alert if > 0
├── Produce latency P99: Alert if > 100ms
└── Disk usage: Alert if > 80%

STREAMING HEALTH
├── Flink checkpoint duration: Alert if > 60s
├── Watermark lag: Alert if > 5 minutes
├── Backpressure ratio: Alert if > 0.5
└── State size: Track growth rate

QUERY HEALTH
├── Dashboard P95 latency: Target < 1s
├── Cache hit rate: Target > 80%
├── Query error rate: Alert if > 0.1%
└── Bytes scanned/query: Track for cost

DATA QUALITY
├── Late data percentage: Track by device type
├── Missing events (gap detection): Alert on gaps
├── Schema validation failures: Alert if > 0.1%
└── Duplicate event rate: Track for dedup health

BUSINESS METRICS
├── Trending freshness: Target < 60s
├── Real-time vs Final delta: Track accuracy
├── Creator dashboard usage: Engagement tracking
└── A/B test data completeness: Compare to expected
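Thresholds like these are easiest to keep honest when expressed as data rather than scattered through alerting configs. A sketch of a rule table plus evaluator (the rule names, values, and severities below are illustrative, loosely following the dashboard above):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AlertRule:
    name: str
    threshold: float
    breached: Callable[[float, float], bool]  # (value, threshold) -> bool
    severity: str

# Declarative rule table; a config loader would normally populate this.
RULES = [
    AlertRule("validation_error_rate", 0.01, lambda v, t: v > t, "medium"),
    AlertRule("consumer_lag_events", 1_000_000, lambda v, t: v > t, "high"),
    AlertRule("checkpoint_duration_s", 60, lambda v, t: v > t, "high"),
    AlertRule("cache_hit_rate", 0.80, lambda v, t: v < t, "medium"),  # low is bad
]

def evaluate(metrics: Dict[str, float]) -> List[str]:
    """Return the names of rules breached by the current metric values."""
    return [
        rule.name
        for rule in RULES
        if rule.name in metrics and rule.breached(metrics[rule.name], rule.threshold)
    ]
```

Keeping the comparison direction in the rule (note `cache_hit_rate` alerts on *low* values) avoids a classic monitoring bug where every metric is assumed to be "higher is worse".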
Alerting Strategy
ALERTING TIERS

CRITICAL (PagerDuty - Wake someone up):
├── Event ingestion completely down
├── Kafka cluster unhealthy
├── Flink job failed and not recovering
├── Dashboard error rate > 5%
└── Data loss detected

HIGH (PagerDuty - Business hours):
├── Consumer lag > 10 minutes
├── Late data rate > 20%
├── Query latency P99 > 10s
└── Trending not updating > 5 minutes

MEDIUM (Slack alert):
├── Validation error rate elevated
├── Cache hit rate dropped
├── Single region ingestion issues
└── Schema registry alerts

LOW (Dashboard only):
├── Cost anomalies
├── Traffic pattern changes
└── Performance trends
Interview Conclusion
Interviewer: "Excellent work. You've covered event ingestion, streaming processing, data modeling, late data handling, and query optimization. You clearly understand the trade-offs involved. Any questions for me?"
You: "Thank you! I'd love to hear about the biggest operational challenges you've faced with the current analytics platform, and whether there are any aspects of my design that you think wouldn't work in your environment."
Summary: Week 8 Concepts Applied
Day 1: Event Ingestion at Scale
| Concept | Application |
|---|---|
| Event schema design | Envelope pattern with versioned payloads |
| Schema evolution | Backward-compatible changes with version field |
| Validation pipeline | Fast validation before Kafka produce |
| Throughput optimization | Batching, async produces, regional collectors |
| Dead letter queues | Route failed events for analysis |
Day 2: Streaming vs Batch Processing
| Concept | Application |
|---|---|
| Lambda architecture | Streaming (Flink) for real-time, Batch (Spark) for accuracy |
| Watermarks | 5-minute allowed lateness for streaming |
| Window aggregations | 1-hour tumbling windows for trending |
| Stateful processing | HyperLogLog for unique viewer counts |
| Kappa consideration | Could simplify if reprocessing time acceptable |
Day 3: Data Modeling for Analytics
| Concept | Application |
|---|---|
| Star schema | fact_daily_views with 5 dimension tables |
| SCD Type 2 | Content dimension for metadata changes |
| Pre-aggregation | agg_daily_content, agg_weekly_content rollups |
| Partitioning | By date with clustering on content_id |
| Grain definition | Daily × content × viewer_segment × device × geo |
Day 4: Late-Arriving Data
| Concept | Application |
|---|---|
| Event time vs processing time | Client event_time for analytics, server received_time for lateness |
| Watermark strategies | 5-minute streaming, 24-hour preliminary, 7-day final |
| Versioned results | Real-time, Preliminary (T+1), Final (T+7) |
| Reprocessing jobs | Daily preliminary job, weekly final job |
| Client-side buffering | SQLite persistence on mobile devices |
Day 5: Query Layer and Optimization
| Concept | Application |
|---|---|
| OLAP database selection | ClickHouse for real-time, BigQuery for historical |
| Multi-layer caching | Redis for dashboard results, CDN for assets |
| Pre-aggregation strategy | Daily/weekly rollups for common queries |
| Multi-tenancy | Creator-specific caches, query isolation |
| Cost optimization | Sampling heartbeats, tiered storage, rollups |
Code Patterns Demonstrated
1. HIGH-THROUGHPUT INGESTION
   ├── Async batched produces
   ├── Schema validation with compiled validators
   └── Fire-and-forget with acknowledgment

2. STREAMING AGGREGATION
   ├── Windowed unique counts (HyperLogLog)
   ├── Velocity calculation for trending
   └── State management with cleanup

3. LATE DATA VERSIONING
   ├── Multiple freshness levels
   ├── Scheduled reprocessing jobs
   └── Lateness histogram tracking

4. STAR SCHEMA MODELING
   ├── Fact and dimension tables
   ├── Pre-aggregated rollups
   └── Efficient query patterns

5. QUERY CACHING
   ├── TTL based on query type
   ├── Cache warming on login
   └── Invalidation on data update
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Design event schemas that evolve safely at billion-event scale
- Build high-throughput ingestion pipelines with validation and routing
- Implement real-time aggregations with windowing and watermarks
- Calculate trending scores with velocity and normalization
- Handle late-arriving data from offline mobile devices
- Implement versioned results (real-time, preliminary, final)
- Design star schemas for analytics workloads
- Create pre-aggregation strategies for common query patterns
- Implement multi-layer caching for dashboards
- Scale analytics systems 10x with cost efficiency
- Monitor data pipelines with appropriate alerting
- Discuss trade-offs between freshness, accuracy, and cost
This capstone integrates all concepts from Week 8: Analytics Pipeline of the System Design Mastery Series.
Total Week 8 Content: Preview + 5 Daily Documents + Capstone = ~14,000 lines