Himanshu Kukreja

Week 8 Preview: Analytics Pipeline

System Design Mastery Series — Building Blocks Week


Week Overview

┌────────────────────────────────────────────────────────────────────────────┐
│                                                                            │
│                    WEEK 8: ANALYTICS PIPELINE                              │
│                                                                            │
│          "From Raw Events to Actionable Insights"                          │
│                                                                            │
│  You've built systems that SERVE data (search, APIs, notifications).       │
│  Now you'll build systems that UNDERSTAND data.                            │
│                                                                            │
│  This week: Design the infrastructure that powers dashboards,              │
│  business intelligence, ML features, and data-driven decisions.            │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

The Problem We're Solving

THE ANALYTICS CHALLENGE

Your e-commerce platform generates events:
├── Page views: 50M/day
├── Clicks: 20M/day
├── Searches: 10M/day
├── Add to cart: 5M/day
├── Purchases: 500K/day
└── User actions: 100M/day total

Business needs answers:
├── "What's our conversion rate this hour?"
├── "Which products are trending right now?"
├── "What's the average order value by region?"
├── "Show me a funnel from search → view → cart → purchase"
├── "Which marketing campaigns drove revenue yesterday?"
├── "What's our real-time revenue for the flash sale?"

The challenge:
├── Volume: 100M+ events/day
├── Velocity: 1,000+ events/second
├── Variety: Different event types, schemas evolving
├── Latency: Some queries need real-time, others can wait
├── Accuracy: Business decisions depend on correct numbers
├── Cost: Can't afford to store and query everything naively

This week teaches you to build the pipeline that makes this possible.

What You'll Learn

Day 1: Event Ingestion at Scale

THEME: "Capturing everything without dropping anything"

TOPICS:
├── Event schema design and evolution
├── Ingestion patterns (push vs pull)
├── Kafka as the event backbone
├── Handling ingestion spikes
├── Event validation and enrichment
└── Exactly-once delivery guarantees

KEY QUESTION:
"How do you reliably capture 100M events/day
 without losing data or blocking producers?"

SYSTEM DESIGN:
Design an event ingestion layer for a ride-sharing app
tracking rides, driver locations, payments, and ratings.
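
To make the key question concrete, here's a minimal producer sketch, assuming the confluent-kafka Python client: sends are asynchronous so producers never block, acks="all" plus idempotence guards against loss and duplicates on retry, and the delivery callback is where you'd divert failures to a dead letter path.

import json
from confluent_kafka import Producer  # assumed client library

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                # wait for all in-sync replicas: no silent loss
    "enable.idempotence": True,   # retries cannot create duplicate records
    "linger.ms": 10,              # small batching window for throughput
})

def on_delivery(err, msg):
    if err is not None:
        # In production: log and divert to a dead letter queue, never drop
        print(f"delivery failed: {err}")

def emit(event: dict) -> None:
    # Non-blocking: the send is buffered and acknowledged via the callback
    producer.produce(
        "raw-events",
        key=event["user_id"].encode(),
        value=json.dumps(event).encode(),
        on_delivery=on_delivery,
    )
    producer.poll(0)  # serve any pending delivery callbacks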

Day 2: Streaming vs Batch Processing

THEME: "Real-time insights vs accurate aggregates"

TOPICS:
├── Lambda Architecture (batch + streaming)
├── Kappa Architecture (streaming only)
├── Stream processing with Flink/Spark Streaming
├── Batch processing with Spark
├── When to use which approach
└── Handling state in streaming

KEY QUESTION:
"When do you need real-time dashboards
 vs when can you wait for hourly batch jobs?"

SYSTEM DESIGN:
Build both real-time and batch pipelines for
computing metrics like DAU, conversion rate, and revenue.

Day 3: Data Modeling for Analytics

THEME: "Designing for queries, not transactions"

TOPICS:
├── Star schema and snowflake schema
├── Fact tables and dimension tables
├── Slowly changing dimensions (SCD)
├── Pre-aggregation strategies
├── Columnar storage (Parquet, ORC)
└── Partitioning for query performance

KEY QUESTION:
"How do you model data so a query across
 1 billion rows returns in under 5 seconds?"

SYSTEM DESIGN:
Design the data warehouse schema for an
e-commerce analytics platform.

Day 4: Late-Arriving Data and Correctness

THEME: "When 'real-time' isn't really real-time"

TOPICS:
├── Event time vs processing time
├── Watermarks and allowed lateness
├── Windowing strategies
├── Handling out-of-order events
├── Reprocessing and backfill
└── Exactly-once semantics in practice

KEY QUESTION:
"What happens when an event from 2 hours ago
 suddenly arrives? How do you update the dashboard?"

SYSTEM DESIGN:
Handle late-arriving mobile app events with
offline usage and poor connectivity.

Day 5: Query Layer and Optimization

THEME: "Making big data feel small"

TOPICS:
├── OLAP databases (ClickHouse, Druid, BigQuery)
├── Materialized views and rollups
├── Query optimization techniques
├── Caching strategies for dashboards
├── Cost management (storage vs compute)
└── Multi-tenancy in analytics

KEY QUESTION:
"How do you make a dashboard query over 1 year
 of data return in milliseconds?"

SYSTEM DESIGN:
Build the query layer for a multi-tenant
SaaS analytics platform.

Capstone: Real-Time Analytics for a Streaming Platform

SCENARIO:
Design the analytics infrastructure for a video streaming
platform (like Netflix or Hotstar).

REQUIREMENTS:
├── Track video starts, pauses, seeks, completions
├── Real-time "what's trending now" dashboard (sketched below)
├── Historical viewing patterns for recommendations
├── Content performance metrics for creators
├── A/B test analysis for UI experiments
└── Handle 10M concurrent viewers during cricket finals

This capstone integrates:
├── Event ingestion at massive scale
├── Real-time streaming aggregations
├── Batch processing for historical analysis
├── Data modeling for flexible queries
├── Late-arriving data from mobile apps
└── Query optimization for dashboards
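
As a taste of the "trending now" requirement, here is a toy in-memory sliding-window counter in Python. A real deployment would compute this in the streaming layer (sharded per partition, often with approximate top-K structures), but the windowing logic is the same idea.

import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SEC = 300  # "trending now" = most-played titles in the last 5 minutes

_plays = defaultdict(deque)  # video_id -> timestamps of recent play events

def record_play(video_id: str, ts: Optional[float] = None) -> None:
    _plays[video_id].append(ts if ts is not None else time.time())

def trending(k: int = 10, now: Optional[float] = None) -> list:
    now = now if now is not None else time.time()
    counts = {}
    for vid, q in _plays.items():
        while q and q[0] < now - WINDOW_SEC:
            q.popleft()               # evict plays that fell out of the window
        if q:
            counts[vid] = len(q)
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]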

Why This Week Matters

ANALYTICS IS EVERYWHERE

Every modern system needs analytics:
├── E-commerce: Conversion funnels, revenue tracking
├── Social Media: Engagement metrics, content performance
├── FinTech: Transaction monitoring, fraud detection
├── Gaming: Player behavior, monetization
├── SaaS: Usage metrics, churn prediction
└── Streaming: Viewing patterns, recommendations

As a backend engineer, you'll either:
├── Build analytics pipelines
├── Integrate with analytics systems
├── Debug why "the numbers don't match"
└── Optimize queries that are too slow

This week gives you the mental models to handle all of these.

Architecture Overview

┌───────────────────────────────────────────────────────────────────────────┐
│                    ANALYTICS PIPELINE ARCHITECTURE                        │
│                                                                           │
│  ┌────────────────────────────────────────────────────────────────────┐   │
│  │                         DATA SOURCES                               │   │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐   │   │
│  │  │  Web    │  │ Mobile  │  │  API    │  │  IoT    │  │  DB     │   │   │
│  │  │ Events  │  │  SDK    │  │  Logs   │  │ Sensors │  │  CDC    │   │   │
│  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘   │   │
│  └───────┼────────────┼────────────┼────────────┼────────────┼────────┘   │
│          │            │            │            │            │            │
│          └────────────┴─────┬──────┴────────────┴────────────┘            │
│                             ▼                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐  │
│  │                      INGESTION LAYER                                │  │
│  │                    (Kafka / Kinesis)                                │  │
│  │  ┌─────────────────────────────────────────────────────────────┐    │  │
│  │  │  Raw Events Topic  │  Validated Topic  │  Dead Letter Queue │    │  │
│  │  └─────────────────────────────────────────────────────────────┘    │  │
│  └─────────────────────────────┬───────────────────────────────────────┘  │
│                                │                                          │
│          ┌─────────────────────┼─────────────────────┐                    │
│          │                     │                     │                    │
│          ▼                     ▼                     ▼                    │
│  ┌───────────────┐    ┌───────────────┐    ┌───────────────┐              │
│  │   STREAMING   │    │    BATCH      │    │   RAW LAKE    │              │
│  │   PROCESSOR   │    │   PROCESSOR   │    │   (S3/GCS)    │              │
│  │  (Flink/KS)   │    │   (Spark)     │    │   Parquet     │              │
│  └───────┬───────┘    └───────┬───────┘    └───────────────┘              │
│          │                    │                                           │
│          ▼                    ▼                                           │
│  ┌───────────────┐    ┌───────────────┐                                   │
│  │  REAL-TIME    │    │   WAREHOUSE   │                                   │
│  │    STORE      │    │  (Snowflake/  │                                   │
│  │  (Druid/      │    │   BigQuery)   │                                   │
│  │   ClickHouse) │    │               │                                   │
│  └───────┬───────┘    └───────┬───────┘                                   │
│          │                    │                                           │
│          └─────────┬──────────┘                                           │
│                    ▼                                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐  │
│  │                        QUERY LAYER                                  │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │  │
│  │  │ Dashboards  │  │   BI Tools  │  │ ML Pipelines│                  │  │
│  │  │  (Grafana)  │  │  (Tableau)  │  │ (Features)  │                  │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘                  │  │
│  └─────────────────────────────────────────────────────────────────────┘  │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘

Key Concepts We'll Cover

1. Event Schema Design

EVENT SCHEMA PRINCIPLES

Every event needs:
├── event_id: Unique identifier (for deduplication)
├── event_type: What happened
├── event_time: When it happened (client timestamp)
├── received_time: When we received it (server timestamp)
├── user_id: Who did it (if known)
├── session_id: Which session
├── device_info: Platform, version, etc.
└── payload: Event-specific data

Schema evolution:
├── Add fields: Safe (default values)
├── Remove fields: Dangerous (breaks consumers)
├── Rename fields: Very dangerous
└── Solution: Schema registry, versioning
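
A sketch of that envelope as a Python dataclass, using the field names from the list above; the schema_version field is my addition, to make payload evolution explicit:

import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    event_type: str                    # what happened, e.g. "add_to_cart"
    session_id: str
    device_info: dict                  # platform, app version, ...
    payload: dict                      # event-specific data
    user_id: Optional[str] = None      # None for anonymous users
    schema_version: int = 1            # assumed field: bump when payload shape changes
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # dedup key
    event_time: float = field(default_factory=time.time)  # client clock, may drift
    received_time: Optional[float] = None  # stamped server-side at ingestion

Consumers deduplicate on event_id, and the gap between event_time and received_time is exactly the late-data problem Day 4 tackles.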

2. Lambda vs Kappa Architecture

LAMBDA ARCHITECTURE

        ┌─────────────────┐
        │   Raw Events    │
        └────────┬────────┘
                 │
      ┌──────────┴──────────┐
      ▼                     ▼
┌──────────┐          ┌──────────┐
│  Batch   │          │  Speed   │
│  Layer   │          │  Layer   │
│ (Spark)  │          │ (Flink)  │
└────┬─────┘          └────┬─────┘
     │                     │
     ▼                     ▼
┌──────────┐          ┌──────────┐
│  Batch   │          │  Real-   │
│  Views   │          │  Time    │
│ (hourly) │          │  Views   │
└────┬─────┘          └────┬─────┘
     │                     │
     └──────────┬──────────┘
                ▼
         ┌──────────┐
         │  Query   │
         │  Layer   │
         └──────────┘

Pros: Accurate batch, fast real-time
Cons: Two codebases, complexity


KAPPA ARCHITECTURE

        ┌─────────────────┐
        │   Raw Events    │
        │   (Immutable)   │
        └────────┬────────┘
                 │
                 ▼
        ┌─────────────────┐
        │    Streaming    │
        │    Processor    │
        │    (Flink)      │
        └────────┬────────┘
                 │
      ┌──────────┴──────────┐
      ▼                     ▼
┌──────────┐          ┌──────────┐
│  Real-   │          │ Serving  │
│  Time    │          │  Layer   │
│  Views   │          │ (tables) │
└──────────┘          └──────────┘

Pros: Single codebase, simpler
Cons: Reprocessing is harder
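
The serving-layer trick in Lambda is worth seeing concretely: accurate batch totals up to a cutoff, fresh streaming increments after it. A toy sketch in Python, where the stores and the cutoff value are stand-ins:

# Hypothetical stores: batch_view holds hourly recomputed totals,
# speed_view holds per-minute streaming increments since the last batch run.
batch_view = {("revenue", "2024-06-01T10:00"): 125_000.0}
speed_view = {("revenue", "2024-06-01T10:07"): 950.0,
              ("revenue", "2024-06-01T10:08"): 1_210.0}
LAST_BATCH_CUTOFF = "2024-06-01T10:00"

def merged_metric(name: str) -> float:
    """Lambda serving layer: accurate batch total up to the cutoff,
    plus fresh (possibly approximate) streaming increments after it."""
    total = sum(v for (m, t), v in batch_view.items()
                if m == name and t <= LAST_BATCH_CUTOFF)
    total += sum(v for (m, t), v in speed_view.items()
                 if m == name and t > LAST_BATCH_CUTOFF)
    return total

print(merged_metric("revenue"))  # 127160.0

When the next batch run lands, its recomputed totals silently replace the streaming approximations, which is why Lambda gets both accuracy and freshness at the cost of two codebases.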

3. Star Schema

STAR SCHEMA FOR E-COMMERCE

                    ┌─────────────────┐
                    │   dim_product   │
                    │  ─────────────  │
                    │  product_id     │
                    │  name           │
                    │  category       │
                    │  brand          │
                    │  price          │
                    └────────┬────────┘
                             │
┌─────────────────┐          │          ┌─────────────────┐
│    dim_time     │          │          │   dim_customer  │
│  ─────────────  │          │          │  ─────────────  │
│  time_id        │          │          │  customer_id    │
│  date           │          │          │  name           │
│  hour           │    ┌─────┴─────┐    │  segment        │
│  day_of_week    │◄───┤fact_sales │───►│  region         │
│  month          │    ├───────────┤    │  signup_date    │
│  quarter        │    │ sale_id   │    └─────────────────┘
│  year           │    │ time_id   │
│  is_weekend     │    │ product_id│          ┌─────────────────┐
└─────────────────┘    │customer_id│          │  dim_location   │
                       │location_id│──────────│  ─────────────  │
                       │ quantity  │          │  location_id    │
                       │ revenue   │          │  city           │
                       │ discount  │          │  state          │
                       │ profit    │          │  country        │
                       └───────────┘          │  region         │
                                              └─────────────────┘

FACT TABLE: Measures (revenue, quantity)
DIMENSION TABLES: Context (who, what, when, where)

Query: "Revenue by category by month"
JOINs: fact_sales → dim_product and fact_sales → dim_time
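
Written against the schema above (table and column names from the diagram), that query might look like this, shown here as a string you would hand to your warehouse client:

REVENUE_BY_CATEGORY_BY_MONTH = """
    SELECT p.category, t.year, t.month, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time    t ON f.time_id    = t.time_id
    GROUP BY p.category, t.year, t.month
    ORDER BY t.year, t.month, revenue DESC
"""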

4. Handling Late Data

LATE-ARRIVING DATA PROBLEM

Expected timeline:
  t=0: Event occurs on mobile app
  t=1s: Event sent to server
  t=2s: Event processed

Reality:
  t=0: Event occurs (user in subway, no signal)
  t=30min: User exits subway, event finally sent
  t=30min+2s: Event processed (30 minutes late!)

The problem:
  Dashboard at t=5min showed: 1,000 events
  Dashboard at t=35min shows: 1,050 events
  User asks: "Why did the number change?"

Solutions:
  1. WATERMARKS: "Process events up to X minutes late"
  2. REPROCESSING: Recompute historical windows
  3. VERSIONING: Show "as of" timestamp with data
  4. CORRECTIONS: Separate late data into adjustment tables
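
A hand-rolled sketch of solutions 1 and 4 in Python. Real pipelines would lean on Flink's watermark machinery, but the bookkeeping looks like this; the WINDOW and ALLOWED_LATENESS values are arbitrary:

from collections import defaultdict

WINDOW = 60              # tumbling windows of 60 seconds, keyed by event time
ALLOWED_LATENESS = 300   # windows stay open 5 minutes behind the newest event

windows = defaultdict(int)   # window_start -> live event count
corrections = []             # (window_start, delta) rows for an adjustment table
watermark = 0                # newest event_time seen, minus allowed lateness

def on_event(event_time: int) -> None:
    global watermark
    start = event_time - (event_time % WINDOW)
    if start + WINDOW > watermark:
        windows[start] += 1              # window still open: update the live view
    else:
        corrections.append((start, 1))   # window already closed: emit a correction
    watermark = max(watermark, event_time - ALLOWED_LATENESS)

for t in [10, 75, 130, 20, 4000, 30]:
    on_event(t)
# The event at t=30 arrives after its window closed: it lands in
# corrections, so the dashboard's history can be adjusted, not rewritten.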

5. Query Optimization

QUERY OPTIMIZATION STRATEGIES

PROBLEM:
  SELECT date, SUM(revenue)
  FROM sales
  WHERE date BETWEEN '2024-01-01' AND '2024-12-31'
  GROUP BY date

With 1B rows, this takes 60 seconds. Unacceptable for a dashboard.

SOLUTIONS:

1. PARTITIONING
   Store data partitioned by date
   Query only reads relevant partitions
   1B rows → ~3M rows for a single day

2. PRE-AGGREGATION
   Maintain daily_revenue table
   Updated by streaming/batch job
   Query reads 365 pre-computed rows

3. COLUMNAR STORAGE
   Only read 'date' and 'revenue' columns
   Ignore the other 50 columns
   90% less I/O

4. MATERIALIZED VIEWS
   Precompute common queries
   Refresh periodically or incrementally

5. CACHING
   Cache dashboard queries
   Invalidate on new data

Result: 60 seconds → 200 milliseconds
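
Strategies 2 and 5 combined, as a sketch: the daily_revenue table and the run_sql client are assumptions, and the point is that the dashboard path never touches raw events.

import time

_cache = {}
CACHE_TTL = 60  # seconds; dashboards usually tolerate slightly stale data

PREAGG_SQL = """
    SELECT date, revenue
    FROM daily_revenue   -- 365 pre-computed rows instead of 1B raw rows
    WHERE date BETWEEN '2024-01-01' AND '2024-12-31'
"""

def dashboard_revenue(run_sql):
    """Serve from cache when fresh; otherwise read the pre-aggregated table."""
    hit = _cache.get(PREAGG_SQL)
    if hit and time.time() - hit[0] < CACHE_TTL:
        return hit[1]
    rows = run_sql(PREAGG_SQL)            # run_sql: your warehouse client
    _cache[PREAGG_SQL] = (time.time(), rows)
    return rows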

Connections to Previous Weeks

HOW WEEK 8 BUILDS ON PREVIOUS WEEKS

WEEK 1 (Data at Scale):
├── Partitioning → Partitioned fact tables
├── Replication → Read replicas for queries
└── Storage patterns → Columnar vs row storage

WEEK 2 (Failure-First):
├── Idempotency → Exactly-once event processing
├── Retries → Handling failed batch jobs
└── Circuit breakers → Query timeouts

WEEK 3 (Messaging):
├── Kafka → Event streaming backbone
├── Backpressure → Handling ingestion spikes
└── Dead letters → Failed event processing

WEEK 4 (Caching):
├── Cache strategies → Dashboard query caching
├── Invalidation → Cache refresh on new data
└── TTL → Time-based cache expiration

WEEK 7 (Search):
├── Indexing → Pre-aggregated indices
├── Query optimization → OLAP query patterns
└── Cluster operations → Data warehouse ops

Technologies We'll Discuss

TECHNOLOGY LANDSCAPE

EVENT STREAMING:
├── Apache Kafka (industry standard)
├── AWS Kinesis (managed)
├── Apache Pulsar (emerging)
└── Google Pub/Sub (GCP)

STREAM PROCESSING:
├── Apache Flink (most powerful)
├── Kafka Streams (simpler, Kafka-native)
├── Apache Spark Streaming (micro-batch)
└── AWS Kinesis Analytics (managed)

BATCH PROCESSING:
├── Apache Spark (industry standard)
├── Google Dataflow/Beam (unified batch+stream)
├── dbt (SQL-based transformations)
└── AWS Glue (managed ETL)

DATA WAREHOUSES:
├── Snowflake (cloud-native, popular)
├── Google BigQuery (serverless, powerful)
├── Amazon Redshift (AWS native)
├── Databricks (Spark-based)
└── ClickHouse (open source, fast)

REAL-TIME OLAP:
├── Apache Druid (real-time analytics)
├── ClickHouse (fast queries)
├── Apache Pinot (LinkedIn-created)
└── StarRocks (emerging)

DATA LAKE:
├── Delta Lake (ACID on S3)
├── Apache Iceberg (table format)
├── Apache Hudi (streaming to lake)
└── S3/GCS + Parquet (basic)

Week 8 Daily Schedule

Day        Theme                  Key Deliverable
─────────  ─────────────────────  ──────────────────────────────────────
Day 1      Event Ingestion        Reliable event capture at 100K/sec
Day 2      Stream vs Batch        Lambda/Kappa architecture decision
Day 3      Data Modeling          Star schema for e-commerce analytics
Day 4      Late Data              Watermarks and reprocessing strategy
Day 5      Query Layer            Sub-second dashboard queries
Capstone   Streaming Analytics    Complete analytics for a video platform

Success Criteria

By the end of Week 8, you should be able to:

WEEK 8 LEARNING OUTCOMES

□ Design event schemas that evolve safely
□ Choose between Lambda and Kappa architectures
□ Build data models optimized for analytical queries
□ Handle late-arriving data without breaking dashboards
□ Optimize queries over billions of rows
□ Estimate storage and compute costs for analytics
□ Explain trade-offs between real-time and batch accuracy
□ Design multi-tenant analytics platforms
□ Debug "the numbers don't match" problems
□ Interview confidently on analytics system design

Getting Ready

BEFORE STARTING WEEK 8

Concepts to refresh:
├── Kafka basics (topics, partitions, consumer groups)
├── SQL aggregations (GROUP BY, window functions)
├── Distributed computing basics (map-reduce mental model)
└── Cloud storage (S3, object storage patterns)

Mental models to have:
├── Write path vs Read path optimization
├── Storage vs Compute trade-offs
├── Consistency vs Latency trade-offs
└── Accuracy vs Speed trade-offs

Questions to ponder:
├── "Why can't I just use PostgreSQL for analytics?"
├── "Why do companies have separate OLTP and OLAP systems?"
├── "How does Google Analytics work at scale?"
└── "Why are data warehouses so expensive?"

Let's Begin!

WEEK 8 STARTS NOW

"Data is the new oil, but like oil,
 it's useless if you can't refine it."

We're building the refinery.

Week 8 Preview Complete

Next: Day 1 — Event Ingestion at Scale