Week Preview
Week 8 Preview: Analytics Pipeline
System Design Mastery Series — Building Blocks Week
Week Overview
┌────────────────────────────────────────────────────────────────────────────┐
│ │
│ WEEK 8: ANALYTICS PIPELINE │
│ │
│ "From Raw Events to Actionable Insights" │
│ │
│ You've built systems that SERVE data (search, APIs, notifications). │
│ Now you'll build systems that UNDERSTAND data. │
│ │
│ This week: Design the infrastructure that powers dashboards, │
│ business intelligence, ML features, and data-driven decisions. │
│ │
└────────────────────────────────────────────────────────────────────────────┘
The Problem We're Solving
THE ANALYTICS CHALLENGE
Your e-commerce platform generates events:
├── Page views: 50M/day
├── Clicks: 20M/day
├── Searches: 10M/day
├── Add to cart: 5M/day
├── Purchases: 500K/day
└── All user actions combined: ~100M/day
Business needs answers:
├── "What's our conversion rate this hour?"
├── "Which products are trending right now?"
├── "What's the average order value by region?"
├── "Show me a funnel from search → view → cart → purchase"
├── "Which marketing campaigns drove revenue yesterday?"
├── "What's our real-time revenue for the flash sale?"
The challenge:
├── Volume: 100M+ events/day
├── Velocity: 1,000+ events/second
├── Variety: Different event types, schemas evolving
├── Latency: Some queries need real-time, others can wait
├── Accuracy: Business decisions depend on correct numbers
├── Cost: Can't afford to store and query everything naively
This week teaches you to build the pipeline that makes this possible.
What You'll Learn
Day 1: Event Ingestion at Scale
THEME: "Capturing everything without dropping anything"
TOPICS:
├── Event schema design and evolution
├── Ingestion patterns (push vs pull)
├── Kafka as the event backbone
├── Handling ingestion spikes
├── Event validation and enrichment
├── Exactly-once delivery guarantees
KEY QUESTION:
"How do you reliably capture 100M events/day
without losing data or blocking producers?"
SYSTEM DESIGN:
Design an event ingestion layer for a ride-sharing app
tracking rides, driver locations, payments, and ratings.
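To make Day 1's key question concrete, here is a minimal producer-side sketch. It assumes the confluent-kafka Python client and an illustrative ride_events topic; the point is combining idempotent, acks=all production with an asynchronous delivery callback so producers neither block nor silently drop events.

```python
# Minimal sketch: reliable, non-blocking event production.
# Assumes the confluent-kafka client and a "ride_events" topic (illustrative names).
import json
import time
import uuid

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,   # broker de-duplicates producer retries
    "acks": "all",                # wait for all in-sync replicas
    "linger.ms": 20,              # small batching window for throughput
    "compression.type": "lz4",
})

def on_delivery(err, msg):
    # Called asynchronously; the producer never blocks waiting for broker acks.
    if err is not None:
        # In a real pipeline: log and spill to a local file / dead-letter path.
        print(f"delivery failed: {err}")

def emit(event_type: str, user_id: str, payload: dict) -> None:
    event = {
        "event_id": str(uuid.uuid4()),   # for downstream deduplication
        "event_type": event_type,
        "event_time": time.time(),       # client timestamp
        "user_id": user_id,
        "payload": payload,
    }
    # Keying by user_id keeps one user's events ordered within a partition.
    producer.produce("ride_events", key=user_id,
                     value=json.dumps(event), on_delivery=on_delivery)
    producer.poll(0)  # serve delivery callbacks without blocking

emit("ride_requested", "user-42", {"pickup": "12.97,77.59"})
producer.flush()  # drain buffers on shutdown
```

Where the callback reports a permanent failure, the event would be spilled locally or routed to a dead-letter topic rather than dropped.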
Day 2: Streaming vs Batch Processing
THEME: "Real-time insights vs accurate aggregates"
TOPICS:
├── Lambda Architecture (batch + streaming)
├── Kappa Architecture (streaming only)
├── Stream processing with Flink/Spark Streaming
├── Batch processing with Spark
├── When to use which approach
├── Handling state in streaming
KEY QUESTION:
"When do you need real-time dashboards
vs when can you wait for hourly batch jobs?"
SYSTEM DESIGN:
Build both real-time and batch pipelines for
computing metrics like DAU, conversion rate, and revenue.
Day 3: Data Modeling for Analytics
THEME: "Designing for queries, not transactions"
TOPICS:
├── Star schema and snowflake schema
├── Fact tables and dimension tables
├── Slowly changing dimensions (SCD)
├── Pre-aggregation strategies
├── Columnar storage (Parquet, ORC)
├── Partitioning for query performance
KEY QUESTION:
"How do you model data so a query across
1 billion rows returns in under 5 seconds?"
SYSTEM DESIGN:
Design the data warehouse schema for an
e-commerce analytics platform.
Day 4: Late-Arriving Data and Correctness
THEME: "When 'real-time' isn't really real-time"
TOPICS:
├── Event time vs processing time
├── Watermarks and allowed lateness
├── Windowing strategies
├── Handling out-of-order events
├── Reprocessing and backfill
├── Exactly-once semantics in practice
KEY QUESTION:
"What happens when an event from 2 hours ago
suddenly arrives? How do you update the dashboard?"
SYSTEM DESIGN:
Handle late-arriving mobile app events with
offline usage and poor connectivity.
Day 5: Query Layer and Optimization
THEME: "Making big data feel small"
TOPICS:
├── OLAP databases (ClickHouse, Druid, BigQuery)
├── Materialized views and rollups
├── Query optimization techniques
├── Caching strategies for dashboards
├── Cost management (storage vs compute)
├── Multi-tenancy in analytics
KEY QUESTION:
"How do you make a dashboard query over 1 year
of data return in milliseconds?"
SYSTEM DESIGN:
Build the query layer for a multi-tenant
SaaS analytics platform.
Capstone: Real-Time Analytics for a Streaming Platform
SCENARIO:
Design the analytics infrastructure for a video streaming
platform (like Netflix or Hotstar).
REQUIREMENTS:
├── Track video starts, pauses, seeks, completions
├── Real-time "what's trending now" dashboard
├── Historical viewing patterns for recommendations
├── Content performance metrics for creators
├── A/B test analysis for UI experiments
├── Handle 10M concurrent viewers during cricket finals
This capstone integrates:
├── Event ingestion at massive scale
├── Real-time streaming aggregations
├── Batch processing for historical analysis
├── Data modeling for flexible queries
├── Late-arriving data from mobile apps
├── Query optimization for dashboards
Why This Week Matters
ANALYTICS IS EVERYWHERE
Every modern system needs analytics:
├── E-commerce: Conversion funnels, revenue tracking
├── Social Media: Engagement metrics, content performance
├── FinTech: Transaction monitoring, fraud detection
├── Gaming: Player behavior, monetization
├── SaaS: Usage metrics, churn prediction
├── Streaming: Viewing patterns, recommendations
As a backend engineer, you'll either:
├── Build analytics pipelines
├── Integrate with analytics systems
├── Debug why "the numbers don't match"
└── Optimize queries that are too slow
This week gives you the mental models to handle all of these.
Architecture Overview
┌───────────────────────────────────────────────────────────────────────────┐
│ ANALYTICS PIPELINE ARCHITECTURE │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ DATA SOURCES │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Web │ │ Mobile │ │ API │ │ IoT │ │ DB │ │ │
│ │ │ Events │ │ SDK │ │ Logs │ │ Sensors │ │ CDC │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ └───────┼────────────┼────────────┼────────────┼────────────┼────────┘ │
│ │ │ │ │ │ │
│ └────────────┴─────┬──────┴────────────┴────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ INGESTION LAYER │ │
│ │ (Kafka / Kinesis) │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ Raw Events Topic │ Validated Topic │ Dead Letter Queue │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ STREAMING │ │ BATCH │ │ RAW LAKE │ │
│ │ PROCESSOR │ │ PROCESSOR │ │ (S3/GCS) │ │
│ │ (Flink/KS) │ │ (Spark) │ │ Parquet │ │
│ └───────┬───────┘ └───────┬───────┘ └───────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ REAL-TIME │ │ WAREHOUSE │ │
│ │ STORE │ │ (Snowflake/ │ │
│ │ (Druid/ │ │ BigQuery) │ │
│ │ ClickHouse) │ │ │ │
│ └───────┬───────┘ └───────┬───────┘ │
│ │ │ │
│ └─────────┬──────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ QUERY LAYER │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Dashboards │ │ BI Tools │ │ ML Pipelines│ │ │
│ │ │ (Grafana) │ │ (Tableau) │ │ (Features) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────────┘
Key Concepts We'll Cover
1. Event Schema Design
EVENT SCHEMA PRINCIPLES
Every event needs:
├── event_id: Unique identifier (for deduplication)
├── event_type: What happened
├── event_time: When it happened (client timestamp)
├── received_time: When we received it (server timestamp)
├── user_id: Who did it (if known)
├── session_id: Which session
├── device_info: Platform, version, etc.
├── payload: Event-specific data
Schema evolution:
├── Add fields: Safe (default values)
├── Remove fields: Dangerous (breaks existing consumers)
├── Rename fields: Very dangerous
└── Solution: Schema registry + explicit versioning (sketched below)
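A sketch of these principles in code, using the fields from the list above. The envelope shape and validation rules are illustrative, not a particular schema-registry API.

```python
# Illustrative event envelope plus forward-compatible validation sketch.
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional

SCHEMA_VERSION = 2  # bump whenever the contract changes

@dataclass
class Event:
    event_type: str
    user_id: Optional[str]
    session_id: str
    device_info: dict[str, Any]
    payload: dict[str, Any]
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_time: float = field(default_factory=time.time)  # client clock
    received_time: Optional[float] = None                 # set at the server
    schema_version: int = SCHEMA_VERSION

def validate(raw: dict) -> Event:
    """Accept older versions (missing new fields get defaults); reject future ones."""
    version = raw.get("schema_version", 1)
    if version > SCHEMA_VERSION:
        raise ValueError("producer is ahead of this consumer; route to dead-letter topic")
    raw.setdefault("device_info", {})  # field added in v2: defaulted, never required
    return Event(
        event_type=raw["event_type"],
        user_id=raw.get("user_id"),
        session_id=raw["session_id"],
        device_info=raw["device_info"],
        payload=raw.get("payload", {}),
        received_time=time.time(),
        schema_version=version,
    )
```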
2. Lambda vs Kappa Architecture
LAMBDA ARCHITECTURE
┌─────────────────┐
│ Raw Events │
└────────┬────────┘
│
┌──────────┴──────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Batch │ │ Speed │
│ Layer │ │ Layer │
│ (Spark) │ │ (Flink) │
└────┬─────┘ └────┬─────┘
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Batch │ │ Real- │
│ Views │ │ Time │
│ (hourly) │ │ Views │
└────┬─────┘ └────┬─────┘
│ │
└──────────┬──────────┘
▼
┌──────────┐
│ Query │
│ Layer │
└──────────┘
Pros: Accurate batch, fast real-time
Cons: Two codebases, complexity
KAPPA ARCHITECTURE
┌─────────────────┐
│ Raw Events │
│ (Immutable) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Streaming │
│ Processor │
│ (Flink) │
└────────┬────────┘
│
┌──────────┴──────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Real- │ │ Serving │
│ Time │ │ Layer │
│ Views │ │ (tables) │
└──────────┘ └──────────┘
Pros: Single codebase, simpler
Cons: Needs long log retention; historical reprocessing means replaying the stream through a new job version
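The defining move in Lambda is the merge at query time: accurate but stale batch views are patched with the speed layer's recent numbers. A toy sketch, with plain dicts standing in for warehouse tables and OLAP rollups (all names and figures invented):

```python
# Toy Lambda-style serving: merge a precomputed batch view with a real-time tail.

batch_view = {          # recomputed hourly by the batch layer (authoritative)
    "2024-06-01T10": 41_200,
    "2024-06-01T11": 43_950,
}
speed_view = {          # incremented by the streaming layer (fast, approximate)
    "2024-06-01T11": 310,       # late events that arrived after the last batch run
    "2024-06-01T12": 18_400,    # the hour the batch layer hasn't reached yet
}
batch_high_watermark = "2024-06-01T11"  # last hour fully covered by batch

def events_for_hour(hour: str) -> int:
    if hour <= batch_high_watermark:
        # Batch is authoritative; add any late events the speed layer caught since.
        return batch_view.get(hour, 0) + speed_view.get(hour, 0)
    # Batch hasn't caught up yet: real-time numbers only, treated as provisional.
    return speed_view.get(hour, 0)

print(events_for_hour("2024-06-01T11"))  # 44260: batch + late real-time delta
print(events_for_hour("2024-06-01T12"))  # 18400: provisional

# Kappa removes this merge entirely: one streaming job maintains the serving
# tables, and "reprocessing" means replaying the immutable log through a new
# version of that job.
```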
3. Star Schema
STAR SCHEMA FOR E-COMMERCE
┌─────────────────┐
│ dim_product │
│ ───────────── │
│ product_id │
│ name │
│ category │
│ brand │
│ price │
└────────┬────────┘
│
┌─────────────────┐ │ ┌─────────────────┐
│ dim_time │ │ │ dim_customer │
│ ───────────── │ │ │ ───────────── │
│ time_id │ │ │ customer_id │
│ date │ │ │ name │
│ hour │ ┌─────┴─────┐ │ segment │
│ day_of_week │◄───┤fact_sales │───►│ region │
│ month │ ├───────────┤ │ signup_date │
│ quarter │ │ sale_id │ └─────────────────┘
│ year │ │ time_id │
│ is_weekend │ │ product_id│ ┌─────────────────┐
└─────────────────┘ │customer_id│ │ dim_location │
│location_id│──────────│ ───────────── │
│ quantity │ │ location_id │
│ revenue │ │ city │
│ discount │ │ state │
│ profit │ │ country │
└───────────┘ │ region │
└─────────────────┘
FACT TABLE: Measures (revenue, quantity)
DIMENSION TABLES: Context (who, what, when, where)
Query: "Revenue by category by month"
JOIN: fact_sales → dim_product → dim_time
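A runnable miniature of exactly that query pattern, using DuckDB as a local stand-in for the warehouse (table and column names follow the diagram; the rows are invented):

```python
# Star-schema query sketch: "revenue by category by month" over tiny toy tables.
import duckdb

con = duckdb.connect()
for stmt in [
    "CREATE TABLE dim_product (product_id INT, name TEXT, category TEXT)",
    "CREATE TABLE dim_time (time_id INT, day DATE, month INT, year INT)",
    """CREATE TABLE fact_sales (sale_id INT, time_id INT, product_id INT,
                                quantity INT, revenue DECIMAL(10,2))""",
    """INSERT INTO dim_product VALUES (1, 'Headphones', 'Electronics'),
                                      (2, 'Running Shoes', 'Sportswear')""",
    """INSERT INTO dim_time VALUES (100, DATE '2024-01-15', 1, 2024),
                                   (101, DATE '2024-02-03', 2, 2024)""",
    """INSERT INTO fact_sales VALUES (1, 100, 1, 2, 199.98),
                                     (2, 100, 2, 1, 89.00),
                                     (3, 101, 1, 1, 99.99)""",
]:
    con.execute(stmt)

# Measures come from the fact table; each dimension supplies GROUP BY context.
rows = con.execute("""
    SELECT t.year, t.month, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time    t ON f.time_id    = t.time_id
    GROUP BY t.year, t.month, p.category
    ORDER BY t.year, t.month, p.category
""").fetchall()

for year, month, category, revenue in rows:
    print(year, month, category, revenue)  # e.g. 2024 1 Electronics 199.98
```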
4. Handling Late Data
LATE-ARRIVING DATA PROBLEM
Expected timeline:
t=0: Event occurs on mobile app
t=1s: Event sent to server
t=2s: Event processed
Reality:
t=0: Event occurs (user in subway, no signal)
t=30min: User exits subway, event finally sent
t=30min+2s: Event processed (30 minutes late!)
The problem:
Dashboard at t=5min showed: 1,000 events
Dashboard at t=35min shows: 1,050 events
User asks: "Why did the number change?"
Solutions:
1. WATERMARKS + ALLOWED LATENESS: Track event-time progress and accept events up to X minutes late (sketched below)
2. REPROCESSING: Recompute historical windows
3. VERSIONING: Show "as of" timestamp with data
4. CORRECTIONS: Separate late data into adjustment tables
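A framework-free sketch of the watermark, allowed-lateness, and correction ideas. Everything here is illustrative; a real job would lean on Flink or Spark's built-in windowing rather than hand-rolled state.

```python
# Toy event-time windowing with allowed lateness; published windows are
# re-emitted as corrections so dashboards change honestly, not silently.
from collections import defaultdict

WINDOW = 60               # 1-minute tumbling windows, keyed by event_time
ALLOWED_LATENESS = 3600   # accept events up to 1 hour behind the watermark

counts = defaultdict(int)  # window_start -> running event count
emitted = {}               # window_start -> last value shown on the dashboard
watermark = 0.0            # "we believe all events older than this have arrived"

def on_event(event_time: float) -> None:
    global watermark
    # The watermark trails the max event time seen, with a small skew allowance.
    watermark = max(watermark, event_time - 10)

    window_start = int(event_time // WINDOW) * WINDOW
    if event_time < watermark - ALLOWED_LATENESS:
        # Too old to patch in place: record a correction for a later batch job
        # to fold into the warehouse.
        print(f"beyond allowed lateness, recording correction for window {window_start}")
        return

    counts[window_start] += 1
    previous = emitted.get(window_start)
    if previous is not None and previous != counts[window_start]:
        # The window was already published; emit an update instead of hiding it.
        print(f"window {window_start}: correcting {previous} -> {counts[window_start]}")
    emitted[window_start] = counts[window_start]
```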
5. Query Optimization
QUERY OPTIMIZATION STRATEGIES
PROBLEM:
SELECT date, SUM(revenue)
FROM sales
WHERE date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY date
With 1B rows, this takes 60 seconds. Unacceptable for a dashboard.
SOLUTIONS:
1. PARTITIONING
Store data partitioned by date
Query only reads relevant partitions
1B rows → ~3M rows for a single day
2. PRE-AGGREGATION
Maintain daily_revenue table
Updated by streaming/batch job
Query reads 365 pre-computed rows
3. COLUMNAR STORAGE
Only read 'date' and 'revenue' columns
Ignore the other ~50 columns
90% less I/O
4. MATERIALIZED VIEWS
Precompute common queries
Refresh periodically or incrementally
5. CACHING
Cache dashboard queries
Invalidate on new data
Result: 60 seconds → 200 milliseconds
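Of these, pre-aggregation is usually the single biggest win. A minimal sketch of the rollup idea, again with DuckDB standing in for the warehouse; the daily_revenue name mirrors the text, and a real pipeline would upsert only the days touched by new events instead of rebuilding.

```python
# Pre-aggregation sketch: dashboards read a small rollup, not the raw fact table.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE sales (sale_time TIMESTAMP, revenue DECIMAL(10,2))")
con.execute("""INSERT INTO sales VALUES
    (TIMESTAMP '2024-03-01 10:15:00', 40.00),
    (TIMESTAMP '2024-03-01 18:02:00', 25.50),
    (TIMESTAMP '2024-03-02 09:40:00', 99.99)""")

# Maintained by a batch or streaming job (full rebuild here for brevity).
con.execute("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT CAST(sale_time AS DATE) AS day, SUM(revenue) AS revenue, COUNT(*) AS orders
    FROM sales
    GROUP BY 1
""")

# The dashboard query now scans ~365 pre-computed rows per year, not ~1B raw rows.
print(con.execute("""
    SELECT day, revenue
    FROM daily_revenue
    WHERE day BETWEEN DATE '2024-01-01' AND DATE '2024-12-31'
    ORDER BY day
""").fetchall())
```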
Connections to Previous Weeks
HOW WEEK 8 BUILDS ON PREVIOUS WEEKS
WEEK 1 (Data at Scale):
├── Partitioning → Partitioned fact tables
├── Replication → Read replicas for queries
└── Storage patterns → Columnar vs row storage
WEEK 2 (Failure-First):
├── Idempotency → Exactly-once event processing
├── Retries → Handling failed batch jobs
└── Circuit breakers → Query timeouts
WEEK 3 (Messaging):
├── Kafka → Event streaming backbone
├── Backpressure → Handling ingestion spikes
└── Dead letters → Failed event processing
WEEK 4 (Caching):
├── Cache strategies → Dashboard query caching
├── Invalidation → Cache refresh on new data
└── TTL → Time-based cache expiration
WEEK 7 (Search):
├── Indexing → Pre-aggregated indices
├── Query optimization → OLAP query patterns
└── Cluster operations → Data warehouse ops
Technologies We'll Discuss
TECHNOLOGY LANDSCAPE
EVENT STREAMING:
├── Apache Kafka (industry standard)
├── AWS Kinesis (managed)
├── Apache Pulsar (emerging)
└── Google Pub/Sub (GCP)
STREAM PROCESSING:
├── Apache Flink (low latency, rich state and event-time support)
├── Kafka Streams (simpler, Kafka-native)
├── Apache Spark Streaming (micro-batch model)
└── AWS Kinesis Analytics (managed)
BATCH PROCESSING:
├── Apache Spark (industry standard)
├── Google Dataflow/Beam (unified batch+stream)
├── dbt (SQL-based transformations)
└── AWS Glue (managed ETL)
DATA WAREHOUSES:
├── Snowflake (cloud-native, popular)
├── Google BigQuery (serverless, powerful)
├── Amazon Redshift (AWS native)
├── Databricks (Spark-based)
└── ClickHouse (open source, fast)
REAL-TIME OLAP:
├── Apache Druid (real-time analytics)
├── ClickHouse (fast queries)
├── Apache Pinot (LinkedIn-created)
└── StarRocks (emerging)
DATA LAKE:
├── Delta Lake (ACID on S3)
├── Apache Iceberg (table format)
├── Apache Hudi (streaming to lake)
└── S3/GCS + Parquet (basic)
Week 8 Daily Schedule
| Day | Theme | Key Deliverable |
|---|---|---|
| Day 1 | Event Ingestion | Reliable event capture at 100K/sec |
| Day 2 | Stream vs Batch | Lambda/Kappa architecture decision |
| Day 3 | Data Modeling | Star schema for e-commerce analytics |
| Day 4 | Late Data | Watermarks and reprocessing strategy |
| Day 5 | Query Layer | Sub-second dashboard queries |
| Capstone | Streaming Analytics | Complete analytics for video platform |
Success Criteria
By the end of Week 8, you should be able to:
WEEK 8 LEARNING OUTCOMES
□ Design event schemas that evolve safely
□ Choose between Lambda and Kappa architectures
□ Build data models optimized for analytical queries
□ Handle late-arriving data without breaking dashboards
□ Optimize queries over billions of rows
□ Estimate storage and compute costs for analytics
□ Explain trade-offs between real-time and batch accuracy
□ Design multi-tenant analytics platforms
□ Debug "the numbers don't match" problems
□ Interview confidently on analytics system design
Getting Ready
BEFORE STARTING WEEK 8
Concepts to refresh:
├── Kafka basics (topics, partitions, consumer groups)
├── SQL aggregations (GROUP BY, window functions)
├── Distributed computing basics (map-reduce mental model)
└── Cloud storage (S3, object storage patterns)
Mental models to have:
├── Write path vs Read path optimization
├── Storage vs Compute trade-offs
├── Consistency vs Latency trade-offs
└── Accuracy vs Speed trade-offs
Questions to ponder:
├── "Why can't I just use PostgreSQL for analytics?"
├── "Why do companies have separate OLTP and OLAP systems?"
├── "How does Google Analytics work at scale?"
└── "Why are data warehouses so expensive?"
Let's Begin!
WEEK 8 STARTS NOW
"Data is the new oil, but like oil,
it's useless if you can't refine it."
We're building the refinery.
Week 8 Preview Complete
Next: Day 1 — Event Ingestion at Scale