Week 1 Preview: Data at Scale — Storage Trade-offs
System Design Mastery Series
Welcome to Week 1
This week marks the beginning of your transformation from an engineer who uses databases to one who designs storage systems. By Friday, you'll think differently about every database decision you make.
The Big Picture
What This Week Is About
Every backend system you'll ever build has one fundamental problem: where does the data live, and how do you access it efficiently?
This sounds simple until you have:
- 100 million records
- 10,000 requests per second
- Users on three continents
- A requirement for 99.99% uptime
Suddenly, "just use PostgreSQL" isn't an answer—it's the beginning of a hundred questions.
This week, we answer those questions.
The Mindset Shift
Before this week, you might think:
"We need a database. Should we use PostgreSQL or MongoDB?"
After this week, you'll think:
"What are our access patterns? What consistency do we actually need? How will the data grow? What happens when a node fails? Given all that, here's how we should store and partition this data."
That's the difference between a developer and a systems engineer.
What You'll Learn
Core Concepts
1. Partitioning (Sharding)
When one database isn't enough, you split data across multiple machines. But how you split determines everything:
| Strategy | Best For | Breaks When |
|---|---|---|
| Hash Partitioning | Even distribution, key-value lookups | You need range queries |
| Range Partitioning | Time-series, range scans | Sequential keys (e.g. timestamps) send all writes to one partition (hot spot) |
| Directory-Based | Maximum flexibility | Lookup service becomes a bottleneck |
You'll learn when each fails and how to choose.
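To make the table concrete, here is a minimal Python sketch of how a key would be routed under each strategy; the partition count, range boundaries, and function names are made up for illustration.

```python
import hashlib
from bisect import bisect_right

NUM_PARTITIONS = 8

def hash_partition(key: str) -> int:
    """Even distribution, but adjacent keys scatter across partitions,
    so a range query has to fan out to every partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Range partitioning: each partition owns a contiguous slice of the key space
# (here, crudely, by first letter).
RANGE_BOUNDARIES = ["d", "h", "m", "r", "v"]  # partition i owns keys below boundary i

def range_partition(key: str) -> int:
    """Keeps adjacent keys together (great for scans), but sequential keys
    such as timestamps all land on the last partition: the hot spot."""
    return bisect_right(RANGE_BOUNDARIES, key[0].lower())

for k in ["abc123", "abd124", "zzz999"]:
    print(k, "hash ->", hash_partition(k), "| range ->", range_partition(k))
```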
2. Replication
Copies of data for availability and read scaling. Sounds simple, but:
- Synchronous replication: Strong consistency, but one slow replica slows everyone
- Asynchronous replication: Fast writes, but replicas can serve stale data
You'll design systems that handle the trade-offs explicitly.
3. Consistency Models
"Consistency" means different things:
- Strong consistency: Every read sees the latest write
- Eventual consistency: Reads might be stale, but will converge
- Read-your-writes: You see your own writes immediately (others might not)
You'll learn which systems need which level—and why "eventual" is often fine.
4. Hot Keys and Skew
In the real world, data isn't uniform:
- 0.01% of URLs get 90% of traffic
- Celebrity accounts have millions of followers
- Black Friday creates 100x normal load on specific products
You'll design systems that don't fall over when one key goes viral.
5. Consistent Hashing
The elegant answer to "what happens when we add or remove a database server?" You'll understand:
- Why naive modulo hashing causes massive data movement
- How consistent hashing minimizes resharding pain
- When it's worth the complexity (and when it's not)
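Here is a minimal sketch of the idea, assuming MD5 as the hash and made-up server names (s1 through s4); production rings, such as those in Dynamo-style stores, add replication and careful tuning on top of this:

```python
import hashlib
from bisect import bisect

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each server owns many points ('virtual nodes') on a hash ring;
    a key belongs to the first point clockwise from its own hash."""

    def __init__(self, servers, vnodes=100):
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        idx = bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

def modulo_lookup(key: str, servers: list) -> str:
    """The naive alternative: server = hash(key) mod N."""
    return servers[_hash(key) % len(servers)]

# How much data moves when a fourth server joins?
keys = [f"url{i}" for i in range(10_000)]
before = ConsistentHashRing(["s1", "s2", "s3"])
after = ConsistentHashRing(["s1", "s2", "s3", "s4"])
ring_moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
mod_moved = sum(
    modulo_lookup(k, ["s1", "s2", "s3"]) != modulo_lookup(k, ["s1", "s2", "s3", "s4"])
    for k in keys
)
print(f"consistent hashing: {ring_moved / len(keys):.0%} of keys moved")  # roughly 25%
print(f"naive modulo:       {mod_moved / len(keys):.0%} of keys moved")   # roughly 75%
```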
The Systems You'll Design
This week, you'll design three production-grade systems. These aren't toy examples—they're systems that exist at companies like Bitly, Cloudflare, and Netflix.
System 1: URL Shortener (Days 1-2, 4)
Scale: 100 million URL creations per day, 1 billion redirects per day
You'll solve:
- How to generate unique short codes at scale (one approach is sketched after the diagram below)
- Where to store the mappings (partition strategy)
- How to handle viral URLs (1M hits/sec on one short code)
- Read replica architecture for global latency
Real-world parallel: Bitly, TinyURL, Twitter's t.co
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ Service │────▶│ Partition │
│ │ │ (Routing) │ │ Cluster │
└─────────────┘ └──────┬──────┘ └─────────────┘
│
┌──────▼──────┐
│ Cache │
│ (Redis) │
└─────────────┘
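One common approach to the short-code problem mentioned above is to base62-encode an ID issued by a counter or ticket service. The sketch below assumes that scheme, though the course may settle on a different one (for example, random codes with collision checks).

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters

def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a short base62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

# Seven base62 characters give 62**7 ≈ 3.5 trillion codes, comfortably more
# than 100M creations/day sustained for decades.
print(encode_base62(1_000_000_007))  # -> '15FTGn'
```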
System 2: Distributed Rate Limiter (Day 3)
Scale: 10,000 requests per second across 50 servers, per-user and per-API limits
You'll solve:
- Sliding window vs token bucket vs leaky bucket algorithms (a sliding-window sketch follows this section)
- How to rate limit when state is distributed
- What happens when your rate limit store fails
- The consistency vs availability trade-off (over-limit vs under-limit)
Real-world parallel: Cloudflare, AWS API Gateway, Stripe rate limiting
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Request 1 │────▶│ │ │ │
├─────────────┤ │ Rate │────▶│ Redis │
│ Request 2 │────▶│ Limiter │ │ Cluster │
├─────────────┤ │ Layer │ │ │
│ Request 3 │────▶│ │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
│
Allow/Deny Decision
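As a taste of the sliding-window approach, here is a sketch built on a Redis sorted set via the redis-py client. The key scheme, limit, and window are illustrative, and recording a request even when it is denied is a deliberate simplification:

```python
import time
import uuid
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    """Sliding-window log: one sorted-set member per request, scored by timestamp."""
    key = f"ratelimit:{user_id}"                     # illustrative key scheme
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)    # drop requests outside the window
    pipe.zadd(key, {uuid.uuid4().hex: now})          # record this request (even if denied)
    pipe.zcard(key)                                  # count requests still in the window
    pipe.expire(key, window_s)                       # idle keys clean themselves up
    _, _, count, _ = pipe.execute()
    return count <= limit
```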
System 3: User Session Store (Day 5)
Scale: 10 million concurrent sessions, cross-datacenter failover
You'll solve:
- Sticky sessions vs distributed sessions
- Session TTL and cleanup strategies
- What happens to sessions during datacenter failover
- Choosing between Redis Cluster, DynamoDB, and custom solutions
Real-world parallel: Netflix session management, Auth0, any large web application
┌─────────────────────────────────────────────────────┐
│ Load Balancer │
└───────────────────────┬─────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ App 1 │ │ App 2 │ │ App 3 │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────┼───────────────┘
▼
┌─────────────────┐
│ Session Store │
│ (Distributed) │
└─────────────────┘
Daily Breakdown
Day 1: Partitioning Deep-Dive
Theme: How to split data across machines
You'll learn:
- Hash partitioning: even distribution, but no range queries
- Range partitioning: great for time-series, but creates hot spots
- Directory-based: flexible, but adds a lookup dependency
- Consistent hashing: minimizes data movement during scaling
You'll design: URL shortener core architecture
The challenge question: "What if one short URL goes viral and gets 1M hits/sec?"
By end of day: You can choose a partition strategy based on access patterns, not gut feeling.
Day 2: Replication Trade-offs
Theme: Copies of data—for availability, for read scaling, for disaster recovery
You'll learn:
- Synchronous vs asynchronous replication
- Leader-follower vs multi-leader vs leaderless architectures
- Replication lag and its consequences
- Read-your-writes consistency and how to achieve it (see the sketch below)
You'll design: Read replica architecture for the URL shortener
The challenge question: "User creates a URL, immediately shares it, friend clicks—what happens with async replication?"
By end of day: You can explain exactly what trade-offs you're making with any replication strategy.
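One common way to get read-your-writes on top of asynchronous replication is to pin a user's reads to the leader for longer than the expected replication lag after they write. The sketch below illustrates that idea; the class name, the 5-second pin, and the single-replica read path are all hypothetical:

```python
import time

class ReplicaAwareRouter:
    """Route a user's reads to the leader for a short window after they write."""

    def __init__(self, leader, replicas, pin_seconds: float = 5.0):
        self.leader = leader
        self.replicas = replicas
        self.pin_seconds = pin_seconds            # should exceed typical replication lag
        self._last_write: dict[str, float] = {}   # user_id -> time of last write

    def write(self, user_id: str, query: str):
        self._last_write[user_id] = time.monotonic()
        return self.leader.execute(query)

    def read(self, user_id: str, query: str):
        wrote_recently = (
            time.monotonic() - self._last_write.get(user_id, float("-inf"))
            < self.pin_seconds
        )
        # Recent writers read from the leader; everyone else can tolerate stale replicas.
        node = self.leader if wrote_recently else self.replicas[0]
        return node.execute(query)
```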
Day 3: Rate Limiting at Scale
Theme: Controlling traffic without becoming a bottleneck yourself
You'll learn:
- Token bucket algorithm: smooth traffic, allows bursts (sketched below)
- Leaky bucket algorithm: strict rate, queues excess
- Sliding window: accurate counting, memory trade-off
- Distributed rate limiting: the coordination problem
You'll design: Rate limiter for 10K req/sec across 50 servers
The challenge question: "One Redis node dies—what's the behavior? Over-allow or under-allow?"
By end of day: You can design a rate limiter that degrades gracefully under failure.
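To make the token bucket's burst-versus-rate behaviour concrete, here is a minimal in-process sketch. The parameters are illustrative, and a distributed limiter would keep this state in a shared store such as Redis rather than in local memory:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second (steady-state limit)
    capacity: float   # maximum burst size
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self):
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=10, capacity=20)  # 10 req/s steady state, bursts up to 20
```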
Day 4: Hot Keys and Skew
Theme: When 0.01% of your data gets 90% of your traffic
You'll learn:
- Why hot keys happen (Zipf's law, viral content, celebrity effect)
- Detection: metrics, anomaly detection, real-time monitoring
- Mitigation: caching, key splitting, replica routing (key splitting is sketched below)
- Prevention: design patterns that avoid hot spots
You'll design: Redesigned URL shortener for viral URLs
The discussion: How does Instagram handle hot celebrity posts?
By end of day: You can design systems that don't fall over when one key goes viral.
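Of the mitigation patterns, key splitting is the easiest to sketch. The Redis-like store interface (incr/get), the split factor, and the hotness flag below are illustrative assumptions:

```python
import random

SPLIT_FACTOR = 16  # illustrative; tune to the observed hotness

def incr_counter(store, key: str, is_hot: bool) -> None:
    """Writes to a hot key fan out across SPLIT_FACTOR sub-keys."""
    suffix = random.randrange(SPLIT_FACTOR) if is_hot else 0
    store.incr(f"{key}:{suffix}")

def read_counter(store, key: str, is_hot: bool) -> int:
    """Reads gather the sub-keys back together: cheaper writes, costlier reads."""
    shards = range(SPLIT_FACTOR) if is_hot else [0]
    return sum(int(store.get(f"{key}:{i}") or 0) for i in shards)
```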
Day 5: Session Store Design
Theme: Putting it all together with a complete system design
You'll learn:
- Sticky sessions: simple, but limits scaling and failover
- Distributed sessions: complex, but true horizontal scaling
- Session data modeling: what to store, TTL strategies (a TTL-based sketch follows below)
- Datacenter failover: what happens to sessions?
You'll design: Session store for 10M concurrent users
The challenge question: "Datacenter failover happens—what's the user experience?"
Deliverable: A written decision document choosing between Redis Cluster, DynamoDB, and a custom solution.
By end of day: You've designed a complete stateful system with explicit failure handling.
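As a preview of one option in that decision document, here is a distributed-session sketch on Redis using the redis-py client. The key naming, 30-minute TTL, and sliding expiration on access are illustrative choices, not the course's prescribed design:

```python
import json
import uuid
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)
SESSION_TTL_S = 30 * 60  # illustrative 30-minute TTL

def create_session(user_id: str, data: dict) -> str:
    session_id = uuid.uuid4().hex
    payload = json.dumps({"user_id": user_id, **data})
    r.setex(f"session:{session_id}", SESSION_TTL_S, payload)  # TTL doubles as cleanup
    return session_id

def get_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    if raw is None:
        return None  # expired or never existed: force re-authentication
    r.expire(f"session:{session_id}", SESSION_TTL_S)  # sliding expiration on access
    return json.loads(raw)
```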
Key Questions You'll Answer This Week
By Friday, you should be able to confidently answer:
Partitioning
- When would you use hash partitioning vs range partitioning?
- What is consistent hashing and when is it worth the complexity?
- How do you handle a hot partition?
- What happens to data when you add or remove a partition?
Replication
- What's the difference between sync and async replication?
- When is eventual consistency acceptable? When is it not?
- How do you achieve read-your-writes consistency?
- What happens during leader failover?
Rate Limiting
- Token bucket vs sliding window—when to use each?
- How do you rate limit across multiple servers?
- Should you over-allow or under-allow during failures?
Hot Keys
- How do you detect hot keys in real-time?
- What are three strategies for mitigating hot keys?
- How would you design for a 1000x traffic spike on one key?
Sessions
- When are sticky sessions acceptable?
- What happens to sessions during datacenter failover?
- How do you choose session TTL?
Prerequisites Check
Before starting, make sure you're comfortable with:
Must Know
- Basic SQL and NoSQL concepts (you use databases daily)
- REST API design (you've built APIs)
- Basic networking (HTTP, TCP, DNS at a high level)
- Python/FastAPI (for pseudocode examples)
Helpful But Not Required
- Redis basics (we'll cover what you need)
- Distributed systems theory (we'll build intuition)
- Any specific database internals (we'll explain as we go)
Not Needed
- Academic distributed systems (Paxos, Raft internals)
- Specific cloud provider expertise
- Previous system design interview experience
How to Use the Daily Documents
Each day's document follows the same structure:
Part I: Foundations
└── Core concepts explained with depth
Part II: The Design Challenge
└── Applying concepts to a real system
Part III: Advanced Topics
└── Deeper dives for the curious
Part IV: Discussion and Trade-offs
└── The hard questions you should ask
Part V: Interview Questions
└── Real interview scenarios with strong answers
Exercises
└── Practice problems
Appendix: Code Reference
└── Working implementations
For the 1-hour session:
- 0-10 min: Read Part I together, discuss
- 10-45 min: Work through Part II, one person designs, other challenges
- 45-60 min: Discuss Part IV questions, document decisions
For deeper study (optional):
- Read Part III and Part V on your own
- Try the exercises
- Reference the code appendix when implementing
The Pair Learning Format
This week works best with two people. Here's why:
Designer Role
- Makes decisions and defends them
- Draws architecture diagrams
- Proposes solutions to challenges
Challenger Role
- Asks "what if X fails?"
- Injects scale: "now handle 10x traffic"
- Questions every assumption: "why not use Y instead?"
Rotate daily. If you designed on Monday, challenge on Tuesday.
Why This Works
- You learn by teaching: Explaining your design reveals gaps in understanding
- Adversarial thinking: Real systems fail; your partner simulates that
- Trade-off practice: Disagreement forces explicit trade-off discussions
- Interview prep: This is exactly how system design interviews work
What Success Looks Like
By end of Week 1, you will:
Understand deeply:
- Why partitioning is necessary and how to choose strategies
- The replication consistency spectrum and where you fall on it
- Rate limiting algorithms and their trade-offs
- Hot key detection and mitigation patterns
Have designed:
- A URL shortener handling 100M creates/day
- A distributed rate limiter across 50 servers
- A session store for 10M concurrent users
Have documented:
- Partition strategy decisions with justifications
- Replication choices with explicit consistency trade-offs
- Rate limiter failure mode handling
- Session store technology decision document
Be able to answer:
- 15+ interview questions on partitioning and replication
- "Design X" questions with structured, trade-off-aware responses
- "What happens when Y fails?" questions with concrete answers
Common Pitfalls to Avoid
Pitfall 1: Jumping to Solutions
Wrong: "We'll use Redis Cluster." Right: "Our access pattern is X, our consistency requirement is Y, our scale is Z. Given that, Redis Cluster makes sense because..."
Pitfall 2: Ignoring Failures
Wrong: "The data is replicated, so we're good." Right: "When the leader fails, here's what happens. During failover, these requests will fail. Our SLA allows for that."
Pitfall 3: Over-Engineering
Wrong: "We need consistent hashing with virtual nodes and anti-entropy protocols." Right: "At our current scale, simple modulo hashing works. Here's the trigger for when we'd need consistent hashing."
Pitfall 4: Under-Engineering
Wrong: "We'll just use one big PostgreSQL instance." Right: "Single node works for now. Here's the scale at which we'd need to partition, and here's the partition key we'd use."
Resources for This Week
Required
- Daily documents (Day 1-5)
- Shared whiteboard (Excalidraw, Miro, or physical)
- Note-taking space for decisions
Recommended Reading
- "Designing Data-Intensive Applications" Chapters 5-6 (Replication, Partitioning)
- Bitly engineering blog on URL shortener architecture
- Cloudflare blog on rate limiting at scale
Optional Deep Dives
- Amazon DynamoDB paper (consistent hashing in practice)
- Discord blog on message storage
- Stripe blog on rate limiting
Let's Begin
Day 1 starts with the most fundamental question in distributed storage:
How do you split data across multiple machines?
The answer determines your system's performance, availability, and operational complexity for years to come.
Open Day 1: Partitioning Deep-Dive. Let's go.
Quick Reference: Week 1 at a Glance
| Day | Topic | System | Key Question |
|---|---|---|---|
| 1 | Partitioning | URL Shortener | How do you split data across machines? |
| 2 | Replication | URL Shortener (extended) | What do you trade for copies of data? |
| 3 | Rate Limiting | Distributed Rate Limiter | How do you limit without becoming a bottleneck? |
| 4 | Hot Keys | URL Shortener (redesigned) | What happens when one key gets all the traffic? |
| 5 | Session Store | User Session Store | How do you tie it all together? |
Next: Day 1 — Partitioning Deep-Dive