Week 1 Preview: Data at Scale — Storage Trade-offs
System Design Mastery Series
Welcome to Week 1
This week marks the beginning of your transformation from an engineer who uses databases to one who designs storage systems. By Friday, you'll think differently about every database decision you make.
The Big Picture
What This Week Is About
Every backend system you'll ever build has one fundamental problem: where does the data live, and how do you access it efficiently?
This sounds simple until you have:
- 100 million records
- 10,000 requests per second
- Users on three continents
- A requirement for 99.99% uptime
Suddenly, "just use PostgreSQL" isn't an answer—it's the beginning of a hundred questions.
This week, we answer those questions.
The Mindset Shift
Before this week, you might think:
"We need a database. Should we use PostgreSQL or MongoDB?"
After this week, you'll think:
"What are our access patterns? What consistency do we actually need? How will the data grow? What happens when a node fails? Given all that, here's how we should store and partition this data."
That's the difference between a developer and a systems engineer.
What You'll Learn
Core Concepts
1. Partitioning (Sharding)
When one database isn't enough, you split data across multiple machines. But how you split determines everything:
| Strategy | Best For | Breaks When |
|---|---|---|
| Hash Partitioning | Even distribution, key-value lookups | You need range queries |
| Range Partitioning | Time-series, range scans | Sequential keys (e.g. timestamps) send all writes to one partition (hot spot) |
| Directory-Based | Maximum flexibility | Lookup service becomes a bottleneck |
You'll learn when each fails and how to choose.
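To make the table concrete, here is a minimal Python sketch of how a key would be routed under each strategy; the partition count, range boundaries, and function names are made up for illustration.

```python
import hashlib
from bisect import bisect_right

NUM_PARTITIONS = 8

def hash_partition(key: str) -> int:
    """Even distribution, but adjacent keys scatter across partitions,
    so a range query has to fan out to every partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Range partitioning: each partition owns a contiguous slice of the key space
# (here, crudely, by first letter).
RANGE_BOUNDARIES = ["d", "h", "m", "r", "v"]  # partition i owns keys below boundary i

def range_partition(key: str) -> int:
    """Keeps adjacent keys together (great for scans), but sequential keys
    such as timestamps all land on the last partition: the hot spot."""
    return bisect_right(RANGE_BOUNDARIES, key[0].lower())

for k in ["abc123", "abd124", "zzz999"]:
    print(k, "hash ->", hash_partition(k), "| range ->", range_partition(k))
```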
2. Replication
Copies of data for availability and read scaling. Sounds simple, but:
- Synchronous replication: Strong consistency, but one slow replica slows everyone
- Asynchronous replication: Fast writes, but replicas can serve stale data
You'll design systems that handle the trade-offs explicitly.
3. Consistency Models
"Consistency" means different things:
- Strong consistency: Every read sees the latest write
- Eventual consistency: Reads might be stale, but will converge
- Read-your-writes: You see your own writes immediately (others might not)
You'll learn which systems need which level—and why "eventual" is often fine.
4. Hot Keys and Skew
In the real world, data isn't uniform:
- 0.01% of URLs get 90% of traffic
- Celebrity accounts have millions of followers
- Black Friday creates 100x normal load on specific products
You'll design systems that don't fall over when one key goes viral.
5. Consistent Hashing
The elegant answer to "what happens when we add or remove a database server?" You'll understand:
- Why naive modulo hashing causes massive data movement
- How consistent hashing minimizes resharding pain
- When it's worth the complexity (and when it's not)
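Here is a minimal sketch of the idea, assuming MD5 as the hash and made-up server names (s1 through s4); production rings, such as those in Dynamo-style stores, add replication and careful tuning on top of this:

```python
import hashlib
from bisect import bisect

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each server owns many points ('virtual nodes') on a hash ring;
    a key belongs to the first point clockwise from its own hash."""

    def __init__(self, servers, vnodes=100):
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        idx = bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

def modulo_lookup(key: str, servers: list) -> str:
    """The naive alternative: server = hash(key) mod N."""
    return servers[_hash(key) % len(servers)]

# How much data moves when a fourth server joins?
keys = [f"url{i}" for i in range(10_000)]
before = ConsistentHashRing(["s1", "s2", "s3"])
after = ConsistentHashRing(["s1", "s2", "s3", "s4"])
ring_moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
mod_moved = sum(
    modulo_lookup(k, ["s1", "s2", "s3"]) != modulo_lookup(k, ["s1", "s2", "s3", "s4"])
    for k in keys
)
print(f"consistent hashing: {ring_moved / len(keys):.0%} of keys moved")  # roughly 25%
print(f"naive modulo:       {mod_moved / len(keys):.0%} of keys moved")   # roughly 75%
```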
The Systems You'll Design
This week, you'll design three production-grade systems. These aren't toy examples—they're systems that exist at companies like Bitly, Cloudflare, and Netflix.
System 1: URL Shortener (Days 1-2, 4)
Scale: 100 million URL creations per day, 1 billion redirects per day
You'll solve:
- How to generate unique short codes at scale (one approach is sketched after the diagram below)
- Where to store the mappings (partition strategy)
- How to handle viral URLs (1M hits/sec on one short code)
- Read replica architecture for global latency
Real-world parallel: Bitly, TinyURL, Twitter's t.co
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ Service │────▶│ Partition │
│ │ │ (Routing) │ │ Cluster │
└─────────────┘ └──────┬──────┘ └─────────────┘
│
┌──────▼──────┐
│ Cache │
│ (Redis) │
└─────────────┘
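One common approach to the short-code problem mentioned above is to base62-encode an ID issued by a counter or ticket service. The sketch below assumes that scheme, though the course may settle on a different one (for example, random codes with collision checks).

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters

def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a short base62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

# Seven base62 characters give 62**7 ≈ 3.5 trillion codes, comfortably more
# than 100M creations/day sustained for decades.
print(encode_base62(1_000_000_007))  # -> '15FTGn'
```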
System 2: Distributed Rate Limiter (Day 3)
Scale: 10,000 requests per second across 50 servers, per-user and per-API limits
You'll solve:
- Sliding window vs token bucket vs leaky bucket algorithms (a sliding-window sketch follows this section)
- How to rate limit when state is distributed
- What happens when your rate limit store fails
- The consistency vs availability trade-off (over-limit vs under-limit)
Real-world parallel: Cloudflare, AWS API Gateway, Stripe rate limiting
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Request 1 │────▶│ │ │ │
├─────────────┤ │ Rate │────▶│ Redis │
│ Request 2 │────▶│ Limiter │ │ Cluster │
├─────────────┤ │ Layer │ │ │
│ Request 3 │────▶│ │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
│
Allow/Deny Decision
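As a taste of the sliding-window approach, here is a sketch built on a Redis sorted set via the redis-py client. The key scheme, limit, and window are illustrative, and recording a request even when it is denied is a deliberate simplification:

```python
import time
import uuid
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    """Sliding-window log: one sorted-set member per request, scored by timestamp."""
    key = f"ratelimit:{user_id}"                     # illustrative key scheme
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)    # drop requests outside the window
    pipe.zadd(key, {uuid.uuid4().hex: now})          # record this request (even if denied)
    pipe.zcard(key)                                  # count requests still in the window
    pipe.expire(key, window_s)                       # idle keys clean themselves up
    _, _, count, _ = pipe.execute()
    return count <= limit
```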
System 3: User Session Store (Day 5)
Scale: 10 million concurrent sessions, cross-datacenter failover
You'll solve:
- Sticky sessions vs distributed sessions
- Session TTL and cleanup strategies
- What happens to sessions during datacenter failover
- Choosing between Redis Cluster, DynamoDB, and custom solutions
Real-world parallel: Netflix session management, Auth0, any large web application
┌─────────────────────────────────────────────────────┐
│ Load Balancer │
└───────────────────────┬─────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ App 1 │ │ App 2 │ │ App 3 │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────┼───────────────┘
▼
┌─────────────────┐
│ Session Store │
│ (Distributed) │
└─────────────────┘
Daily Breakdown
Day 1: Partitioning Deep-Dive
Theme: How to split data across machines
You'll learn:
- Hash partitioning: even distribution, but no range queries
- Range partitioning: great for time-series, but creates hot spots
- Directory-based: flexible, but adds a lookup dependency
- Consistent hashing: minimizes data movement during scaling
You'll design: URL shortener core architecture
The challenge question: "What if one short URL goes viral and gets 1M hits/sec?"
By end of day: You can choose a partition strategy based on access patterns, not gut feeling.
Day 2: Replication Trade-offs
Theme: Copies of data—for availability, for read scaling, for disaster recovery
You'll learn:
- Synchronous vs asynchronous replication
- Leader-follower vs multi-leader vs leaderless architectures
- Replication lag and its consequences
- Read-your-writes consistency and how to achieve it (see the sketch below)
You'll design: Read replica architecture for the URL shortener
The challenge question: "User creates a URL, immediately shares it, friend clicks—what happens with async replication?"
By end of day: You can explain exactly what trade-offs you're making with any replication strategy.
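One common way to get read-your-writes on top of asynchronous replication is to pin a user's reads to the leader for longer than the expected replication lag after they write. The sketch below illustrates that idea; the class name, the 5-second pin, and the single-replica read path are all hypothetical:

```python
import time

class ReplicaAwareRouter:
    """Route a user's reads to the leader for a short window after they write."""

    def __init__(self, leader, replicas, pin_seconds: float = 5.0):
        self.leader = leader
        self.replicas = replicas
        self.pin_seconds = pin_seconds            # should exceed typical replication lag
        self._last_write: dict[str, float] = {}   # user_id -> time of last write

    def write(self, user_id: str, query: str):
        self._last_write[user_id] = time.monotonic()
        return self.leader.execute(query)

    def read(self, user_id: str, query: str):
        wrote_recently = (
            time.monotonic() - self._last_write.get(user_id, float("-inf"))
            < self.pin_seconds
        )
        # Recent writers read from the leader; everyone else can tolerate stale replicas.
        node = self.leader if wrote_recently else self.replicas[0]
        return node.execute(query)
```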
Day 3: Rate Limiting at Scale
Theme: Controlling traffic without becoming a bottleneck yourself
You'll learn:
- Token bucket algorithm: smooth traffic, allows bursts (sketched below)
- Leaky bucket algorithm: strict rate, queues excess
- Sliding window: accurate counting, memory trade-off
- Distributed rate limiting: the coordination problem
You'll design: Rate limiter for 10K req/sec across 50 servers
The challenge question: "One Redis node dies—what's the behavior? Over-allow or under-allow?"
By end of day: You can design a rate limiter that degrades gracefully under failure.
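To make the token bucket's burst-versus-rate behaviour concrete, here is a minimal in-process sketch. The parameters are illustrative, and a distributed limiter would keep this state in a shared store such as Redis rather than in local memory:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float       # tokens refilled per second (steady-state limit)
    capacity: float   # maximum burst size
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self):
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=10, capacity=20)  # 10 req/s steady state, bursts up to 20
```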
Day 4: Hot Keys and Skew
Theme: When 0.01% of your data gets 90% of your traffic
You'll learn:
- Why hot keys happen (Zipf's law, viral content, celebrity effect)
- Detection: metrics, anomaly detection, real-time monitoring
- Mitigation: caching, key splitting, replica routing (key splitting is sketched below)
- Prevention: design patterns that avoid hot spots
You'll design: Redesigned URL shortener for viral URLs
The discussion: How does Instagram handle hot celebrity posts?
By end of day: You can design systems that don't fall over when one key goes viral.
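Of the mitigation patterns, key splitting is the easiest to sketch. The Redis-like store interface (incr/get), the split factor, and the hotness flag below are illustrative assumptions:

```python
import random

SPLIT_FACTOR = 16  # illustrative; tune to the observed hotness

def incr_counter(store, key: str, is_hot: bool) -> None:
    """Writes to a hot key fan out across SPLIT_FACTOR sub-keys."""
    suffix = random.randrange(SPLIT_FACTOR) if is_hot else 0
    store.incr(f"{key}:{suffix}")

def read_counter(store, key: str, is_hot: bool) -> int:
    """Reads gather the sub-keys back together: cheaper writes, costlier reads."""
    shards = range(SPLIT_FACTOR) if is_hot else [0]
    return sum(int(store.get(f"{key}:{i}") or 0) for i in shards)
```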
Day 5: Session Store Design
Theme: Putting it all together with a complete system design
You'll learn:
- Sticky sessions: simple, but limits scaling and failover
- Distributed sessions: complex, but true horizontal scaling
- Session data modeling: what to store, TTL strategies (a TTL-based sketch follows below)
- Datacenter failover: what happens to sessions?
You'll design: Session store for 10M concurrent users
The challenge question: "Datacenter failover happens—what's the user experience?"
Deliverable: A written decision document choosing between Redis Cluster, DynamoDB, and a custom solution.
By end of day: You've designed a complete stateful system with explicit failure handling.
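As a preview of one option in that decision document, here is a distributed-session sketch on Redis using the redis-py client. The key naming, 30-minute TTL, and sliding expiration on access are illustrative choices, not the course's prescribed design:

```python
import json
import uuid
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)
SESSION_TTL_S = 30 * 60  # illustrative 30-minute TTL

def create_session(user_id: str, data: dict) -> str:
    session_id = uuid.uuid4().hex
    payload = json.dumps({"user_id": user_id, **data})
    r.setex(f"session:{session_id}", SESSION_TTL_S, payload)  # TTL doubles as cleanup
    return session_id

def get_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    if raw is None:
        return None  # expired or never existed: force re-authentication
    r.expire(f"session:{session_id}", SESSION_TTL_S)  # sliding expiration on access
    return json.loads(raw)
```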
Key Questions You'll Answer This Week
By Friday, you should be able to confidently answer:
Partitioning
- When would you use hash partitioning vs range partitioning?
- What is consistent hashing and when is it worth the complexity?
- How do you handle a hot partition?
- What happens to data when you add or remove a partition?
Replication
- What's the difference between sync and async replication?
- When is eventual consistency acceptable? When is it not?
- How do you achieve read-your-writes consistency?
- What happens during leader failover?
Rate Limiting
- Token bucket vs sliding window—when to use each?
- How do you rate limit across multiple servers?
- Should you over-allow or under-allow during failures?
Hot Keys
- How do you detect hot keys in real-time?
- What are three strategies for mitigating hot keys?
- How would you design for a 1000x traffic spike on one key?
Sessions
- When are sticky sessions acceptable?
- What happens to sessions during datacenter failover?
- How do you choose session TTL?
Prerequisites Check
Before starting, make sure you're comfortable with:
Must Know
- Basic SQL and NoSQL concepts (you use databases daily)
- REST API design (you've built APIs)
- Basic networking (HTTP, TCP, DNS at a high level)
- Python/FastAPI (for pseudocode examples)
Helpful But Not Required
- Redis basics (we'll cover what you need)
- Distributed systems theory (we'll build intuition)
- Any specific database internals (we'll explain as we go)
Not Needed
- Academic distributed systems (Paxos, Raft internals)
- Specific cloud provider expertise
- Previous system design interview experience
How to Use the Daily Documents
Each day's document follows the same structure:
Part I: Foundations
└── Core concepts explained with depth
Part II: The Design Challenge
└── Applying concepts to a real system
Part III: Advanced Topics
└── Deeper dives for the curious
Part IV: Discussion and Trade-offs
└── The hard questions you should ask
Part V: Interview Questions
└── Real interview scenarios with strong answers
Exercises
└── Practice problems
Appendix: Code Reference
└── Working implementations
For the 1-hour session:
- 0-10 min: Read Part I together, discuss
- 10-45 min: Work through Part II, one person designs, other challenges
- 45-60 min: Discuss Part IV questions, document decisions
For deeper study (optional):
- Read Part III and Part V on your own
- Try the exercises
- Reference the code appendix when implementing
The Pair Learning Format
This week works best with two people. Here's why:
Designer Role
- Makes decisions and defends them
- Draws architecture diagrams
- Proposes solutions to challenges
Challenger Role
- Asks "what if X fails?"
- Injects scale: "now handle 10x traffic"
- Questions every assumption: "why not use Y instead?"
Rotate daily. If you designed on Monday, challenge on Tuesday.
Why This Works
- You learn by teaching: Explaining your design reveals gaps in understanding
- Adversarial thinking: Real systems fail; your partner simulates that
- Trade-off practice: Disagreement forces explicit trade-off discussions
- Interview prep: This is exactly how system design interviews work
What Success Looks Like
By end of Week 1, you will:
Understand deeply:
- Why partitioning is necessary and how to choose strategies
- The replication consistency spectrum and where you fall on it
- Rate limiting algorithms and their trade-offs
- Hot key detection and mitigation patterns
Have designed:
- A URL shortener handling 100M creates/day
- A distributed rate limiter across 50 servers
- A session store for 10M concurrent users
Have documented:
- Partition strategy decisions with justifications
- Replication choices with explicit consistency trade-offs
- Rate limiter failure mode handling
- Session store technology decision document
Be able to answer:
- 15+ interview questions on partitioning and replication
- "Design X" questions with structured, trade-off-aware responses
- "What happens when Y fails?" questions with concrete answers
Common Pitfalls to Avoid
Pitfall 1: Jumping to Solutions
Wrong: "We'll use Redis Cluster." Right: "Our access pattern is X, our consistency requirement is Y, our scale is Z. Given that, Redis Cluster makes sense because..."
Pitfall 2: Ignoring Failures
Wrong: "The data is replicated, so we're good." Right: "When the leader fails, here's what happens. During failover, these requests will fail. Our SLA allows for that."
Pitfall 3: Over-Engineering
Wrong: "We need consistent hashing with virtual nodes and anti-entropy protocols." Right: "At our current scale, simple modulo hashing works. Here's the trigger for when we'd need consistent hashing."
Pitfall 4: Under-Engineering
Wrong: "We'll just use one big PostgreSQL instance." Right: "Single node works for now. Here's the scale at which we'd need to partition, and here's the partition key we'd use."
Resources for This Week
Required
- Daily documents (Day 1-5)
- Shared whiteboard (Excalidraw, Miro, or physical)
- Note-taking space for decisions
Recommended Reading
- "Designing Data-Intensive Applications" Chapters 5-6 (Replication, Partitioning)
- Bitly engineering blog on URL shortener architecture
- Cloudflare blog on rate limiting at scale
Optional Deep Dives
- Amazon DynamoDB paper (consistent hashing in practice)
- Discord blog on message storage
- Stripe blog on rate limiting
Let's Begin
Day 1 starts with the most fundamental question in distributed storage:
How do you split data across multiple machines?
The answer determines your system's performance, availability, and operational complexity for years to come.
Open Day 1: Partitioning Deep-Dive. Let's go.
Quick Reference: Week 1 at a Glance
| Day | Topic | System | Key Question |
|---|---|---|---|
| 1 | Partitioning | URL Shortener | How do you split data across machines? |
| 2 | Replication | URL Shortener (extended) | What do you trade for copies of data? |
| 3 | Rate Limiting | Distributed Rate Limiter | How do you limit without becoming a bottleneck? |
| 4 | Hot Keys | URL Shortener (redesigned) | What happens when one key gets all the traffic? |
| 5 | Session Store | User Session Store | How do you tie it all together? |
Next: Day 1 — Partitioning Deep-Dive