Himanshu Kukreja

Week 1 — Day 1: Partitioning Deep-Dive

System Design Mastery Series


Preface

This is the first day of your 10-week journey from intermediate to advanced system design. Today, we tackle one of the most fundamental decisions you'll make when building systems at scale: how to partition your data.

By the end of this session, you won't just know what partitioning is—you'll understand when each strategy breaks and what trade-offs you're making with every choice.

Let's begin.


Part I: Foundations

Chapter 1: Why Partitioning Matters

1.1 The Single-Node Ceiling

Every database starts simple. One machine, one disk, one process handling all your reads and writes. For a while, this works beautifully. Your PostgreSQL instance handles 1,000 requests per second without breaking a sweat.

Then your startup grows.

Suddenly you're handling 10,000 requests per second. Your single PostgreSQL instance starts sweating. CPU pegged at 100%. Disk I/O becomes a bottleneck. Queries that took 10ms now take 500ms.

You have three options:

  1. Vertical Scaling: Buy a bigger machine. More CPU, more RAM, more disk.
  2. Read Replicas: Copy data to multiple machines, distribute read load.
  3. Partitioning: Split data across multiple machines.

Vertical scaling has limits—there's only so big a machine can get, and costs climb steeply long before you reach that ceiling. Read replicas help with read-heavy workloads but don't solve write bottlenecks.

Partitioning (also called sharding) is how you truly scale writes. You split your data across multiple machines, each responsible for a subset of the data.

1.2 The Core Problem Partitioning Solves

Partitioning addresses three fundamental constraints:

Constraint    Single-Machine Limit            With Partitioning
Storage       Disk size (e.g., 10TB)          Sum of all partition disks
Throughput    CPU/IO bound (e.g., 10K QPS)    Aggregate of all partitions
Memory        RAM size (e.g., 256GB)          Sum of all partition RAM

But partitioning isn't free. The moment you split data across machines, you inherit a new set of problems:

  • How do you decide which data goes where?
  • How do you query data that spans multiple partitions?
  • What happens when you need to add or remove machines?
  • How do you handle "hot" partitions that get more traffic than others?

These are the questions we'll answer today.


Chapter 2: Partitioning Strategies

There are three fundamental approaches to partitioning data. Each has distinct characteristics, and choosing the wrong one can cripple your system.

2.1 Hash Partitioning

How It Works

Take a key (user_id, order_id, url_id), hash it, and use the hash to determine which partition owns that key.

partition = hash(key) % number_of_partitions

For example, with 4 partitions:

hash("user_123") = 847291 → 847291 % 4 = 3 → Partition 3
hash("user_456") = 291847 → 291847 % 4 = 3 → Partition 3
hash("user_789") = 123456 → 123456 % 4 = 0 → Partition 0

Strengths

  • Even Distribution: A good hash function spreads data uniformly across partitions. No partition should be significantly larger than others.
  • Simple Lookup: Given a key, you can compute the partition in O(1) time without consulting any external service.
  • Predictable: The same key always maps to the same partition (until you change the number of partitions).

Weaknesses

  • No Range Queries: Want all users with IDs between 1000 and 2000? With hash partitioning, those users are scattered across all partitions. You must query every partition.
  • Resharding Hell: When you add or remove partitions, the modulo changes. hash(key) % 4 gives different results than hash(key) % 5. Suddenly, most of your data is on the "wrong" partition and needs to move.

When Hash Partitioning Breaks

Scenario: You're building a URL shortener. You hash the short URL code to determine which partition stores the mapping. Works great.

Now product asks: "Give me analytics on all URLs created in the last hour."

With hash partitioning, there's no way to find "recent URLs" without querying every partition. The creation timestamp has no relationship to the partition assignment.

Takeaway: Hash partitioning is excellent when you always access data by a single key. It's terrible when you need range-based access patterns.


2.2 Range Partitioning

How It Works

Divide the key space into contiguous ranges, with each partition owning a range.

Partition 0: keys A-F
Partition 1: keys G-L
Partition 2: keys M-R
Partition 3: keys S-Z

Or for numeric keys:

Partition 0: user_id 1 - 1,000,000
Partition 1: user_id 1,000,001 - 2,000,000
Partition 2: user_id 2,000,001 - 3,000,000
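
Routing here is a boundary lookup rather than a hash. A minimal sketch, assuming the boundaries above are stored as a sorted list of inclusive upper bounds:

import bisect

# Inclusive upper bound of each partition's user_id range
UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_partition_for(user_id: int) -> int:
    # Index of the first partition whose upper bound covers this id
    return bisect.bisect_left(UPPER_BOUNDS, user_id)

print(range_partition_for(1_500_000))  # -> 1 (Partition 1)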

Strengths

  • Range Queries: Finding all users with IDs between 1,500,000 and 1,600,000? That's a query to a single partition.
  • Locality: Related data (e.g., users who signed up around the same time) often ends up on the same partition, which can improve query performance.
  • Flexible Boundaries: You can adjust partition boundaries without a complete reshuffle—just split one partition into two or merge two into one.

Weaknesses

  • Skewed Distribution: If your data isn't uniformly distributed across the key range, some partitions will be much larger than others.
  • Hot Partitions: New users get the highest IDs, so the last partition handles all new writes. That partition becomes a bottleneck.
  • Boundary Management: Someone or something needs to track partition boundaries and update them as data grows.

When Range Partitioning Breaks

Scenario: You partition users by user_id range. User IDs are auto-incrementing integers.

Your app goes viral. Thousands of new users sign up every minute. All new users have high IDs. All writes hit the last partition.

That partition is overloaded. The others sit idle.

Scenario 2: You partition time-series data by timestamp range. Each partition covers one month.

It's December. All writes go to the December partition. All 11 other partitions are read-only.

Takeaway: Range partitioning works well when access is spread across the range. It fails badly when access concentrates at one end of the range.


2.3 Directory-Based (Lookup) Partitioning

How It Works

Maintain a separate lookup table that maps keys to partitions. Every access first consults this directory to find the right partition.

Directory Table:
  key_range "user_1-1000"     → Partition A
  key_range "user_1001-5000"  → Partition B
  key_range "user_5001-5500"  → Partition A  (yes, can jump back)
  key_range "user_5501-10000" → Partition C

Strengths

  • Maximum Flexibility: You can assign any key to any partition. No algorithmic constraints.
  • Easy Rebalancing: Moving data between partitions just requires updating the directory, then migrating the data.
  • Handle Skew: If one partition gets hot, you can split its keys across multiple partitions.

Weaknesses

  • Single Point of Failure: The directory service becomes critical infrastructure. If it's down, no one can find their data.
  • Additional Latency: Every request requires a directory lookup before accessing data.
  • Consistency Challenges: The directory must stay in sync with actual data location during migrations.

When Directory-Based Partitioning Breaks

Scenario: Your directory service is a single Redis instance. It fails. Now no service can determine where any data lives. Complete outage.

Scenario 2: You're migrating data from Partition A to Partition B. You update the directory, but the data migration is still in progress. Requests go to Partition B, which doesn't have the data yet.

Takeaway: Directory-based partitioning is powerful but introduces operational complexity. You trade algorithmic simplicity for infrastructural overhead.


Chapter 3: The Resharding Problem

Every partitioning strategy must answer: What happens when you add or remove partitions?

3.1 The Naive Approach and Why It Hurts

With hash partitioning using hash(key) % N, changing N changes almost every key's partition assignment.

Example with 4 partitions, adding a 5th:

Before (N=4):
  hash("url_abc") = 100 → 100 % 4 = 0 → Partition 0
  hash("url_def") = 101 → 101 % 4 = 1 → Partition 1
  hash("url_ghi") = 102 → 102 % 4 = 2 → Partition 2
  hash("url_jkl") = 103 → 103 % 4 = 3 → Partition 3
  hash("url_mno") = 104 → 104 % 4 = 0 → Partition 0

After (N=5):
  hash("url_abc") = 100 → 100 % 5 = 0 → Partition 0  ✓ (same)
  hash("url_def") = 101 → 101 % 5 = 1 → Partition 1  ✓ (same)
  hash("url_ghi") = 102 → 102 % 5 = 2 → Partition 2  ✓ (same)
  hash("url_jkl") = 103 → 103 % 5 = 3 → Partition 3  ✓ (same)
  hash("url_mno") = 104 → 104 % 5 = 4 → Partition 4  ✗ (MOVED)

In this cherry-picked example, only one key moved. But statistically, a key keeps its assignment when going from N to N+1 partitions only if hash(key) % N equals hash(key) % (N+1), which happens for roughly 1/(N+1) of keys. In other words, about N/(N+1) of your data has to move.

Going from 4 to 5 partitions: ~80% of data moves. Going from 100 to 101 partitions: ~99% of data moves.

Even doubling from 4 to 8 partitions moves about half the data: a key stays put only when hash(key) % 8 happens to land below 4.
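
A quick simulation confirms these fractions (a sketch assuming a uniform, stable hash over synthetic keys):

import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def fraction_moved(old_n: int, new_n: int, samples: int = 100_000) -> float:
    # Fraction of keys whose partition changes when going from old_n to new_n
    moved = sum(
        stable_hash(f"url_{i}") % old_n != stable_hash(f"url_{i}") % new_n
        for i in range(samples)
    )
    return moved / samples

print(fraction_moved(4, 5))      # ~0.80
print(fraction_moved(4, 8))      # ~0.50
print(fraction_moved(100, 101))  # ~0.99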

During migration:

  • Data exists in two places (old and new partition)
  • Reads might hit stale data
  • Writes must be coordinated carefully
  • This can take hours or days for large datasets

3.2 Consistent Hashing: The Elegant Solution

Consistent hashing minimizes data movement when partitions change.

The Concept

Imagine a circular hash space (a ring) from 0 to 2^32 - 1. Both keys and partitions are hashed onto this ring.

        0
        |
   P3 ●─┼─● P0
       \│/
    ────●────
       /│\
   P2 ●─┼─● P1
        |
      2^31

To find which partition owns a key:

  1. Hash the key to a position on the ring
  2. Walk clockwise until you hit a partition
  3. That partition owns the key

Why It Minimizes Movement

When you add a partition, you insert it at one point on the ring. Only keys between the new partition and its predecessor move. All other keys stay where they are.

Before adding P4:
  Keys in arc (P2, P3] → owned by P3

After adding P4 between P2 and P3:
  Keys in arc (P2, P4] → owned by P4  (moved from P3)
  Keys in arc (P4, P3] → owned by P3  (stayed)

Only keys in the specific arc move. With N partitions, adding one partition moves only 1/N of the data—but that data comes from just one or two existing partitions, not all of them.

Virtual Nodes: Solving the Balance Problem

Basic consistent hashing has a flaw: partitions might be unevenly spaced on the ring, leading to unequal load.

Solution: Instead of mapping each physical partition to one point on the ring, map it to many points (virtual nodes).

Physical Partition A → Virtual nodes at positions: 1000, 5000, 9000, 13000, ...
Physical Partition B → Virtual nodes at positions: 2500, 6500, 10500, 14500, ...
Physical Partition C → Virtual nodes at positions: 4000, 8000, 12000, 16000, ...

With hundreds of virtual nodes per physical partition, the key distribution becomes statistically even.

The Trade-off

Consistent hashing adds complexity:

  • You need to maintain the ring structure
  • Lookups require finding the next node on the ring (typically binary search over a sorted array, or a balanced tree)
  • Virtual node mappings must be stored and distributed

Is it worth it? Depends on how often you add/remove partitions and how painful data migration is for your system.


Part II: The Design Challenge

Chapter 4: Designing a URL Shortener at Scale

Now let's apply these concepts to a real system. You're designing a URL shortener that must handle:

  • 100 million URL creations per day (~1,150 per second average, expect 10x peaks)
  • 1 billion redirects per day (~11,500 per second average)
  • Read-heavy: 10:1 read-to-write ratio
  • Low latency: Redirects must complete in < 50ms at p99
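
Those per-second figures fall out of simple arithmetic. A quick back-of-the-envelope check (a sketch assuming load spread evenly over the day and the stated 10x peak factor):

SECONDS_PER_DAY = 86_400

creates_per_day = 100_000_000
redirects_per_day = 1_000_000_000

avg_create_qps = creates_per_day / SECONDS_PER_DAY      # ~1,157 per second
avg_redirect_qps = redirects_per_day / SECONDS_PER_DAY   # ~11,574 per second
peak_redirect_qps = avg_redirect_qps * 10                # ~115,740 per second

print(round(avg_create_qps), round(avg_redirect_qps), round(peak_redirect_qps))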

4.1 Data Model

At its core, a URL shortener needs one mapping:

short_code → original_url

Additional metadata you might store:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class URLMapping:
    short_code: str                  # e.g., "abc123"
    original_url: str                # e.g., "https://example.com/very/long/path"
    created_at: datetime
    expires_at: datetime | None = None
    creator_id: str | None = None
    click_count: int = 0             # if you're tracking analytics

4.2 Choosing a Partition Key

Option A: Partition by short_code (hash)

partition = hash(short_code) % num_partitions

Pros:

  • Every redirect request includes the short_code; lookups are O(1)
  • Even distribution across partitions

Cons:

  • No way to query "all URLs created by user X" without hitting all partitions
  • No way to query "all URLs created today" efficiently

Option B: Partition by creator_id (hash)

partition = hash(creator_id) % num_partitions

Pros:

  • All URLs by one user are on one partition
  • Can query user's URLs efficiently

Cons:

  • Redirects don't include creator_id—you'd need a lookup first
  • Some users create millions of URLs; others create one. Skewed distribution.

Option C: Partition by creation timestamp (range)

partition = timestamp_to_partition(created_at)

Pros:

  • Time-range queries are efficient
  • Natural expiration—old partitions can be archived

Cons:

  • All writes hit the "current" partition
  • Redirects don't include a timestamp—you'd need a global index to find the right partition

4.3 The Right Choice

For a URL shortener where the primary operation is redirect(short_code), hash partitioning by short_code is the clear winner.

# Simple and effective. Use a stable hash: Python's built-in hash() is
# randomized per process, which would scramble the key-to-partition mapping.
import hashlib

def get_partition(short_code: str, num_partitions: int) -> int:
    digest = hashlib.md5(short_code.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def redirect(short_code: str) -> str:
    partition = get_partition(short_code, NUM_PARTITIONS)
    db = get_db_connection(partition)
    result = db.query("SELECT original_url FROM urls WHERE short_code = ?", short_code)
    return result.original_url

For secondary access patterns (user's URLs, analytics), you have options:

  1. Accept the scatter-gather: Query all partitions, merge results (sketched below)
  2. Secondary index: Maintain a separate user_id → short_codes mapping
  3. Async replication: Copy data to an analytics-optimized store
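
A minimal sketch of option 1, the scatter-gather query (it reuses the hypothetical NUM_PARTITIONS and get_db_connection helpers from the snippets above):

def get_urls_for_user(creator_id: str) -> list[str]:
    # Fan out to every partition, then merge the partial results
    results: list[str] = []
    for partition in range(NUM_PARTITIONS):
        db = get_db_connection(partition)
        rows = db.query(
            "SELECT short_code FROM urls WHERE creator_id = ?", creator_id
        )
        results.extend(row.short_code for row in rows)
    return results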

4.4 Handling the Hot Key Problem

Your URL shortener is humming along. Then a celebrity tweets a link, and one short_code gets 1 million hits per second.

That short_code lives on one partition. That partition is now receiving 1M requests/second while others sit at 10K.

Detection

Before you can fix hot keys, you need to detect them:

# Track request counts per key in a coarse, resetting window
# (Appendix A.2 has a proper sliding-window version)
import time
from collections import defaultdict

class HotKeyDetector:
    def __init__(self, threshold: int = 10000, window_seconds: int = 60):
        self.counts: dict[str, int] = defaultdict(int)
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.window_start = time.time()

    def record(self, key: str) -> bool:
        # Reset all counts when the window rolls over
        if time.time() - self.window_start > self.window_seconds:
            self.counts.clear()
            self.window_start = time.time()
        self.counts[key] += 1
        return self.counts[key] > self.threshold

Mitigation Strategies

Strategy 1: Read Replicas for Hot Partitions

If partition 3 is hot because of reads, add read replicas specifically for partition 3.

                   ┌─────────────┐
                   │  Partition 3│
                   │   (Primary) │
                   └──────┬──────┘
                          │ replication
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌──────────┐    ┌──────────┐    ┌──────────┐
    │ Replica 1│    │ Replica 2│    │ Replica 3│
    └──────────┘    └──────────┘    └──────────┘

Requests are load-balanced across replicas. The primary handles writes.
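
A minimal sketch of the routing change, assuming a hypothetical get_replica_connections(partition) helper that returns open connections to that partition's replicas:

import random

def read_url(partition: int, short_code: str) -> str:
    # Reads for a hot partition spread across its replicas; writes still
    # go to the partition primary via get_db_connection(partition)
    replicas = get_replica_connections(partition)  # hypothetical helper
    db = random.choice(replicas) if replicas else get_db_connection(partition)
    result = db.query(
        "SELECT original_url FROM urls WHERE short_code = ?", short_code
    )
    return result.original_url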

Strategy 2: Key Splitting (Salting)

Split one logical key into multiple physical keys across partitions.

import random

# One logical key ("viral_url") becomes multiple physical copies, placed on the
# partitions chosen by salting the key: "viral_url:0", "viral_url:1", ...

def get_partitions_for_hot_key(key: str, split_factor: int = 10) -> list[int]:
    # Reuses the stable get_partition() defined earlier
    return [get_partition(f"{key}:{i}", NUM_PARTITIONS) for i in range(split_factor)]

def write_hot_key(key: str, value: str):
    # Write a copy to every partition chosen by the salted probe keys
    for partition in get_partitions_for_hot_key(key):
        db = get_db_connection(partition)
        db.write(key, value)

def read_hot_key(key: str) -> str:
    # Read one randomly chosen copy to spread the load
    partition = random.choice(get_partitions_for_hot_key(key))
    db = get_db_connection(partition)
    return db.read(key)

Strategy 3: Caching Layer

Put a cache in front of your partitions. Hot keys get served from cache, never hitting the database.

Request → Cache Check → [Hit] → Return cached value
                     → [Miss] → Query partition → Cache result → Return

For a URL shortener, caching is natural:

  • URLs rarely change after creation
  • High temporal locality (popular URLs get accessed repeatedly)
  • Small value size (just a URL string)

class CachedURLShortener:
    def __init__(self):
        self.cache = Redis()  # or local LRU cache
        self.cache_ttl = 3600  # 1 hour

    def redirect(self, short_code: str) -> str:
        # Try cache first
        cached = self.cache.get(short_code)
        if cached:
            return cached

        # Cache miss - hit database
        partition = get_partition(short_code, NUM_PARTITIONS)
        db = get_db_connection(partition)
        url = db.query("SELECT original_url FROM urls WHERE short_code = ?", short_code)

        # Cache for next time
        self.cache.setex(short_code, self.cache_ttl, url)
        return url

4.5 Complete Architecture

Putting it all together:

                           Load Balancer
                                 │
                 ┌───────────────┼───────────────┐
                 ▼               ▼               ▼
           ┌──────────┐   ┌──────────┐   ┌──────────┐
           │  App 1   │   │  App 2   │   │  App 3   │
           └────┬─────┘   └────┬─────┘   └────┬─────┘
                │              │              │
                └──────────────┼──────────────┘
                               │
                        ┌──────▼──────┐
                        │    Redis    │
                        │   (Cache)   │
                        └──────┬──────┘
                               │ cache miss
                ┌──────────────┼──────────────┐
                ▼              ▼              ▼
         ┌───────────┐  ┌───────────┐  ┌───────────┐
         │Partition 0│  │Partition 1│  │Partition 2│
         │  Primary  │  │  Primary  │  │  Primary  │
         └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
               │              │              │
         ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
         │ Replica 0 │  │ Replica 1 │  │ Replica 2 │
         └───────────┘  └───────────┘  └───────────┘

Flow for redirect:

  1. Request hits load balancer
  2. Any app server can handle it (stateless)
  3. App checks Redis cache
  4. On cache miss, computes partition from short_code
  5. Queries appropriate partition (read from replica for availability)
  6. Caches result, returns to user

Flow for create:

  1. Generate unique short_code
  2. Compute partition
  3. Write to partition primary
  4. Return short_code to user
  5. (Don't pre-cache—let first access populate cache)
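
A minimal sketch of the create path, assuming a hypothetical collision-checked random code generator and the same partition helpers as before:

import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # base62

def generate_short_code(length: int = 7) -> str:
    # Hypothetical generator: random base62 string; collisions are caught by a
    # unique constraint on insert and retried
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def create_url(original_url: str) -> str:
    short_code = generate_short_code()
    partition = get_partition(short_code, NUM_PARTITIONS)
    db = get_db_connection(partition)  # the primary for that partition
    db.execute(  # hypothetical execute on the partition's connection
        "INSERT INTO urls (short_code, original_url) VALUES (?, ?)",
        short_code, original_url,
    )
    return short_code  # no pre-caching: the first redirect warms the cache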

Part III: Consistent Hashing Deep-Dive

Chapter 5: When and Why to Use Consistent Hashing

We've mentioned consistent hashing as a solution to the resharding problem. Let's explore when it's worth the complexity.

5.1 The Decision Framework

Use simple modulo hashing when:

  • Number of partitions is fixed or changes very rarely
  • You can afford downtime or degraded performance during resharding
  • Simplicity is a priority

Use consistent hashing when:

  • Partitions are added/removed frequently (auto-scaling)
  • You cannot afford the data movement of full resharding
  • Partition failures should only affect a subset of keys

5.2 Consistent Hashing for the URL Shortener

Let's evaluate consistent hashing for our URL shortener:

Resharding frequency: Low. You might add partitions once a quarter as you grow.

Data movement cost: Medium. Migrating millions of URLs takes hours but is doable in a maintenance window.

Partition failures: With replication, partition failures don't cause data loss. Consistent hashing helps with faster recovery.

Verdict: Consistent hashing is nice to have but not essential for this use case. The added complexity might not be worth it for a typical URL shortener.

When it would be essential:

  • If you're using a distributed cache layer (like Memcached cluster) where nodes come and go frequently
  • If you're running on spot instances that can be terminated anytime
  • If your read pattern is so heavy that any partition being unavailable causes unacceptable load on others

5.3 Implementing Consistent Hashing

If you do need consistent hashing, here's a practical implementation:

import bisect
import hashlib
from typing import Dict, List, Optional

class ConsistentHash:
    def __init__(self, nodes: Optional[List[str]] = None, virtual_nodes: int = 150):
        self.virtual_nodes = virtual_nodes
        self.ring: List[int] = []  # sorted list of hash values
        self.ring_to_node: Dict[int, str] = {}  # hash value → node name
        
        if nodes:
            for node in nodes:
                self.add_node(node)
    
    def _hash(self, key: str) -> int:
        """Generate a hash value for a key."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
    
    def add_node(self, node: str) -> None:
        """Add a node with its virtual nodes to the ring."""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            hash_value = self._hash(virtual_key)
            
            # Insert into sorted ring
            bisect.insort(self.ring, hash_value)
            self.ring_to_node[hash_value] = node
    
    def remove_node(self, node: str) -> None:
        """Remove a node and its virtual nodes from the ring."""
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:{i}"
            hash_value = self._hash(virtual_key)
            
            self.ring.remove(hash_value)
            del self.ring_to_node[hash_value]
    
    def get_node(self, key: str) -> str:
        """Find the node responsible for a key."""
        if not self.ring:
            raise Exception("No nodes in ring")
        
        hash_value = self._hash(key)
        
        # Find first node with hash >= key's hash
        idx = bisect.bisect_right(self.ring, hash_value)
        
        # Wrap around to first node if we're past the end
        if idx == len(self.ring):
            idx = 0
        
        return self.ring_to_node[self.ring[idx]]


# Usage
ring = ConsistentHash(["partition-0", "partition-1", "partition-2"])

# All lookups go to their designated partition
print(ring.get_node("url_abc"))  # → "partition-1"
print(ring.get_node("url_def"))  # → "partition-0"
print(ring.get_node("url_ghi"))  # → "partition-2"

# Add a new partition - only some keys move
ring.add_node("partition-3")
print(ring.get_node("url_abc"))  # → might change to "partition-3"
print(ring.get_node("url_def"))  # → probably stays "partition-0"

Part IV: Discussion and Trade-offs

Chapter 6: The Hard Questions

These are the questions you should be asking (and answering) in your design sessions.

6.1 "What if one short URL goes viral and gets 1M hits/sec?"

This is the hot key problem we discussed. Your answer should include:

  1. Detection: How do you know a key is hot? (Metrics, anomaly detection)
  2. Mitigation: Caching is your first line of defense. A viral URL should be served from cache 99.9% of the time.
  3. Fallback: If cache fails, you need read replicas or key splitting.

Good answer: "We cache aggressively. A viral URL would be cached at the CDN edge, then in our Redis layer. Even at 1M hits/sec, the database partition would only see cache misses—maybe 100 requests/second if our cache hit rate is 99.99%."

6.2 "Would you use consistent hashing here? Why or why not?"

This is a trade-off question. There's no universally right answer.

Arguments for:

  • If we're auto-scaling partitions frequently
  • If we want to minimize data movement during scaling events
  • If we're using technologies that already implement it (like Redis Cluster)

Arguments against:

  • Added complexity in routing logic
  • Our scaling events are infrequent and planned
  • Simple modulo hashing is easier to debug
  • With good caching, we can tolerate brief inconsistencies during resharding

Good answer: "For a URL shortener, I'd start with simple modulo hashing. We can plan our scaling events, and the data movement during resharding is manageable. If we later find we're scaling frequently or can't afford the migration windows, we can migrate to consistent hashing. I wouldn't prematurely optimize for a problem we might not have."

6.3 "How do you handle partition failures?"

Every partition will fail eventually. Your design must handle this.

For reads:

  • Read replicas allow continued service
  • If all replicas for a partition fail, that partition's data is temporarily unavailable
  • Cache can serve hot keys during brief outages

For writes:

  • Writes go to the primary
  • If primary fails, you need failover to a replica
  • During failover, writes to that partition may fail or queue

Good answer: "Each partition has a primary and at least one replica. Reads go to replicas, so primary failure doesn't affect reads. For writes, we use automatic failover—if the primary is unresponsive for 10 seconds, a replica promotes itself. We might lose a few writes during the failover window, but for a URL shortener, that's acceptable. We're not a bank."

6.4 "How do you add a new partition without downtime?"

This is an operational question, but it affects your design.

With simple modulo hashing:

  1. Spin up new partition
  2. Calculate which keys need to move (those where hash(key) % new_count != hash(key) % old_count)
  3. Copy those keys to their new partitions (background process)
  4. Atomically switch routing to use new partition count
  5. Clean up old copies

The tricky part: During migration, a key might be on the old partition, the new partition, or in transit. Your routing logic must handle this:

def get_url(short_code: str) -> str:
    if is_migration_in_progress():
        # Try new partition first
        new_partition = get_partition_new(short_code)
        result = try_query(new_partition, short_code)
        if result:
            return result
        
        # Fall back to old partition
        old_partition = get_partition_old(short_code)
        return try_query(old_partition, short_code)
    else:
        partition = get_partition(short_code)
        return query(partition, short_code)

With consistent hashing:

  1. Add new node to the ring
  2. Keys that now belong to the new node are migrated in the background
  3. During migration, check both old and new owners for reads
  4. Fewer keys move, so migration is faster

Chapter 7: Session Summary

What You Should Know Now

After this session, you should be able to:

  1. Explain the three partitioning strategies (hash, range, directory) with their trade-offs
  2. Choose a partition key based on access patterns
  3. Identify when each strategy breaks and have mitigation strategies ready
  4. Explain consistent hashing and when it's worth the complexity
  5. Design a partitioned system (URL shortener) with explicit failure handling

Key Trade-offs to Remember

Decision                        Trade-off
Hash vs range                   Even distribution vs range queries
Simple vs consistent hashing    Simplicity vs minimal resharding movement
More partitions                 Better parallelism vs more coordination overhead
Caching                         Reduced DB load vs consistency complexity
Replication                     Availability vs write latency and consistency

Questions to Ask in Every Design

  1. What is the access pattern? (Single-key lookup? Range? Aggregation?)
  2. What is the key that appears in most queries?
  3. How skewed is the data/traffic distribution?
  4. How often do we expect to add/remove partitions?
  5. What happens when a partition fails?

Part V: Interview Questions and Answers

Chapter 8: Real-World Interview Scenarios

This section covers the types of partitioning questions you'll face in system design interviews at top tech companies. Each question includes the interviewer's intent, a structured approach, and a strong answer.


8.1 Conceptual Questions

Question 1: "Explain the difference between horizontal and vertical partitioning. When would you use each?"

Interviewer's Intent: Testing fundamental understanding and ability to make trade-off decisions.

Strong Answer:

"Vertical partitioning splits a table by columns—you might put frequently accessed columns in one partition and rarely accessed or large columns (like BLOBs) in another. It's useful when you have wide tables with different access patterns per column group. For example, a user table might have profile_data accessed constantly but profile_photo accessed rarely—vertical partitioning keeps the hot data compact.

Horizontal partitioning (sharding) splits by rows—each partition holds a subset of rows based on some key. This is what you use when your data volume or write throughput exceeds what one machine can handle.

In practice, I'd use vertical partitioning first for optimization within a single database, and horizontal partitioning when I genuinely need to scale beyond one machine. Most systems I've worked on needed horizontal partitioning long before vertical became relevant—the row count was the problem, not the column width."


Question 2: "What are the pros and cons of hash partitioning versus range partitioning?"

Interviewer's Intent: Checking if you understand when each strategy fails.

Strong Answer:

"Hash partitioning gives you even data distribution—a good hash function spreads keys uniformly across partitions. The downside is you lose ordering. Range queries become scatter-gather operations across all partitions.

Range partitioning preserves order—all keys from A-F on one partition means you can efficiently query that range. But you risk hot spots. If keys are auto-incrementing or time-based, all writes hit one partition.

The decision depends on access patterns. For a key-value store where you always lookup by exact key—hash partitioning. For time-series data where you query by time ranges—range partitioning, but you'll need to handle the hot partition problem, maybe with additional sub-partitioning or load balancing.

A real example: at my previous company, we used hash partitioning for user data (lookup by user_id) but range partitioning for analytics events (query by time windows). Different access patterns, different strategies."


Question 3: "What is consistent hashing and why is it used?"

Interviewer's Intent: Testing knowledge of distributed systems fundamentals.

Strong Answer:

"Consistent hashing solves the resharding problem. With naive modulo hashing, adding a partition changes where almost every key maps—you'd have to migrate most of your data. Consistent hashing minimizes this movement.

The idea is to map both keys and nodes onto a ring. A key belongs to the first node you encounter walking clockwise from its position. When you add a node, only keys between the new node and its predecessor need to move—roughly 1/N of the data, and only from one existing node.

Virtual nodes improve balance. Instead of one point per physical node, you place hundreds of virtual nodes per physical node around the ring. This smooths out the distribution.

I'd use consistent hashing when I expect frequent scaling—like a distributed cache where nodes come and go—or when the cost of data migration is high. For a more stable database cluster that scales quarterly, simpler modulo hashing might be fine because I can plan migrations during maintenance windows."


Question 4: "How do you handle hot partitions?"

Interviewer's Intent: Checking practical experience with real distributed systems problems.

Strong Answer:

"Hot partitions happen when traffic or data concentrates on specific keys. Detection comes first—you need metrics on partition load: QPS, CPU, queue depth. Anomaly detection or simple threshold alerts can identify hot partitions.

For read-heavy hot spots, caching is the first defense. A hot key served from Redis or even in-memory cache doesn't hit the database. Second option is read replicas specifically for the hot partition.

For write-heavy hot spots, you can split the hot key artificially. If user_12345 is hot, internally treat it as user_12345:0 through user_12345:9, spreading writes across ten partitions. Reads aggregate across the splits.

If the partition itself is hot (not just one key), you can split the partition. With range partitioning, divide the range. With hash partitioning, you'd need to reshard.

A practical example: Instagram handles hot celebrity accounts by fan-out-on-read for most users but fan-out-on-write for users following celebrities. They essentially treat hot accounts differently in their partitioning strategy."


8.2 Design Questions

Question 5: "Design a key-value store that can handle 1 million writes per second."

Interviewer's Intent: Testing end-to-end system design with partitioning at the core.

Approach:

  1. Calculate partition requirements
  2. Choose partitioning strategy
  3. Design write path
  4. Address failure handling

Strong Answer:

"First, let me size this. If we assume each partition can handle 10K writes/sec (conservative for SSDs with batching), we need 100 partitions minimum. I'd plan for 200 to handle bursts and growth.

I'd use hash partitioning by key since our access pattern is key-based. Consistent hashing makes sense here because at this scale, we'll likely add/remove nodes frequently.

For the write path: client computes partition from key hash, sends write directly to that partition's leader. Each partition uses a write-ahead log for durability—append-only, sequential I/O is fast. The leader replicates to followers asynchronously for availability.

We batch writes—clients buffer for 1-5ms before sending, and the partition batches writes to disk. This amortizes I/O overhead.

For failures: if a partition leader dies, followers elect a new leader. We might lose a few milliseconds of async-replicated writes. If that's unacceptable, we'd do synchronous replication to at least one follower before acknowledging, trading latency for durability.

Consistent hashing means adding capacity is smooth—spin up new partitions, they take ownership of their key range, data migrates in the background."


Question 6: "How would you partition a social media feed system?"

Interviewer's Intent: Testing ability to choose partition keys with multiple access patterns.

Strong Answer:

"The core tension is between two access patterns: 'show me my feed' (read by viewer_id) and 'fan out a new post' (write by author_id, read by all followers).

Option 1: Partition by author_id. All of a user's posts are on one partition. Creating a post is fast—single partition write. But reading a feed requires querying every partition where you follow someone.

Option 2: Partition by viewer_id. Each user's feed is on one partition. Reads are fast—single partition. But posting requires writing to potentially millions of partitions if you have millions of followers.

Option 3: Hybrid approach. Small users (< 10K followers) use fan-out-on-write—when they post, we write to all followers' partitions. Large users (celebrities) use fan-out-on-read—we only store the post once and merge it into feeds at read time.

I'd go with Option 3. For partitioning the feed storage itself, I'd partition by viewer_id since reads are more frequent than writes. Each partition stores feeds for a subset of users.

The feed table would be: (viewer_id, post_id, timestamp, author_id, ...) partitioned by hash of viewer_id. When someone opens their app, we hit one partition, get their precomputed feed, and merge in any celebrity posts in the application layer."
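
A minimal sketch of that read path, assuming the same hypothetical partition helpers as earlier plus a fetch_celebrity_posts helper for the fan-out-on-read side:

def read_feed(viewer_id: str, limit: int = 50) -> list:
    # Precomputed feed rows live on one partition, chosen by hash(viewer_id)
    partition = get_partition(viewer_id, NUM_PARTITIONS)
    db = get_db_connection(partition)
    feed = db.query(
        "SELECT * FROM feeds WHERE viewer_id = ? ORDER BY timestamp DESC LIMIT ?",
        viewer_id, limit,
    )
    # Celebrity posts are pulled at read time (fan-out-on-read) and merged in
    celebrity = fetch_celebrity_posts(viewer_id, limit)  # hypothetical helper
    merged = sorted([*feed, *celebrity], key=lambda row: row.timestamp, reverse=True)
    return merged[:limit]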


Question 7: "Design a distributed counter that can handle 100K increments per second for millions of different counters."

Interviewer's Intent: Testing understanding of partitioning combined with consistency trade-offs.

Strong Answer:

"First, the constraints. 100K increments/second globally, millions of counters. Most counters are cold (few increments), some are very hot (viral content).

Naive approach: partition counters by counter_id using hash partitioning. Each increment goes to the right partition, which atomically increments. Problem: a hot counter overwhelms one partition.

Better approach: approximate counting with partition-local counters. Each partition maintains a local count. Periodically (every second or every N increments), local counts are aggregated to a global count. Reads return a slightly stale but much more available count.

For exact counting under high load: use a write-ahead log per partition. Increments append to the log. A background process aggregates. Reads of hot counters could hit a cache that's updated by the aggregator.

Even better for hot counters specifically: detect hot counters (incrementing > 1000/sec), and for those, switch to probabilistic counting (HyperLogLog for uniques, or just accept eventual consistency with wider aggregation windows).

The partition key would be counter_id. I'd use consistent hashing so we can add capacity for hot counters. The data model per partition:

counter_id -> {current_value, pending_increments[], last_aggregated_at}

Pending increments get batched and applied, reducing write amplification."
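
A minimal sketch of the partition-local buffering described above, with the durable store behind the flush step left abstract:

import threading
import time
from collections import defaultdict

class LocalCounterShard:
    """Buffers increments in memory and periodically folds them into totals."""

    def __init__(self, flush_interval: float = 1.0):
        self.pending: dict[str, int] = defaultdict(int)
        self.totals: dict[str, int] = defaultdict(int)
        self.lock = threading.Lock()
        self.flush_interval = flush_interval

    def increment(self, counter_id: str, delta: int = 1) -> None:
        with self.lock:
            self.pending[counter_id] += delta

    def read(self, counter_id: str) -> int:
        # Slightly stale: pending increments are not yet reflected
        return self.totals[counter_id]

    def flush_forever(self) -> None:
        # In a real system each batch would also be written to the durable store
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                for counter_id, delta in self.pending.items():
                    self.totals[counter_id] += delta
                self.pending.clear()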


8.3 Scenario-Based Questions

Question 8: "You have a database partitioned by user_id. A new requirement comes in to query all orders for a given product_id. How do you handle this?"

Interviewer's Intent: Testing real-world problem solving when partitioning doesn't match access patterns.

Strong Answer:

"This is the secondary index problem. Our primary partition key is user_id, but we need to query by product_id.

Option 1: Scatter-gather. Query all partitions, filter by product_id, merge results. Works for infrequent queries. Doesn't scale for high-QPS queries.

Option 2: Local secondary index. Each partition maintains an index of product_id → orders for that partition. Query still hits all partitions, but each partition query is indexed. Better than full scans, still O(partitions).

Option 3: Global secondary index. Separate index partitioned by product_id. Each entry points to (user_id, order_id). Query hits one partition of the index, then fetches from relevant user partitions. Two hops but predictable performance.

Option 4: Denormalization. Store orders twice: once partitioned by user_id (for user queries), once partitioned by product_id (for product queries). Writes are doubled, but reads are single-partition.

I'd choose based on query frequency. If product_id queries are rare (admin reports), scatter-gather is fine. If they're frequent (real-time product analytics), I'd build a global secondary index or denormalize.

The trade-off is write amplification versus read performance. For an e-commerce system, I'd probably denormalize—write once to the user-partitioned table, queue an async job to update the product-partitioned view. Eventual consistency is acceptable for product analytics."


Question 9: "Your partition is running out of disk space but you can't add more partitions right now. What do you do?"

Interviewer's Intent: Testing operational problem-solving and understanding of data lifecycle.

Strong Answer:

"Short-term triage first. Can I add disk to this specific partition? If it's a cloud volume, resize it. If it's physical, add a disk and extend the filesystem or mount point.

If I can't add disk, I need to reduce data:

  1. Archive old data: Move cold data to cheaper storage (S3, glacier). Partition by time if possible so old partitions are entirely archivable.

  2. Compression: Enable or increase compression. Some databases support transparent compression. Trade CPU for disk.

  3. TTL enforcement: If data has expiration, make sure TTL cleanup is running. Sometimes cleanup jobs fall behind.

  4. Delete soft-deleted data: If we have logical deletes, hard-delete old tombstoned records.

  5. Split the partition: Emergency resharding. Add a new partition, move half the data. Painful but sometimes necessary.

For the root cause: why is one partition out of space? Data skew? Unexpected growth? I'd investigate whether we need to rebalance or if the partition key is causing uneven distribution.

Longer-term, I'd set up monitoring and alerting for partition size. Trigger alerts at 70% capacity, pages at 85%, so we have time to react before hitting limits."


Question 10: "During a resharding operation, how do you handle requests for data that might be in transit between partitions?"

Interviewer's Intent: Testing understanding of consistency during migrations.

Strong Answer:

"This is about maintaining correctness during the transition. The key insight is that during migration, a key might exist on the old partition, the new partition, or be in-flight.

Approach 1: Dual reads. For any key that might be migrating, read from both old and new partitions. Return the newer version (by timestamp or version number). Writes go to the new partition.

def read_during_migration(key):
    if is_migrating(key):
        old_result = read_from_old_partition(key)
        new_result = read_from_new_partition(key)
        return newer_of(old_result, new_result)
    else:
        return read_from_current_partition(key)

Approach 2: Forwarding. Old partition knows it doesn't own the key anymore, forwards to new partition. New partition handles the request (fetching from old if needed).

Approach 3: Freeze-and-copy. Stop writes to migrating keys, copy data, switch routing, resume writes. Simple but causes brief write unavailability.

I prefer Approach 1 for read-heavy workloads. The overhead of dual reads is acceptable for a bounded migration period.

For writes during migration, I'd route to the new partition immediately and have the new partition pull any existing data from the old partition on first write if needed.

The migration state machine would be:

  1. Mark key range as migrating
  2. Copy data from old to new (background)
  3. Dual-read, write-to-new
  4. When copy complete, mark as migrated
  5. Cleanup old partition"

8.4 Deep-Dive Questions

Question 11: "How does DynamoDB handle partitioning, and what are its limitations?"

Interviewer's Intent: Testing knowledge of real-world systems.

Strong Answer:

"DynamoDB partitions by the partition key (or hash key). Each partition can hold about 10GB of data and handle about 3,000 RCU or 1,000 WCU (read/write capacity units).

The clever part is automatic splitting. When a partition gets too big or too hot, DynamoDB splits it transparently. You don't manage partition counts.

But this creates the hot partition problem. If one partition key is hot, it's stuck on one partition. DynamoDB has adaptive capacity that can borrow capacity from cold partitions, but there's still a ceiling per partition.

Limitations:

  1. Partition key design is critical. A bad partition key (like a boolean or low-cardinality field) means all your data lands on few partitions.
  2. No cross-partition transactions (until recently with TransactWriteItems, which has overhead).
  3. Eventual consistency by default. Strong consistency costs double the RCU.
  4. GSI propagation is async. If you query a global secondary index right after writing, you might not see the update.

I'd use DynamoDB for truly key-value access patterns with high cardinality partition keys. For relational queries or strong consistency needs, I'd look at PostgreSQL (Aurora if I need scaling) or CockroachDB."


Question 12: "Compare partitioning strategies in PostgreSQL, MongoDB, and Cassandra."

Interviewer's Intent: Testing breadth of knowledge across database types.

Strong Answer:

"PostgreSQL (native partitioning): Declarative partitioning since version 10. Supports range, list, and hash partitioning. The query planner uses partition pruning—if your WHERE clause matches the partition key, it only scans relevant partitions. But it's single-node by default. For distributed PostgreSQL, you need Citus (which adds hash distribution across workers) or CockroachDB.

MongoDB: Sharding by a shard key. Supports both hashed and range-based shard keys. The mongos router directs queries to the right shard. Chunks (ranges of shard key values) can be split and migrated between shards automatically. The limitation is that the shard key is immutable—choose wrong initially and you're stuck. Also, queries that don't include the shard key go to all shards.

Cassandra: Partitions by the partition key using consistent hashing. Each node owns a token range on the ring. The partition key determines the token, which determines the node. Cassandra is designed for this—every table must have a partition key, and queries without the partition key are discouraged (require ALLOW FILTERING). The trade-off is flexibility: you design tables around query patterns, often duplicating data.

Summary:

  • PostgreSQL: Traditional RDBMS, partitioning for manageability and query optimization
  • MongoDB: Document store, sharding for horizontal scale with some relational features
  • Cassandra: Distributed-first, extreme write throughput, but strict data modeling requirements

I'd pick PostgreSQL for complex queries and ACID needs, MongoDB for flexible documents with moderate scale, Cassandra for massive write throughput with well-defined access patterns."


Question 13: "What is the difference between partitioning and sharding?"

Interviewer's Intent: Clarifying terminology (often used interchangeably, but distinctions exist).

Strong Answer:

"The terms are often used interchangeably, but technically there's a distinction:

Partitioning is the general concept of dividing data into subsets. It can be within a single database instance—like PostgreSQL table partitioning where all partitions are on the same server but stored in separate files for manageability.

Sharding specifically refers to horizontal partitioning across multiple database instances or servers. Each shard is a separate database that can run on different hardware.

So all sharding is partitioning, but not all partitioning is sharding.

In practice:

  • 'Partitioned table' in PostgreSQL → single server, multiple physical files
  • 'Sharded database' → multiple servers, each holding a subset of data

The distinction matters for:

  • Queries: Partitioned tables on one server can still do efficient joins. Sharded data across servers makes joins expensive (need to shuffle data).
  • Transactions: Partitions on one server can share a transaction. Shards need distributed transactions.
  • Administration: Partitions are managed by one database. Shards need coordination across multiple databases.

In conversation, if someone says 'partitioned' and means across servers, I'd clarify whether they mean logical partitioning (query optimization) or physical sharding (horizontal scale)."


8.5 Curveball Questions

Question 14: "Can you have too many partitions? What are the downsides?"

Interviewer's Intent: Testing nuanced understanding—more isn't always better.

Strong Answer:

"Absolutely. More partitions means more overhead:

  1. Metadata overhead: Each partition needs tracking—where it lives, its health, its size. Thousands of partitions mean megabytes of metadata that needs to be consistent across the cluster.

  2. Connection overhead: If your application connects to all partitions, that's N connections per app server. With 1000 partitions and 100 app servers, you're managing 100,000 connections.

  3. Query overhead: For queries that can't be pruned to specific partitions, you're doing N sub-queries instead of one. Coordinator overhead grows linearly.

  4. Replication overhead: If each partition has replicas, total machines = partitions × replication factor. 100 partitions × 3 replicas = 300 partition-replicas to manage.

  5. Rebalancing complexity: Moving data during rebalancing is proportional to partition count. More partitions, more moves.

The right number depends on your data size and throughput. Rules of thumb:

  • Each partition should hold at least 1GB (small partitions waste resources)
  • Each partition should handle at least 100 QPS (otherwise you're over-partitioned)
  • Start with fewer partitions than you think you need; splitting is easier than merging

I've seen teams over-partition preemptively—'we might need 100 shards someday'—and suffer the overhead for years while their data fit on 5 shards."


Question 15: "A junior engineer suggests partitioning by the hash of (user_id + timestamp) to spread load. What's wrong with this approach?"

Interviewer's Intent: Testing ability to catch subtle design flaws.

Strong Answer:

"The problem is that you've lost the ability to query by user_id alone. If you hash (user_id + timestamp), each combination maps to a different partition. To get all data for user_123, you'd need to know every timestamp and query every partition.

You've essentially created a random partition key. You'll have excellent load distribution but terrible query performance for any practical access pattern.

What they probably wanted was to avoid hot spots for users with high activity. Better approaches:

  1. Partition by user_id, sub-partition by time: User data is co-located, but within a user's partition, you can time-range prune.

  2. Add salt for hot users only: Most users use hash(user_id). Detected hot users use hash(user_id + salt) where salt cycles 0-9, spreading their data across 10 partitions.

  3. Composite key with user_id first: hash(user_id) determines partition. (user_id, timestamp) is the full primary key for ordering within the partition.

The key insight is that your partition key should match your most common query pattern. If you always query by user, partition by user. Timestamp adds value as a sort key within the partition, not as part of the partition key."


8.6 Take-Home Assignment Style

Question 16: "Design the partitioning strategy for a ride-sharing app like Uber."

Interviewer's Intent: Testing comprehensive system thinking.

Strong Answer:

"Let me break this into the main data entities and their access patterns:

1. Ride Requests (hot, real-time)

  • Access: By rider_id (my rides), by driver_id (driver's rides), by geography (match riders to nearby drivers)
  • Strategy: Geospatial partitioning for matching. Use geo-hashing (like S2 cells or H3 hexagons). Each partition owns a geographic region. This co-locates riders and drivers in the same area.
  • Challenge: City centers are hot. Sub-partition dense areas or replicate hot regions.

2. User Profiles (rider + driver)

  • Access: By user_id
  • Strategy: Hash partition by user_id. Standard key-value access pattern.

3. Ride History

  • Access: By rider_id (my past rides), by driver_id (their past rides), by time range (analytics)
  • Strategy: Partition by user_id (could be rider or driver, pick one as primary). For cross-user analytics, replicate to an analytics-optimized store partitioned by time.

4. Real-time Location (extremely hot)

  • Access: Driver locations updated every few seconds. Queries for nearby drivers.
  • Strategy: This isn't traditional partitioning—use a specialized spatial index like Redis with geospatial commands, or a purpose-built system. Partition by geo-region if you need to scale beyond one node.

5. Payments

  • Access: By ride_id, by user_id
  • Strategy: Partition by user_id for user-facing queries. Separate payment_id-partitioned store for the payment processor that needs to look up by transaction.

Architecture sketch:

┌─────────────────────────────────────────────────┐
│                   API Gateway                   │
└───────────────────────┬─────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Matching  │  │    User     │  │   Ride      │
│   Service   │  │   Service   │  │   History   │
│  (geo-part) │  │ (user-part) │  │ (user-part) │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       ▼                ▼                ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Location  │  │   User DB   │  │  Rides DB   │
│   Index     │  │(hash by id) │  │(hash by uid)│
│  (geo-hash) │  │             │  │             │
└─────────────┘  └─────────────┘  └─────────────┘

Key insight: Different data has different partitioning needs. Don't force one strategy across everything."


Chapter 9: Interview Preparation Checklist

Before your interview, make sure you can:

Concepts

  • Explain hash vs range vs directory partitioning in 2 minutes
  • Describe consistent hashing with virtual nodes
  • List 3 ways to handle hot partitions
  • Explain the resharding problem and solutions

Designs

  • Design a partitioned key-value store from scratch
  • Choose partition keys for multi-access-pattern systems
  • Handle secondary indexes across partitions

Trade-offs

  • Articulate when to use each partitioning strategy
  • Explain consistency vs availability in partitioned systems
  • Discuss when consistent hashing is worth the complexity

Operations

  • Describe a zero-downtime resharding process
  • Explain how to detect and mitigate hot partitions
  • Handle queries during partition migrations

Real Systems

  • Know how DynamoDB/Cassandra/MongoDB approach partitioning
  • Cite real examples (Discord, Instagram, Uber) of partitioning decisions

Exercises

Exercise 1: Partition Key Selection

For each system, identify the best partition key and strategy:

  1. E-commerce orders: Need to query by order_id (most common), by customer_id, and by date range for analytics
  2. Chat messages: Need to query all messages in a conversation, and all conversations for a user
  3. IoT sensor readings: Need to query readings for a sensor in a time range

Exercise 2: Hot Key Mitigation

Design a hot key detection and mitigation system for a social media post service where:

  • 99.9% of posts get < 100 reads/day
  • 0.1% of posts get > 1M reads/day
  • You need to detect and handle hot posts within 60 seconds of going viral

Exercise 3: Resharding Plan

You have a URL shortener with 4 partitions, each holding 250M URLs. You need to scale to 8 partitions. Write a detailed migration plan that:

  • Minimizes downtime
  • Handles reads and writes during migration
  • Has rollback capability

Further Reading

These resources go deeper on specific topics:

  • "Designing Data-Intensive Applications" Chapter 6: Partitioning (The definitive reference)
  • The Dynamo paper: "Dynamo: Amazon's Highly Available Key-value Store" (DynamoDB's predecessor; original consistent hashing in practice)
  • Discord's Message Storage: How they partition by channel_id and handle hot channels
  • Slack's Flannel: Their edge caching and partitioning approach

Appendix: Code Reference

A.1 Complete URL Shortener Partition Logic

from typing import Optional
import hashlib
import redis
import psycopg2
from dataclasses import dataclass
from datetime import datetime

@dataclass
class URLMapping:
    short_code: str
    original_url: str
    created_at: datetime
    creator_id: Optional[str] = None

class PartitionedURLShortener:
    def __init__(self, num_partitions: int, partition_configs: list[dict]):
        self.num_partitions = num_partitions
        self.connections = [
            psycopg2.connect(**config) 
            for config in partition_configs
        ]
        self.cache = redis.Redis()
        self.cache_ttl = 3600
    
    def _get_partition(self, short_code: str) -> int:
        """Compute partition for a short code using hash."""
        hash_bytes = hashlib.md5(short_code.encode()).digest()
        hash_int = int.from_bytes(hash_bytes[:4], byteorder='big')
        return hash_int % self.num_partitions
    
    def create(self, short_code: str, original_url: str, creator_id: Optional[str] = None) -> URLMapping:
        """Create a new short URL."""
        partition = self._get_partition(short_code)
        conn = self.connections[partition]
        
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO urls (short_code, original_url, creator_id, created_at)
                VALUES (%s, %s, %s, NOW())
                RETURNING created_at
                """,
                (short_code, original_url, creator_id)
            )
            created_at = cur.fetchone()[0]
            conn.commit()
        
        return URLMapping(
            short_code=short_code,
            original_url=original_url,
            created_at=created_at,
            creator_id=creator_id
        )
    
    def redirect(self, short_code: str) -> Optional[str]:
        """Get the original URL for a short code."""
        # Try cache first
        cached = self.cache.get(f"url:{short_code}")
        if cached:
            return cached.decode('utf-8')
        
        # Cache miss - query database
        partition = self._get_partition(short_code)
        conn = self.connections[partition]
        
        with conn.cursor() as cur:
            cur.execute(
                "SELECT original_url FROM urls WHERE short_code = %s",
                (short_code,)
            )
            result = cur.fetchone()
        
        if result:
            original_url = result[0]
            # Cache for next time
            self.cache.setex(f"url:{short_code}", self.cache_ttl, original_url)
            return original_url
        
        return None
    
    def delete(self, short_code: str) -> bool:
        """Delete a short URL."""
        # Invalidate cache
        self.cache.delete(f"url:{short_code}")
        
        # Delete from database
        partition = self._get_partition(short_code)
        conn = self.connections[partition]
        
        with conn.cursor() as cur:
            cur.execute(
                "DELETE FROM urls WHERE short_code = %s",
                (short_code,)
            )
            deleted = cur.rowcount > 0
            conn.commit()
        
        return deleted

A.2 Hot Key Detection

import time
from collections import defaultdict
from dataclasses import dataclass
from threading import Lock
from typing import Optional

@dataclass
class HotKeyAlert:
    key: str
    count: int
    window_seconds: int
    detected_at: float

class SlidingWindowHotKeyDetector:
    def __init__(
        self, 
        window_seconds: int = 60,
        threshold: int = 10000,
        cleanup_interval: int = 10
    ):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.cleanup_interval = cleanup_interval
        
        # key -> list of timestamps
        self.access_times: dict[str, list[float]] = defaultdict(list)
        self.lock = Lock()
        self.last_cleanup = time.time()
    
    def record_access(self, key: str) -> Optional[HotKeyAlert]:
        """Record an access and check if key is hot."""
        now = time.time()
        
        with self.lock:
            # Maybe cleanup old entries
            if now - self.last_cleanup > self.cleanup_interval:
                self._cleanup(now)
                self.last_cleanup = now
            
            # Add this access
            self.access_times[key].append(now)
            
            # Count accesses in window
            cutoff = now - self.window_seconds
            recent = [t for t in self.access_times[key] if t > cutoff]
            self.access_times[key] = recent
            
            if len(recent) >= self.threshold:
                return HotKeyAlert(
                    key=key,
                    count=len(recent),
                    window_seconds=self.window_seconds,
                    detected_at=now
                )
        
        return None
    
    def _cleanup(self, now: float) -> None:
        """Remove old entries to prevent memory growth."""
        cutoff = now - self.window_seconds
        
        keys_to_remove = []
        for key, times in self.access_times.items():
            recent = [t for t in times if t > cutoff]
            if recent:
                self.access_times[key] = recent
            else:
                keys_to_remove.append(key)
        
        for key in keys_to_remove:
            del self.access_times[key]

End of Day 1: Partitioning Deep-Dive

Tomorrow: Day 2 — Replication Trade-offs. We'll extend our URL shortener with read replicas and confront the reality of async replication: what happens when a user creates a URL and their friend clicks it before the replica has caught up?