Week 1 — Day 4: Hot Keys and Skew
System Design Mastery Series
Preface
Days 1-3 taught you how to partition, replicate, and protect your data. We made a convenient assumption throughout: traffic is evenly distributed. Today, we shatter that assumption.
In the real world, data access follows power laws:
- 0.01% of URLs get 90% of clicks
- A celebrity tweet gets 10 million views in an hour
- One product goes viral on TikTok and crashes your inventory system
- A single customer generates 50% of your API traffic
These hot keys break every carefully designed distributed system. A perfectly partitioned database becomes a single-node bottleneck when one partition receives all the traffic.
By the end of this session, you'll understand why hot keys happen, how to detect them before they cause outages, and multiple strategies for mitigating them—from caching to key splitting to architectural redesigns.
Let's begin.
Part I: Foundations
Chapter 1: Understanding Hot Keys and Skew
1.1 What Makes a Key "Hot"?
A key is "hot" when it receives disproportionately more operations than other keys. This creates a bottleneck because:
- With partitioning: All requests for that key go to one partition, overwhelming it while others sit idle
- With caching: a single hot key can outstrip what one cache node can serve, and its expiry triggers a stampede of simultaneous misses
- With rate limiting: The hot key may exhaust quotas intended for many keys
Quantifying "hot": A key is problematically hot when:
- It receives >1000x the average key's traffic
- It consumes >10% of a partition's capacity
- It causes measurable latency degradation for other keys on the same partition
1.2 The Mathematics of Skew: Zipf's Law
Traffic to most systems follows Zipf's Law: the nth most popular item receives approximately 1/n of the traffic of the most popular item.
Rank 1: 1000 requests/sec
Rank 2: 500 requests/sec (1/2)
Rank 3: 333 requests/sec (1/3)
Rank 4: 250 requests/sec (1/4)
...
Rank 100: 10 requests/sec (1/100)
This means:
- The top 1% of keys receive ~30% of traffic
- The top 10% of keys receive ~65% of traffic
- The top 20% of keys receive ~80% of traffic
In extreme cases (viral content, celebrity accounts):
- The top 0.01% of keys receive >90% of traffic
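The exact shares depend on the catalogue size and the Zipf exponent, so treat these percentages as rules of thumb rather than constants. A small sketch, assuming a pure power-law distribution, lets you explore the shares yourself (n_keys and s are assumptions to play with):

# Rough exploration of Zipf-distributed traffic: what share of requests do the
# top keys capture? n_keys and the exponent s are illustrative assumptions.
def zipf_top_share(n_keys: int, top_fraction: float, s: float = 1.0) -> float:
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    top_k = max(1, int(n_keys * top_fraction))
    return sum(weights[:top_k]) / sum(weights)

for frac in (0.0001, 0.01, 0.10, 0.20):
    share = zipf_top_share(100_000, frac)
    print(f"top {frac:.2%} of keys -> {share:.0%} of traffic")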
1.3 Types of Hot Key Scenarios
Scenario 1: Predictable Hot Keys
Some keys are always hot—you know in advance which ones:
- The homepage of a website
- Popular product categories
- Well-known user accounts
These are easier to handle because you can pre-optimize.
Scenario 2: Sudden Hot Keys (Viral Content)
A previously cold key suddenly becomes hot:
- A tweet goes viral
- A URL is shared on Reddit's front page
- A product is featured on a TV show
These are harder because you can't pre-optimize—you must detect and react.
Scenario 3: Temporal Hot Keys
Keys that are hot only during specific times:
- Flash sales (specific product IDs)
- Live events (event IDs during the event)
- Time-based aggregations (current hour's counters)
These are somewhat predictable but require dynamic handling.
Scenario 4: Hot Partitions (Not Just Keys)
Sometimes the problem isn't one key but the partition assignment:
- Range partitioning with auto-increment IDs: newest partition always hot
- Hash collision: multiple unrelated keys hash to same partition
- Correlated access: keys that are always accessed together land on same partition
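As a tiny illustration of the first case, the sketch below uses a made-up four-range layout to show why new auto-increment IDs pile onto the newest range while a hash spreads the same IDs out:

# Made-up four-range layout: with range partitioning on an auto-increment ID,
# every new row lands in the last (newest) range; hashing spreads the same IDs.
RANGES = [(0, 1_000), (1_000, 2_000), (2_000, 3_000), (3_000, 10**9)]

def range_partition(item_id: int) -> int:
    for idx, (lo, hi) in enumerate(RANGES):
        if lo <= item_id < hi:
            return idx
    raise ValueError("id out of range")

def hash_partition(item_id: int, num_partitions: int = 4) -> int:
    return hash(str(item_id)) % num_partitions

new_ids = range(5_000, 5_010)                    # the most recently inserted rows
print([range_partition(i) for i in new_ids])     # all land in partition 3 (the newest)
print([hash_partition(i) for i in new_ids])      # spread across partitions 0-3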
1.4 The Impact of Hot Keys
On latency:
Normal key: P50: 5ms, P99: 20ms
Hot key: P50: 50ms, P99: 500ms (10x worse)
Other keys on same partition: P50: 15ms, P99: 100ms (3-5x worse due to contention)
On availability:
- Hot partition becomes overloaded
- Health checks fail
- Partition marked unhealthy
- Traffic rerouted to replicas (which also become overloaded)
- Cascade failure
On cost:
- Over-provisioning to handle peak hot key traffic
- 10x infrastructure cost for 0.01% of keys
Chapter 2: Detecting Hot Keys
You can't fix what you can't see. Detection is the first step.
2.1 Real-Time Detection
Approach 1: Request Counting with Streaming
Count requests per key in a sliding window. Alert when any key exceeds a threshold.
from collections import defaultdict
import time
import threading
class HotKeyDetector:
def __init__(
self,
window_seconds: int = 60,
hot_threshold: int = 10000,
check_interval: float = 1.0
):
self.window_seconds = window_seconds
self.hot_threshold = hot_threshold
self.check_interval = check_interval
# key -> list of timestamps
self.request_times: dict[str, list[float]] = defaultdict(list)
self.lock = threading.Lock()
self.hot_keys: set[str] = set()
# Start background checker
self._start_checker()
def record(self, key: str) -> bool:
"""Record a request. Returns True if key is hot."""
now = time.time()
with self.lock:
self.request_times[key].append(now)
return key in self.hot_keys
def _check_hot_keys(self):
"""Periodic check for hot keys."""
now = time.time()
cutoff = now - self.window_seconds
with self.lock:
new_hot_keys = set()
keys_to_clean = []
for key, times in self.request_times.items():
# Filter to window
recent = [t for t in times if t > cutoff]
if len(recent) == 0:
keys_to_clean.append(key)
else:
self.request_times[key] = recent
if len(recent) >= self.hot_threshold:
new_hot_keys.add(key)
# Clean up cold keys
for key in keys_to_clean:
del self.request_times[key]
# Detect newly hot keys
newly_hot = new_hot_keys - self.hot_keys
no_longer_hot = self.hot_keys - new_hot_keys
if newly_hot:
self._alert_new_hot_keys(newly_hot)
if no_longer_hot:
self._alert_cooled_keys(no_longer_hot)
self.hot_keys = new_hot_keys
def _alert_new_hot_keys(self, keys: set[str]):
print(f"🔥 NEW HOT KEYS DETECTED: {keys}")
# Send to monitoring system, trigger mitigation
def _alert_cooled_keys(self, keys: set[str]):
print(f"❄️ Keys cooled down: {keys}")
def _start_checker(self):
def run():
while True:
time.sleep(self.check_interval)
self._check_hot_keys()
thread = threading.Thread(target=run, daemon=True)
thread.start()
Problem: This stores every timestamp, which is expensive for high-traffic keys.
Approach 2: Count-Min Sketch (Probabilistic)
Use a probabilistic data structure for approximate counting with fixed memory.
import mmh3 # MurmurHash
import numpy as np
class CountMinSketch:
"""
Probabilistic data structure for frequency estimation.
Uses fixed memory regardless of number of unique keys.
"""
def __init__(self, width: int = 10000, depth: int = 7):
self.width = width
self.depth = depth
self.table = np.zeros((depth, width), dtype=np.int64)
def _hash(self, key: str, seed: int) -> int:
return mmh3.hash(key, seed) % self.width
def increment(self, key: str, count: int = 1):
for i in range(self.depth):
idx = self._hash(key, i)
self.table[i][idx] += count
def estimate(self, key: str) -> int:
"""Returns estimated count (may overestimate, never underestimates)."""
min_count = float('inf')
for i in range(self.depth):
idx = self._hash(key, i)
min_count = min(min_count, self.table[i][idx])
return int(min_count)
def reset(self):
"""Reset all counters (call periodically for time-windowed counting)."""
self.table.fill(0)
class CMSHotKeyDetector:
"""Hot key detection using Count-Min Sketch."""
def __init__(self, hot_threshold: int = 10000, window_seconds: int = 60):
self.cms = CountMinSketch()
self.hot_threshold = hot_threshold
self.window_seconds = window_seconds
# Track potential hot keys for exact counting
self.potential_hot_keys: dict[str, int] = {}
self.confirmed_hot_keys: set[str] = set()
# Periodic reset
self._start_reset_timer()
def record(self, key: str) -> bool:
"""Record request. Returns True if key is confirmed hot."""
self.cms.increment(key)
estimate = self.cms.estimate(key)
# If estimate exceeds threshold, do exact counting
if estimate >= self.hot_threshold * 0.8: # 80% threshold for promotion
self.potential_hot_keys[key] = self.potential_hot_keys.get(key, 0) + 1
if self.potential_hot_keys[key] >= self.hot_threshold:
if key not in self.confirmed_hot_keys:
self.confirmed_hot_keys.add(key)
self._on_hot_key_detected(key)
return True
return key in self.confirmed_hot_keys
def _on_hot_key_detected(self, key: str):
print(f"🔥 HOT KEY: {key}")
def _start_reset_timer(self):
def reset_loop():
while True:
time.sleep(self.window_seconds)
self.cms.reset()
self.potential_hot_keys.clear()
# Keep confirmed_hot_keys until explicitly cleared
threading.Thread(target=reset_loop, daemon=True).start()
Advantages:
- Fixed memory: O(width × depth) regardless of key count
- Fast: O(depth) per operation
- Good enough for detection (some false positives, no false negatives)
2.2 Monitoring and Alerting
Set up dashboards and alerts for:
# Prometheus metrics for hot key monitoring
from prometheus_client import Counter, Histogram, Gauge
# Request count per key (use with care - high cardinality)
key_requests = Counter(
'requests_by_key_total',
'Total requests per key',
['key']
)
# Hot key gauge
hot_key_count = Gauge(
'hot_keys_count',
'Number of currently hot keys'
)
# Request latency by partition
partition_latency = Histogram(
'partition_request_latency_seconds',
'Request latency by partition',
['partition_id'],
buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
# Partition load imbalance
partition_load = Gauge(
'partition_requests_per_second',
'Requests per second per partition',
['partition_id']
)
Alert conditions:
# Prometheus alerting rules
groups:
- name: hot_keys
rules:
- alert: HotKeyDetected
expr: rate(requests_by_key_total[1m]) > 1000
for: 30s
labels:
severity: warning
annotations:
summary: "Hot key detected: {{ $labels.key }}"
- alert: PartitionImbalance
expr: |
max(partition_requests_per_second) /
avg(partition_requests_per_second) > 5
for: 1m
labels:
severity: critical
annotations:
summary: "Partition load imbalance detected"
2.3 Predictive Detection
Don't just react—predict hot keys before they cause problems.
Using Access Patterns
class PredictiveHotKeyDetector:
"""
Predict hot keys based on access pattern acceleration.
A key becoming hot shows rapid growth before hitting threshold.
"""
def __init__(self, growth_threshold: float = 10.0, window_seconds: int = 60):
self.growth_threshold = growth_threshold
self.window_seconds = window_seconds
# Track request rates over time
self.rate_history: dict[str, list[tuple[float, int]]] = defaultdict(list)
def record_rate(self, key: str, current_rate: int):
"""Record current rate for a key."""
now = time.time()
self.rate_history[key].append((now, current_rate))
# Keep only recent history
cutoff = now - self.window_seconds
self.rate_history[key] = [
(t, r) for t, r in self.rate_history[key] if t > cutoff
]
# Check for acceleration
if self._is_accelerating(key):
self._alert_predicted_hot_key(key, current_rate)
def _is_accelerating(self, key: str) -> bool:
history = self.rate_history[key]
if len(history) < 3:
return False
# Compare recent rate to older rate
recent = history[-1][1]
older = history[0][1]
if older == 0:
return recent > 100 # Sudden spike from zero
growth = recent / older
return growth >= self.growth_threshold
def _alert_predicted_hot_key(self, key: str, rate: int):
print(f"⚠️ PREDICTED HOT KEY: {key} (rate: {rate}, accelerating)")
External Signals
Monitor external sources that predict traffic:
- Social media mentions (Twitter API, Reddit)
- News mentions
- Marketing campaign schedules
- Scheduled events (product launches, sales)
from datetime import datetime

class ExternalSignalMonitor:
"""Monitor external signals that predict hot keys."""
def __init__(self):
self.known_upcoming_hot_keys: dict[str, dict] = {}
def register_marketing_campaign(self, key: str, start_time: datetime, expected_traffic: int):
"""Marketing knows they're about to drive traffic to a URL."""
self.known_upcoming_hot_keys[key] = {
'type': 'marketing_campaign',
'start_time': start_time,
'expected_traffic': expected_traffic
}
self._pre_warm_caches(key)
self._pre_scale_partition(key)
def check_social_media_trends(self):
"""Poll social media for URLs being shared."""
# Check Twitter, Reddit, etc. for your domain
# If a URL is trending, pre-warm it
pass
def _pre_warm_caches(self, key: str):
"""Load key into cache before traffic arrives."""
print(f"Pre-warming caches for {key}")
def _pre_scale_partition(self, key: str):
"""Add capacity to the partition owning this key."""
print(f"Pre-scaling partition for {key}")
Chapter 3: Mitigation Strategies
Once you've detected a hot key, how do you handle it?
3.1 Strategy 1: Caching
The most common and effective strategy. If you can serve from cache, the database never sees the hot key traffic.
Request → Cache → [HIT] → Return (99% of hot key requests)
→ [MISS] → Database → Cache → Return (1%)
For read-heavy hot keys (most common):
class HotKeyCachingStrategy:
def __init__(self, redis_client, db_client, hot_key_detector):
self.redis = redis_client
self.db = db_client
self.detector = hot_key_detector
# Different TTLs for hot vs normal keys
self.normal_ttl = 300 # 5 minutes
self.hot_key_ttl = 3600 # 1 hour (keep hot keys cached longer)
def get(self, key: str) -> Optional[str]:
# Try cache
value = self.redis.get(key)
if value:
return value
# Cache miss - get from DB
value = self.db.get(key)
if value:
# Use longer TTL for hot keys
ttl = self.hot_key_ttl if self.detector.is_hot(key) else self.normal_ttl
self.redis.setex(key, ttl, value)
return value
Problem: Cache stampede on hot key expiration
When a hot key expires, thousands of requests simultaneously hit the database.
Solution: Probabilistic early expiration
import random
import math
class StampedeProtectedCache:
"""
Implements probabilistic early expiration to prevent cache stampede.
Based on the XFetch algorithm.
"""
def __init__(self, redis_client, db_client, beta: float = 1.0):
self.redis = redis_client
self.db = db_client
self.beta = beta # Controls early recomputation probability
def get(self, key: str, ttl: int = 300) -> Optional[str]:
# Get value with metadata
data = self.redis.hgetall(f"cache:{key}")
if not data:
return self._recompute_and_cache(key, ttl)
value = data.get('value')
delta = float(data.get('delta', 1)) # Time to recompute
expiry = float(data.get('expiry', 0))
now = time.time()
# Probabilistic early expiration
# Higher probability of recompute as we approach expiry
time_until_expiry = expiry - now
if time_until_expiry <= 0:
# Actually expired
return self._recompute_and_cache(key, ttl)
        # XFetch: recompute early with a probability that rises as expiry nears;
        # the expected early-refresh window scales with recomputation time (delta) and beta
        if time_until_expiry < delta * self.beta * (-math.log(random.random())):
# Early recomputation (async, don't block)
threading.Thread(
target=self._recompute_and_cache,
args=(key, ttl),
daemon=True
).start()
return value
def _recompute_and_cache(self, key: str, ttl: int) -> Optional[str]:
start = time.time()
value = self.db.get(key)
delta = time.time() - start # How long recomputation took
if value:
expiry = time.time() + ttl
self.redis.hset(f"cache:{key}", mapping={
'value': value,
'delta': delta,
'expiry': expiry
})
self.redis.expire(f"cache:{key}", ttl + 60) # Extra buffer
return value
3.2 Strategy 2: Read Replicas for Hot Keys
Add read replicas specifically for hot partitions.
Normal traffic:
Request → Partition Primary
Hot key traffic:
Request → Load Balancer → [Hot Key Replica 1]
→ [Hot Key Replica 2]
→ [Hot Key Replica 3]
class HotKeyReplicaRouter:
def __init__(self, partition_map, hot_key_replicas: dict[str, list]):
self.partition_map = partition_map
self.hot_key_replicas = hot_key_replicas # key -> list of replica connections
def get_connection(self, key: str):
if key in self.hot_key_replicas:
# Round-robin across hot key replicas
replicas = self.hot_key_replicas[key]
return random.choice(replicas)
else:
# Normal routing
partition = self.partition_map.get_partition(key)
return partition.get_replica()
When to use:
- Hot keys are predictable (you know which ones)
- You can afford dedicated infrastructure
- Cache isn't sufficient (high write rate or consistency requirements)
3.3 Strategy 3: Key Splitting (Salting)
Split one logical hot key into multiple physical keys distributed across partitions.
Logical key: "viral_url_abc"
Physical keys:
"viral_url_abc:0" → Partition 0
"viral_url_abc:1" → Partition 1
"viral_url_abc:2" → Partition 2
...
"viral_url_abc:9" → Partition 9
Reads pick a random suffix; writes go to all suffixes.
class SplitKeyStrategy:
def __init__(self, split_factor: int = 10):
self.split_factor = split_factor
self.hot_keys: set[str] = set()
def mark_hot(self, key: str):
"""Mark a key as hot, enabling splitting."""
self.hot_keys.add(key)
def mark_cold(self, key: str):
"""Mark a key as no longer hot."""
self.hot_keys.discard(key)
def get_read_key(self, key: str) -> str:
"""Get the physical key for reading."""
if key in self.hot_keys:
suffix = random.randint(0, self.split_factor - 1)
return f"{key}:{suffix}"
return key
def get_write_keys(self, key: str) -> list[str]:
"""Get all physical keys for writing."""
if key in self.hot_keys:
return [f"{key}:{i}" for i in range(self.split_factor)]
return [key]
def read(self, key: str, cache) -> Optional[str]:
"""Read from a random split."""
physical_key = self.get_read_key(key)
return cache.get(physical_key)
def write(self, key: str, value: str, cache):
"""Write to all splits."""
physical_keys = self.get_write_keys(key)
for pk in physical_keys:
cache.set(pk, value)
Trade-offs:
- ✅ Spreads load across partitions
- ❌ Writes amplified by split factor
- ❌ Updates require coordinating all splits
- ❌ Consistency challenges (stale reads during updates)
Best for: Read-heavy hot keys with infrequent updates (like viral URLs).
3.4 Strategy 4: Local Caching
Cache hot keys in application server memory, not just Redis.
Request → Local Cache → [HIT] → Return (fastest)
→ [MISS] → Redis → [HIT] → Local Cache → Return
→ [MISS] → DB → Redis → Local Cache → Return
from typing import Optional
from cachetools import TTLCache
import threading
class TwoLevelCache:
def __init__(self, redis_client, db_client, local_max_size: int = 1000, local_ttl: int = 10):
self.redis = redis_client
self.db = db_client
# Local in-memory cache with TTL
self.local_cache = TTLCache(maxsize=local_max_size, ttl=local_ttl)
self.local_lock = threading.Lock()
# Track hot keys for local caching
self.hot_keys: set[str] = set()
def get(self, key: str) -> Optional[str]:
# Level 1: Local cache (only for hot keys)
if key in self.hot_keys:
with self.local_lock:
if key in self.local_cache:
return self.local_cache[key]
# Level 2: Redis
value = self.redis.get(key)
if value:
if key in self.hot_keys:
with self.local_lock:
self.local_cache[key] = value
return value
# Level 3: Database
value = self.db.get(key)
if value:
self.redis.setex(key, 300, value)
if key in self.hot_keys:
with self.local_lock:
self.local_cache[key] = value
return value
def mark_hot(self, key: str):
self.hot_keys.add(key)
def invalidate(self, key: str):
"""Invalidate across all levels."""
with self.local_lock:
self.local_cache.pop(key, None)
self.redis.delete(key)
Challenge: Consistency across application servers
Each server has its own local cache. When data changes, all caches must be invalidated.
Solution: Pub/Sub invalidation
class DistributedLocalCache:
def __init__(self, redis_client, channel: str = "cache_invalidation"):
self.redis = redis_client
self.channel = channel
self.local_cache = TTLCache(maxsize=1000, ttl=10)
# Subscribe to invalidation messages
self._start_subscriber()
def _start_subscriber(self):
def listen():
pubsub = self.redis.pubsub()
pubsub.subscribe(self.channel)
for message in pubsub.listen():
if message['type'] == 'message':
key = message['data']
self.local_cache.pop(key, None)
threading.Thread(target=listen, daemon=True).start()
def invalidate(self, key: str):
"""Invalidate locally and broadcast to other servers."""
self.local_cache.pop(key, None)
self.redis.publish(self.channel, key)
3.5 Strategy 5: Request Coalescing
Multiple concurrent requests for the same key get merged into one database query.
Time 0: Request A for key "hot" arrives → Start DB query
Time 1: Request B for key "hot" arrives → Wait for A's result
Time 2: Request C for key "hot" arrives → Wait for A's result
Time 3: DB query completes → Return to A, B, C simultaneously
import asyncio
from typing import Dict, Optional
from collections import defaultdict
class RequestCoalescer:
"""
Coalesce multiple concurrent requests for the same key
into a single database query.
"""
def __init__(self, db_client):
self.db = db_client
self.pending: Dict[str, asyncio.Future] = {}
self.lock = asyncio.Lock()
    async def get(self, key: str) -> Optional[str]:
        async with self.lock:
            existing = self.pending.get(key)
            if existing is None:
                # We're the first - create a future for others to wait on
                future = asyncio.Future()
                self.pending[key] = future
        if existing is not None:
            # Another request is already fetching this key.
            # Await its result outside the lock so waiters don't serialize.
            return await existing
        try:
            # Fetch from database
            value = await self.db.get(key)
            future.set_result(value)
            return value
        except Exception as e:
            future.set_exception(e)
            raise
        finally:
            async with self.lock:
                del self.pending[key]
# Synchronous version using threading
class SyncRequestCoalescer:
def __init__(self, db_client):
self.db = db_client
self.pending: Dict[str, threading.Event] = {}
        self.results: Dict[str, Optional[str]] = {}
self.lock = threading.Lock()
def get(self, key: str) -> Optional[str]:
with self.lock:
if key in self.pending:
event = self.pending[key]
else:
event = threading.Event()
self.pending[key] = event
# We're the leader - start the fetch
threading.Thread(
target=self._fetch,
args=(key, event),
daemon=True
).start()
# Wait for result
event.wait(timeout=5.0)
with self.lock:
return self.results.get(key)
def _fetch(self, key: str, event: threading.Event):
try:
value = self.db.get(key)
with self.lock:
self.results[key] = value
finally:
event.set()
# Cleanup after a delay to handle late arrivals
time.sleep(0.1)
with self.lock:
self.pending.pop(key, None)
self.results.pop(key, None)
3.6 Strategy Comparison
| Strategy | Best For | Complexity | Consistency | Write Impact |
|---|---|---|---|---|
| Caching | Read-heavy, tolerance for stale | Low | Eventual | None |
| Read Replicas | Predictable hot keys | Medium | Depends on replication | None |
| Key Splitting | Read-heavy, infrequent writes | Medium | Weak | High (fan-out) |
| Local Caching | Ultra-low latency needed | Medium | Weak | Medium (invalidation) |
| Request Coalescing | Cache misses, cold starts | Low | Strong | None |
Part II: The Design Challenge
Chapter 4: Redesigning the URL Shortener for Skewed Traffic
We return to our URL shortener from Days 1-3. New requirement:
0.01% of URLs receive 90% of traffic
With 1 billion URLs, that's 100,000 hot URLs receiving most of the 10,000 requests/second.
4.1 The Problem Illustrated
Current architecture:
URLs distributed by hash(short_code) across 10 partitions
Each partition: ~100M URLs, ~1000 req/sec average
With hot keys:
Partition 3: Has 10 hot URLs → 5000 req/sec (5x average)
Partition 7: Has 50 hot URLs → 7000 req/sec (7x average)
Other partitions: 100-300 req/sec
Partitions 3 and 7 are overloaded. Their latency spikes. Users experience timeouts.
4.2 Solution Architecture
We'll implement a multi-layer defense:
┌─────────────────────────────────────────────┐
│ Request Flow │
└─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Layer 1: CDN Edge Cache │
│ - Cache static redirects at edge │
│ - ~80% of hot key traffic served here │
└────────────────────────────────────────┬────────────────────────────────────────────┘
│ Cache miss
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Layer 2: Application Server (Local Cache + Hot Key Detection) │
│ - Local in-memory cache for hot keys │
│ - Hot key detector marks trending URLs │
│ - ~15% of remaining traffic served here │
└────────────────────────────────────────┬────────────────────────────────────────────┘
│ Cache miss
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Layer 3: Redis Cluster (Distributed Cache) │
│ - Split keys for known hot URLs │
│ - Request coalescing for concurrent misses │
│ - ~4% of remaining traffic served here │
└────────────────────────────────────────┬────────────────────────────────────────────┘
│ Cache miss (~1%)
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Layer 4: Database (Partitioned PostgreSQL) │
│ - Read replicas for hot partitions │
│ - Only cold keys and cache misses reach here │
└─────────────────────────────────────────────────────────────────────────────────────┘
4.3 Implementation
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional, Set
import random
import time
import threading
import redis
import psycopg2.pool
@dataclass
class URLMapping:
short_code: str
original_url: str
created_at: float
is_hot: bool = False
class SkewResistantURLShortener:
"""
URL shortener designed to handle highly skewed traffic.
0.01% of URLs can receive 90% of traffic without degradation.
"""
def __init__(
self,
redis_cluster: redis.RedisCluster,
db_pool: psycopg2.pool.ThreadedConnectionPool,
hot_key_split_factor: int = 10
):
self.redis = redis_cluster
self.db_pool = db_pool
self.split_factor = hot_key_split_factor
# Layer 2: Local cache for hot keys
self.local_cache: dict[str, tuple[str, float]] = {} # key -> (value, expiry)
self.local_cache_lock = threading.Lock()
self.local_cache_ttl = 5 # seconds
# Hot key tracking
self.hot_keys: Set[str] = set()
self.hot_key_detector = CMSHotKeyDetector(
hot_threshold=1000, # 1000 req/min = hot
window_seconds=60
)
        # Request coalescing: the coalescer calls .get() on the client it's given,
        # so hand it a thin adapter around our database fetch
        self.coalescer = SyncRequestCoalescer(SimpleNamespace(get=self._db_get))
# Metrics
self.stats = {
'local_hits': 0,
'redis_hits': 0,
'db_hits': 0,
'total': 0
}
def redirect(self, short_code: str) -> Optional[str]:
"""
Get the original URL for a short code.
Multi-layer caching with hot key optimization.
"""
self.stats['total'] += 1
# Track for hot key detection
is_hot = self.hot_key_detector.record(short_code)
if is_hot and short_code not in self.hot_keys:
self._promote_to_hot(short_code)
# Layer 2: Check local cache (hot keys only)
if short_code in self.hot_keys:
url = self._check_local_cache(short_code)
if url:
self.stats['local_hits'] += 1
return url
# Layer 3: Check Redis (with split keys for hot keys)
url = self._check_redis(short_code)
if url:
self.stats['redis_hits'] += 1
self._update_local_cache(short_code, url)
return url
# Layer 4: Database (with coalescing)
url = self.coalescer.get(short_code)
if url:
self.stats['db_hits'] += 1
self._update_redis(short_code, url)
self._update_local_cache(short_code, url)
return url
def _check_local_cache(self, key: str) -> Optional[str]:
"""Check local in-memory cache."""
with self.local_cache_lock:
if key in self.local_cache:
value, expiry = self.local_cache[key]
if time.time() < expiry:
return value
else:
del self.local_cache[key]
return None
def _update_local_cache(self, key: str, value: str):
"""Update local cache for hot keys."""
if key in self.hot_keys:
with self.local_cache_lock:
self.local_cache[key] = (value, time.time() + self.local_cache_ttl)
def _check_redis(self, key: str) -> Optional[str]:
"""Check Redis, using split keys for hot keys."""
if key in self.hot_keys:
# Read from random split
suffix = random.randint(0, self.split_factor - 1)
redis_key = f"url:{key}:{suffix}"
else:
redis_key = f"url:{key}"
return self.redis.get(redis_key)
def _update_redis(self, key: str, value: str):
"""Update Redis, writing to all splits for hot keys."""
if key in self.hot_keys:
# Write to all splits
pipe = self.redis.pipeline()
for i in range(self.split_factor):
pipe.setex(f"url:{key}:{i}", 3600, value) # 1 hour TTL
pipe.execute()
else:
self.redis.setex(f"url:{key}", 300, value) # 5 min TTL
def _db_get(self, key: str) -> Optional[str]:
"""Fetch from database."""
conn = self.db_pool.getconn()
try:
with conn.cursor() as cur:
cur.execute(
"SELECT original_url FROM urls WHERE short_code = %s",
(key,)
)
result = cur.fetchone()
return result[0] if result else None
finally:
self.db_pool.putconn(conn)
def _promote_to_hot(self, key: str):
"""Promote a key to hot status."""
print(f"🔥 Promoting {key} to hot key")
self.hot_keys.add(key)
# Pre-populate all Redis splits
url = self.redis.get(f"url:{key}") or self._db_get(key)
if url:
self._update_redis(key, url)
def _demote_from_hot(self, key: str):
"""Demote a key from hot status."""
print(f"❄️ Demoting {key} from hot key")
self.hot_keys.discard(key)
# Clean up local cache
with self.local_cache_lock:
self.local_cache.pop(key, None)
# Clean up split keys (keep one)
for i in range(1, self.split_factor):
self.redis.delete(f"url:{key}:{i}")
def get_stats(self) -> dict:
"""Return cache hit statistics."""
total = self.stats['total'] or 1
return {
'total_requests': total,
'local_hit_rate': self.stats['local_hits'] / total,
'redis_hit_rate': self.stats['redis_hits'] / total,
'db_hit_rate': self.stats['db_hits'] / total,
'hot_key_count': len(self.hot_keys)
}
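A minimal wiring sketch for the class above; the Redis Cluster node, Postgres DSN, and short code below are placeholders, not values from this design:

# Hypothetical wiring; hostnames, ports, and the DSN are placeholders.
from psycopg2 import pool as pg_pool
from redis.cluster import RedisCluster, ClusterNode

redis_cluster = RedisCluster(
    startup_nodes=[ClusterNode("localhost", 7000)],
    decode_responses=True
)
db_pool = pg_pool.ThreadedConnectionPool(1, 10, dsn="dbname=urls user=app")

shortener = SkewResistantURLShortener(redis_cluster, db_pool, hot_key_split_factor=10)
print(shortener.redirect("abc123"))   # None until 'abc123' exists in the urls table
print(shortener.get_stats())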
4.4 CDN Configuration for Hot URLs
Configure your CDN to cache redirects:
# Cloudflare Worker for URL shortener caching
WORKER_CODE = """
addEventListener('fetch', event => {
event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
const url = new URL(request.url)
const shortCode = url.pathname.slice(1) // Remove leading /
// Check edge cache
const cache = caches.default
const cacheKey = new Request(url.toString(), request)
let response = await cache.match(cacheKey)
if (response) {
// Cache hit - add header for debugging
response = new Response(response.body, response)
response.headers.set('X-Cache', 'HIT')
return response
}
// Cache miss - fetch from origin
response = await fetch(request)
// Cache successful redirects
if (response.status === 301 || response.status === 302) {
const cacheResponse = response.clone()
// Cache for 1 hour at edge
const headers = new Headers(cacheResponse.headers)
headers.set('Cache-Control', 'public, max-age=3600')
const cachedResponse = new Response(cacheResponse.body, {
status: cacheResponse.status,
statusText: cacheResponse.statusText,
headers: headers
})
    // Note: a deployed Worker would pass the event in and use event.waitUntil;
    // awaiting here keeps this sketch self-contained
    await cache.put(cacheKey, cachedResponse)
}
  // Clone into a mutable Response before adding the debug header
  response = new Response(response.body, response)
  response.headers.set('X-Cache', 'MISS')
return response
}
"""
4.5 Traffic Analysis
Let's analyze the traffic flow with our new architecture:
Total traffic: 10,000 req/sec
Without hot key handling:
All 10,000 req/sec hit Redis/DB
Hot partitions: 5,000-7,000 req/sec each
With hot key handling:
Layer 1 (CDN): 80% of hot key traffic = ~7,200 req/sec served at edge
Layer 2 (Local): 15% of remaining = ~420 req/sec served locally
Layer 3 (Redis): 4% of remaining = ~95 req/sec to Redis
Layer 4 (DB): 1% of remaining = ~25 req/sec to DB (cache misses)
Remaining traffic to process: ~2,800 req/sec (vs 10,000)
Distributed evenly across partitions: ~280 req/sec each
Latency improvement:
Before: P99 = 500ms (hot partition overloaded)
After: P99 = 20ms (load distributed)
Chapter 5: How Instagram Handles Hot Celebrity Posts
Let's study a real-world system dealing with extreme hot keys.
5.1 The Scale of the Problem
Instagram's challenges:
- Cristiano Ronaldo has 600M+ followers
- A post from him generates millions of interactions in minutes
- Showing his post in 600M feeds requires massive fan-out
- Like counts update millions of times per hour
5.2 Feed Generation: Hybrid Push/Pull
Instagram uses different strategies for different account sizes:
For regular users (< 10K followers): Push on write
User posts → For each follower: Add to their feed cache
- Pre-computed feeds
- Fast reads
- Writes fan out to thousands, not millions
For celebrities (> 10K followers): Pull on read
User opens app → Fetch posts from followed celebrities → Merge with pre-computed feed
- No massive write fan-out
- More work at read time
- Celebrity posts fetched on-demand
class InstagramFeedStrategy:
CELEBRITY_THRESHOLD = 10000
def __init__(self, feed_cache, post_store):
self.feed_cache = feed_cache
self.post_store = post_store
def on_new_post(self, author_id: str, post_id: str, follower_count: int):
if follower_count < self.CELEBRITY_THRESHOLD:
# Push to all followers' feeds
self._push_to_followers(author_id, post_id)
else:
# Just store the post - will be pulled on read
self.post_store.add_celebrity_post(author_id, post_id)
def get_feed(self, user_id: str, followed_celebrities: list[str]) -> list[str]:
# Get pre-computed feed
feed = self.feed_cache.get_feed(user_id)
# Merge celebrity posts
for celeb_id in followed_celebrities:
celeb_posts = self.post_store.get_recent_posts(celeb_id)
feed = self._merge_by_time(feed, celeb_posts)
return feed[:100] # Top 100 posts
5.3 Like Counters: Approximate Counting
Showing exact like counts for viral posts is expensive. Instagram uses approximation:
class ApproximateLikeCounter:
"""
Approximate counting for high-volume counters.
Shows "1.2M likes" instead of "1,234,567 likes".
"""
def __init__(self, redis_client):
self.redis = redis_client
# Exact counts for small numbers
self.exact_threshold = 10000
# Sampling rate for large counts
self.sample_rate = 0.01 # Count 1 in 100
def increment(self, post_id: str):
current = self.get_count(post_id)
if current < self.exact_threshold:
# Exact counting for small numbers
self.redis.incr(f"likes:{post_id}")
else:
# Probabilistic increment for large numbers
if random.random() < self.sample_rate:
self.redis.incrby(f"likes:{post_id}", int(1 / self.sample_rate))
def get_count(self, post_id: str) -> int:
return int(self.redis.get(f"likes:{post_id}") or 0)
def get_display_count(self, post_id: str) -> str:
count = self.get_count(post_id)
if count < 1000:
return str(count)
elif count < 1000000:
return f"{count // 1000}K"
else:
return f"{count / 1000000:.1f}M"
5.4 Sharding by Post ID with Hot Shard Detection
class PostShardManager:
"""
Manages post data across shards with hot shard mitigation.
"""
def __init__(self, num_shards: int = 100):
self.num_shards = num_shards
self.shard_load: dict[int, int] = defaultdict(int)
self.hot_posts: set[str] = set()
def get_shard(self, post_id: str) -> int:
return hash(post_id) % self.num_shards
def record_access(self, post_id: str):
shard = self.get_shard(post_id)
self.shard_load[shard] += 1
# Detect hot shards
avg_load = sum(self.shard_load.values()) / self.num_shards
if self.shard_load[shard] > avg_load * 5:
self._handle_hot_shard(shard)
def _handle_hot_shard(self, shard: int):
# Options:
# 1. Add read replicas for this shard
# 2. Migrate some data to less loaded shards
# 3. Cache aggressively for this shard
print(f"⚠️ Hot shard detected: {shard}")
5.5 Key Lessons from Instagram
- Differentiate by popularity: Different strategies for different scales
- Accept approximation: Users don't need exact counts
- Defer work: Pull on read for expensive fan-outs
- Layer caching: CDN → App cache → Redis → Database
- Monitor and adapt: Real-time hot key detection triggers different code paths
Part III: Advanced Topics
Chapter 6: Write-Heavy Hot Keys
So far we've focused on read-heavy hot keys. Write-heavy hot keys are harder.
6.1 The Challenge
Scenario: Live view counter for a viral video
- 1 million concurrent viewers
- Counter incremented every second per viewer
- 1M writes/second to one key
You can't cache writes. Every increment must be recorded.
6.2 Buffering and Batching
Buffer writes locally, flush periodically:
class BufferedCounter:
"""
Buffer increments locally, flush to central store periodically.
Trades some accuracy and latency for throughput.
"""
def __init__(self, redis_client, flush_interval: float = 1.0):
self.redis = redis_client
self.flush_interval = flush_interval
self.local_counts: dict[str, int] = defaultdict(int)
self.lock = threading.Lock()
self._start_flusher()
def increment(self, key: str, amount: int = 1):
with self.lock:
self.local_counts[key] += amount
def _flush(self):
with self.lock:
counts_to_flush = dict(self.local_counts)
self.local_counts.clear()
if counts_to_flush:
pipe = self.redis.pipeline()
for key, count in counts_to_flush.items():
pipe.incrby(f"counter:{key}", count)
pipe.execute()
def _start_flusher(self):
def flush_loop():
while True:
time.sleep(self.flush_interval)
self._flush()
threading.Thread(target=flush_loop, daemon=True).start()
Trade-off: Counter is eventually consistent. During a server crash, buffered counts are lost.
6.3 Distributed Counting with CRDTs
Use a G-Counter (Grow-only Counter) CRDT:
class GCounter:
"""
CRDT G-Counter: Each node maintains its own count.
Total = sum of all node counts.
"""
def __init__(self, node_id: str, redis_client):
self.node_id = node_id
self.redis = redis_client
def increment(self, key: str, amount: int = 1):
# Increment only this node's counter
self.redis.hincrby(f"gcounter:{key}", self.node_id, amount)
def get_total(self, key: str) -> int:
# Sum all nodes' counters
counts = self.redis.hgetall(f"gcounter:{key}")
return sum(int(v) for v in counts.values())
def merge(self, key: str, other_counts: dict[str, int]):
# Merge by taking max of each node's count
pipe = self.redis.pipeline()
for node_id, count in other_counts.items():
# HSET with max
current = int(self.redis.hget(f"gcounter:{key}", node_id) or 0)
if count > current:
pipe.hset(f"gcounter:{key}", node_id, count)
pipe.execute()
Each server only writes to its own hash field—no contention!
6.4 Sharded Counters
Split counter across multiple keys:
class ShardedCounter:
"""
Shard a counter across multiple keys to reduce contention.
"""
def __init__(self, redis_client, num_shards: int = 100):
self.redis = redis_client
self.num_shards = num_shards
def increment(self, key: str, amount: int = 1):
# Pick a random shard
shard = random.randint(0, self.num_shards - 1)
self.redis.incrby(f"counter:{key}:{shard}", amount)
def get_count(self, key: str) -> int:
# Sum all shards
pipe = self.redis.pipeline()
for i in range(self.num_shards):
pipe.get(f"counter:{key}:{i}")
results = pipe.execute()
return sum(int(r or 0) for r in results)
Trade-off: Reads are O(num_shards) instead of O(1). Use when writes >> reads.
Chapter 7: Hot Keys in Different Systems
7.1 Hot Keys in Kafka
Kafka partitions messages by key. Hot keys cause partition imbalance.
# Problem: All messages for user "viral_celebrity" go to one partition
producer.send('events', key='viral_celebrity', value=event)
# Solution 1: Add random suffix for known hot keys
if user_id in known_hot_users:
partition_key = f"{user_id}:{random.randint(0, 9)}"
else:
partition_key = user_id
producer.send('events', key=partition_key, value=event)
# Solution 2: Use null key for hot events (round-robin partitioning)
if is_hot_event(event):
producer.send('events', key=None, value=event)
else:
producer.send('events', key=event.user_id, value=event)
7.2 Hot Keys in DynamoDB
DynamoDB throttles partitions, not tables. Hot partition = throttled requests.
# Problem: All items for "global_config" on one partition
table.get_item(Key={'pk': 'global_config', 'sk': 'settings'})
# Solution 1: Write-sharding - duplicate the item across N partition keys,
# then read any one copy at random to spread the load
def get_global_config():
    shards = [f"global_config#{i}" for i in range(10)]
    shard = random.choice(shards)
    return table.get_item(Key={'pk': shard, 'sk': 'settings'})
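# Write side of Solution 1 (sketch): duplicate the config item to every shard
# so any one copy can serve a read; shard count and attribute names are illustrative
def put_global_config(settings: dict, num_shards: int = 10):
    for i in range(num_shards):
        table.put_item(Item={'pk': f'global_config#{i}', 'sk': 'settings', **settings})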
# Solution 2: DAX (DynamoDB Accelerator) for read caching
dax_client = amazondax.AmazonDaxClient.resource(endpoint_url='...')
dax_table = dax_client.Table('my-table')
# Hot reads served from cache
7.3 Hot Keys in Elasticsearch
One popular document can make its shard a bottleneck.
# Solution: Route hot documents to dedicated indices
def get_index_for_document(doc_id: str, is_hot: bool) -> str:
if is_hot:
return 'hot_documents' # Index with more shards/replicas
else:
return 'documents'
# Hot index configuration
{
"settings": {
"number_of_shards": 10, # More shards = more parallelism
"number_of_replicas": 5 # More replicas = handle more reads
}
}
Part IV: Discussion and Trade-offs
Chapter 8: The Hard Questions
8.1 "Redesign URL shortener assuming 0.01% of URLs get 90% of traffic"
Complete answer:
"The core insight is that we need different handling for hot vs cold URLs.
For the 99.99% of cold URLs, our existing architecture works fine—hash partition by short_code, standard caching.
For the 0.01% hot URLs, we implement:
- Detection: Count-Min Sketch tracks request frequency per URL with fixed memory. When a URL exceeds threshold (e.g., 1000 req/min), mark it hot.
- Multi-layer caching:
  - CDN edge: Cache redirects for 1 hour. 80%+ of hot traffic served here.
  - Local app cache: 5-second TTL for hot URLs. Fast, no network hop.
  - Redis with key splitting: Hot URLs split across 10 keys, spreading load.
- Request coalescing: When cache expires and multiple requests hit simultaneously, only one fetches from DB; others wait.
- Graceful promotion/demotion: URLs smoothly transition between hot and cold status. Promotion pre-warms caches; demotion cleans up split keys.
The result: hot URLs are served from memory with sub-millisecond latency. Database sees only cache misses—maybe 1% of total traffic.
Trade-offs I'm accepting:
- Slightly stale data for hot URLs (TTL-based caching)
- Memory overhead for tracking hot keys
- Complexity in the caching layer"
8.2 "How does Instagram handle hot celebrity posts?"
Complete answer:
"Instagram faces an extreme version of the hot key problem. A celebrity with 600M followers posting creates three challenges: feed distribution, interaction counting, and content serving.
Feed distribution: They use a hybrid push-pull model. For regular users (<10K followers), posts are pushed to followers' pre-computed feeds. For celebrities, posts are not fanned out—instead, they're pulled at read time and merged into the feed. This trades read latency for write scalability.
Interaction counting: Exact counts are unnecessary and expensive. They use approximate counting—probabilistically sample increments when counts exceed a threshold. Displaying '1.2M likes' doesn't need precision.
Content serving: Multi-layer caching with CDN at the edge. A viral photo is cached globally. The origin serves a tiny fraction of requests.
Key architectural insight: Celebrity accounts are fundamentally different entities in the system. They have different code paths, different storage strategies, even different SLAs. Instead of trying to make one system handle all scales, they explicitly bifurcate the architecture."
8.3 "How do you decide between caching, sharding, and replication for hot keys?"
Decision framework:
| Characteristic | Best Strategy |
|---|---|
| Read-heavy, tolerance for stale | Caching (simplest, most effective) |
| Read-heavy, needs freshness | Read replicas |
| Write-heavy | Sharded counters or batching |
| Predictable hot keys | Pre-provision dedicated resources |
| Unpredictable hot keys | Automatic detection + dynamic mitigation |
| Need strong consistency | Replication + careful invalidation |
"I'd start with caching—it solves 90% of hot key problems. Add replication for capacity. Use sharding only when writes are the bottleneck. The order matters because each adds complexity."
Chapter 9: Session Summary
What You Should Know Now
After this session, you should be able to:
- Explain why hot keys happen (Zipf's law, viral content, temporal patterns)
- Detect hot keys using streaming algorithms (Count-Min Sketch) and monitoring
- Apply multiple mitigation strategies (caching, splitting, replication, coalescing)
- Design systems for skewed workloads (URL shortener with 0.01% hot keys)
- Analyze real-world systems (Instagram's celebrity handling)
Key Trade-offs to Remember
| Decision | Trade-off |
|---|---|
| More caching | Lower latency vs Stale data |
| Key splitting | Distributed load vs Write amplification |
| Local caching | Fastest reads vs Consistency challenges |
| Approximate counting | Scalability vs Precision |
| Hybrid push/pull | Write scalability vs Read complexity |
Questions to Ask in Every Design
- What's the expected traffic distribution? (Uniform? Zipf?)
- Which keys might go viral? Can we predict them?
- What's acceptable staleness for hot keys?
- How do we detect when a key becomes hot?
- What's the fallback if our hot key mitigation fails?
Part V: Interview Questions and Answers
Chapter 10: Real-World Interview Scenarios
10.1 Conceptual Questions
Question 1: "What is a hot key and why is it a problem?"
Interviewer's Intent: Testing fundamental understanding.
Strong Answer:
"A hot key is a single key that receives a disproportionate share of traffic—often thousands of times more than the average key. It's a problem because in distributed systems, keys are typically partitioned across nodes. All requests for a hot key go to one node, creating a bottleneck while other nodes sit idle.
The impact is severe: the hot partition becomes overloaded, latency spikes, and if health checks fail, it can cascade to other nodes. I've seen a single viral URL take down an entire URL shortener because it overwhelmed its partition.
The underlying cause is usually Zipf's law—in most systems, a small fraction of items (1%) get most of the traffic (80%+). Viral content, celebrity accounts, and popular products are common examples."
Question 2: "Explain the Count-Min Sketch and why it's useful for hot key detection."
Interviewer's Intent: Testing knowledge of probabilistic data structures.
Strong Answer:
"Count-Min Sketch is a probabilistic data structure for frequency estimation with fixed memory. It uses a 2D array of counters and multiple hash functions.
To increment: hash the key with each function, increment the counter at each resulting position. To query: hash the key, return the minimum of all counters (hence 'min').
It has interesting properties:
- May overestimate counts (hash collisions add to counts)
- Never underestimates
- Memory is fixed regardless of unique key count
- Accuracy improves with more hash functions and wider array
For hot key detection, it's perfect because:
- We need approximate counts, not exact (just need to know if key is 'hot')
- We can't store timestamps for every request (would need unbounded memory)
- False positives are acceptable (treating a warm key as hot is fine)
- False negatives don't happen (the sketch never undercounts, so a truly hot key always crosses the threshold)
I'd use it as a first filter, then do exact counting only for keys that pass the threshold."
Question 3: "Compare different strategies for handling write-heavy hot keys."
Interviewer's Intent: Testing depth of knowledge.
Strong Answer:
"Write-heavy hot keys are harder than read-heavy because you can't cache writes. Three main strategies:
Buffering/Batching: Each app server buffers writes locally, flushes periodically to the database. A like counter might buffer for 1 second before sending an INCRBY for the sum. Trade-off: Potential data loss if server crashes, latency in seeing updates.
Sharded Counters: Split one counter into N counters across partitions. Writes go to a random shard; reads sum all shards. Trade-off: Reads become O(N) instead of O(1). Best when writes vastly exceed reads.
CRDTs (G-Counters): Each node maintains its own count. Total is the sum. Nodes can increment independently without coordination—perfect for distributed counting. Trade-off: More storage (one field per node), requires aggregation for reads.
For a live viewer count with 1M updates/second, I'd use sharded counters with maybe 1000 shards. That's 1000 updates/second per shard—manageable. Reads can be cached since they don't need to be perfectly real-time."
10.2 Design Questions
Question 4: "Design a system to handle trending hashtags on Twitter."
Interviewer's Intent: Testing end-to-end design for hot keys.
Strong Answer:
"Trending hashtags are the epitome of hot keys—previously unknown hashtags suddenly get millions of tweets.
Detection layer: Stream processing (Kafka + Flink) counts hashtag occurrences in sliding windows. Compare current rate to historical baseline. Hashtag with 10x normal rate → potentially trending.
Storage strategy: Two tiers:
- Cold hashtags: Standard partitioning by hashtag, normal caching
- Hot/trending hashtags: Dedicated 'trending' partition with extra replicas, aggressive caching, sharded counters for counts
Counting: Since we're displaying approximate counts ('10.2K tweets'), use probabilistic counting. Sample 1 in 100 tweets for very hot hashtags. HyperLogLog for unique user counts.
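A minimal sketch of the unique-user piece, assuming redis-py and Redis's built-in HyperLogLog commands (PFADD/PFCOUNT); the key naming is illustrative:

# Sketch: approximate unique users per hashtag with Redis HyperLogLog.
# Fixed memory (~12 KB per counter) with ~0.81% standard error.
import redis

r = redis.Redis()

def record_tweet(hashtag: str, user_id: str):
    r.pfadd(f"hashtag:uniques:{hashtag}", user_id)

def unique_users(hashtag: str) -> int:
    return r.pfcount(f"hashtag:uniques:{hashtag}")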
Serving: The 'trending' page itself is a hot key. Cache it at CDN edge with 30-second TTL. Different regions can show slightly different trending lists—eventual consistency is fine.
Automatic scaling: When a hashtag crosses threshold, automatically:
- Add it to local caches on all app servers
- Create sharded counters
- Route to dedicated infrastructure
- Alert ops team
When it cools down (rate drops below threshold for 1 hour), demote back to normal handling.
Key insight: Treat trending hashtags as a special class with dedicated infrastructure, not as regular hashtags that happen to be popular."
Question 5: "Design a view counter for YouTube videos."
Interviewer's Intent: Testing design for extreme write volumes.
Strong Answer:
"YouTube's challenge: Some videos get millions of views per hour. Every view needs counting.
Requirements clarification:
- Eventual consistency is fine (view count can lag)
- Approximate counts acceptable for display (1.2M not 1,234,567)
- Need accurate counts for monetization (but can be delayed)
Architecture:
View event → Kafka → Stream Processor → Real-time Counter
│
└→ Batch Processor → Accurate Counter (hourly)
Real-time path (for display):
- Events buffered per app server, flushed every second
- Sharded counters in Redis (1000 shards for hot videos)
- Read aggregates all shards, caches result for 10 seconds
Batch path (for monetization):
- All events stored in data lake
- Hourly job computes exact counts with deduplication
- These are the 'official' counts for revenue calculation
Hot video detection:
- Count-Min Sketch identifies videos exceeding threshold
- Hot videos get promoted to dedicated counter infrastructure
- Separate Redis cluster for just hot videos
Display optimization:
- Cache view counts at CDN
- Format as '1.2M views' (no need for precision)
- Update cached count every 30 seconds
Trade-offs:
- Real-time count may differ from batch count (eventual consistency)
- Display count is approximate (user expectation managed via formatting)
- Infrastructure cost scales with hot video count, not total video count"
10.3 Scenario-Based Questions
Question 6: "Your database partition is getting hot due to a viral product. What do you do?"
Interviewer's Intent: Testing incident response.
Strong Answer:
"Immediate triage:
- Confirm the issue: Check metrics—is it actually one product causing the load? Look at query patterns, partition load metrics.
- Immediate mitigation:
  - Add the product to application-level cache with longer TTL
  - If possible, route reads to replicas
  - Consider temporarily reducing cache TTL for other products (free up cache space)
- If still overloaded:
  - Add read replicas to the hot partition specifically
  - Implement request rate limiting for that product's API endpoints
  - Last resort: return slightly stale data (serve from cache even if expired, backfill async)
After immediate crisis:
- Longer-term fixes:
  - Implement hot key detection to catch this automatically next time
  - Add infrastructure for hot key handling (key splitting, dedicated caches)
  - Set up alerts for partition load imbalance
- Post-incident review:
  - Why didn't we detect this before it caused problems?
  - How can we predict viral products? (Marketing coordination, trending detection)
  - What's the cost of permanent hot-key infrastructure vs occasional incidents?
The key is layered response: immediate cache, then replicas, then architectural changes. Don't over-engineer for every possible hot key, but have a playbook ready."
Question 7: "How would you test your hot key handling?"
Interviewer's Intent: Testing quality and operational thinking.
Strong Answer:
"Testing hot keys requires both synthetic tests and production validation.
Unit tests: Test individual components
def test_count_min_sketch_detects_hot_key():
cms = CountMinSketch()
for _ in range(10000):
cms.increment('hot_key')
for _ in range(10):
cms.increment('cold_key')
assert cms.estimate('hot_key') > 9000
assert cms.estimate('cold_key') < 100
def test_key_splitting_distributes_load():
splitter = SplitKeyStrategy(split_factor=10)
splitter.mark_hot('hot_key')
# Verify reads distribute across splits
splits_hit = set()
for _ in range(1000):
key = splitter.get_read_key('hot_key')
splits_hit.add(key.split(':')[-1])
assert len(splits_hit) == 10 # All splits used
Load tests: Simulate hot key traffic
def test_system_handles_hot_key():
    # Generate traffic: 1000 req/sec to one key, 10 req/sec to others
    # Verify:
    # - Latency stays under SLA (p99 < 100ms)
    # - No errors
    # - Hot key detected within 60 seconds
    # - Mitigation activates automatically
    ...
Chaos tests: Inject failures during hot key handling
- Kill cache node holding hot key
- Introduce latency in hot key detection
- Verify graceful degradation
Production testing:
- Shadow mode: Log what would happen, don't actually activate
- Canary: Route 1% of traffic through hot key path
- Game day: Artificially create hot key (with rate limits) and verify response
Monitoring to validate:
- Hot key detection latency (time from hot to detected)
- Mitigation activation time
- False positive rate (keys marked hot that weren't)
- Cache hit rate during hot key events"
10.4 Deep-Dive Questions
Question 8: "How do different databases handle hot partitions?"
Interviewer's Intent: Testing breadth of knowledge.
Strong Answer:
"Different databases have different approaches:
DynamoDB: Adaptive capacity automatically shifts throughput to hot partitions. But there's still a per-partition limit (~3000 RCU). For sustained hot keys, you need application-level handling like write sharding.
Cassandra: Token-aware clients can route around hot nodes, but fundamentally one token range = one set of replicas. You can add replicas or split the token range. Cassandra 4.0+ has improved hot partition handling with better compaction.
MongoDB: With sharded clusters, you can split hot chunks. MongoDB Atlas has auto-scaling based on load. But the shard key choice determines how traffic distributes—a bad shard key can't be fixed by auto-scaling.
Redis Cluster: Each slot has a primary + replicas. Hot slots can have reads served from replicas. But for write-hot keys, you're limited to what one node can handle. Key splitting at the application level is necessary.
CockroachDB/Spanner: Automatic range splitting when a range gets too hot. This is one of their selling points—the database handles hot spots transparently. But there are still limits per range.
Common theme: All databases have some hot key handling, but application-level mitigation is usually necessary for extreme cases. The database can buy you time; it can't solve the fundamental problem of one logical key = one location."
Question 9: "Explain how you'd implement automatic hot key detection and mitigation."
Interviewer's Intent: Testing system design depth.
Strong Answer:
"I'd build it as a service that plugs into the request path:
Request → Hot Key Service → Application → Response
│
└→ Prometheus (metrics)
└→ Action Triggers
Detection pipeline:
- Streaming counter: Count-Min Sketch updated on every request. Fixed memory, O(1) updates.
- Threshold checker: Every second, scan for keys exceeding the threshold. Use an approximate top-k algorithm such as Space-Saving or Lossy Counting (a minimal sketch follows this list).
- Confirmation: Keys exceeding the CMS threshold get promoted to exact counting for 10 seconds. If the exact count confirms, the key is 'hot'.
- Classification: Determine hot key type:
  - Read-heavy: 90%+ reads → caching strategy
  - Write-heavy: 90%+ writes → sharding strategy
  - Mixed: hybrid approach
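A minimal Space-Saving sketch for the top-k tracking mentioned above; the capacity value is an arbitrary choice for illustration, not production tuning:

class SpaceSaving:
    """Track approximate top-k keys with at most `capacity` counters."""
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.counts: dict[str, int] = {}    # key -> estimated count
        self.errors: dict[str, int] = {}    # key -> maximum possible overcount

    def record(self, key: str):
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.capacity:
            self.counts[key] = 1
            self.errors[key] = 0
        else:
            # Replace the current minimum; the new key inherits its count,
            # which bounds how much we may be overestimating it
            victim = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(victim)
            self.errors.pop(victim)
            self.counts[key] = min_count + 1
            self.errors[key] = min_count

    def top_k(self, k: int = 10) -> list[tuple[str, int]]:
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:k]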
Mitigation actions:
class HotKeyMitigator:
def on_hot_key_detected(self, key: str, key_type: str):
if key_type == 'read_heavy':
self.enable_aggressive_caching(key)
self.enable_local_caching(key)
self.notify_cdn_to_cache(key)
elif key_type == 'write_heavy':
self.enable_key_splitting(key)
self.enable_write_buffering(key)
self.alert_ops_team(key)
def on_hot_key_cooled(self, key: str):
self.disable_aggressive_caching(key)
self.cleanup_split_keys(key)
self.alert_ops_team_resolved(key)
Observability:
- Dashboard: Current hot keys, their type, mitigation active
- Alerts: New hot key detected, mitigation failed, hot key causing latency impact
- Logs: Detailed timeline for debugging
Failure modes:
- Detection too slow: Increase sampling rate
- False positives: Tune threshold, add confirmation period
- Mitigation fails: Fallback to rate limiting to protect system
The system should be autonomous for most cases but page humans for anomalies."
Chapter 11: Interview Preparation Checklist
Before your interview, make sure you can:
Concepts
- Explain Zipf's law and why traffic is skewed
- Describe Count-Min Sketch and Space-Saving algorithms
- List five strategies for hot key mitigation with trade-offs
Detection
- Design a streaming hot key detector
- Set up monitoring and alerting for hot keys
- Explain predictive vs reactive detection
Mitigation
- Implement multi-layer caching
- Design key splitting strategy
- Handle write-heavy hot keys (batching, sharding)
Real Systems
- Explain how Instagram handles celebrity accounts
- Know how DynamoDB, Redis, Cassandra handle hot partitions
- Design for specific hot key scenarios (trending hashtags, viral videos)
Exercises
Exercise 1: Hot Key Detection
Implement a hot key detector that:
- Uses Count-Min Sketch for initial filtering
- Promotes suspected hot keys to exact counting
- Supports dynamic threshold adjustment
- Reports hot key statistics
Exercise 2: Cache Stampede Prevention
Implement a cache that prevents stampedes when hot keys expire:
- Probabilistic early expiration (a starting-point sketch follows this exercise)
- Request coalescing
- Background refresh for hot keys
- Measure and report hit rates
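As a starting point for the probabilistic-early-expiration bullet, here is a minimal sketch of the common "XFetch" formulation (the dict-based cache, function names, and parameters are illustrative): each reader may refresh slightly before the real expiry, with a probability that grows as expiry approaches, so one request refreshes the hot key instead of a stampede.

import math
import random
import time

def fetch_with_early_expiration(cache: dict, key, recompute, ttl=300, beta=1.0):
    """
    cache maps key -> (value, delta, expiry); recompute() returns a fresh value.
    delta is the observed recompute cost; larger delta or beta means earlier refresh.
    """
    entry = cache.get(key)
    now = time.time()
    if entry is not None:
        value, delta, expiry = entry
        # -log(random()) is positive, so this nudges the effective "now" forward;
        # the nudge is occasionally large enough to trigger an early refresh.
        if now - delta * beta * math.log(random.random()) < expiry:
            return value
    start = time.time()
    value = recompute()
    delta = time.time() - start
    cache[key] = (value, delta, now + ttl)
    return value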
Exercise 3: Instagram Feed Design
Design the feed generation system for a social network:
- 500M users
- Average user follows 200 accounts
- 0.01% of accounts have >1M followers
- Show the 100 most recent posts from followed accounts
- Handle both posting and reading efficiently (one common shape is sketched below)
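One common shape for this exercise (not the only valid answer) is hybrid fan-out: push posts from ordinary accounts at write time, pull from celebrity accounts at read time. The sketch below uses illustrative in-memory dicts as stand-ins for real stores; posts are dicts with a 'ts' timestamp.

def publish_post(author, post, followers, timelines, celebrity_posts,
                 celebrity_threshold=1_000_000):
    """Hybrid fan-out: push for normal accounts, store-only for celebrities."""
    if len(followers) >= celebrity_threshold:
        # Too many followers to push to: store once, merge at read time.
        celebrity_posts.setdefault(author, []).append(post)
    else:
        for follower in followers:
            timelines.setdefault(follower, []).append(post)

def read_feed(user, following, timelines, celebrity_posts, limit=100):
    """Merge the precomputed timeline with live pulls from followed celebrities."""
    posts = list(timelines.get(user, []))
    for account in following:
        posts.extend(celebrity_posts.get(account, []))
    return sorted(posts, key=lambda p: p["ts"], reverse=True)[:limit]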
Further Reading
- "Designing Data-Intensive Applications" Chapter 6: Partitioning and hot spots
- Instagram Engineering Blog: "Sharding & IDs at Instagram"
- Discord Blog: "How Discord Stores Billions of Messages"
- Twitter Blog: "How We Index 250 Million Tweets a Day"
- Amazon DynamoDB Paper: Adaptive capacity and hot partition handling
Appendix: Code Reference
A.1 Production Hot Key Handler
import threading
import time
import random
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set, Optional, Callable
from enum import Enum

import mmh3
import numpy as np


class HotKeyType(Enum):
    READ_HEAVY = "read_heavy"
    WRITE_HEAVY = "write_heavy"
    MIXED = "mixed"


@dataclass
class HotKeyInfo:
    key: str
    detected_at: float
    request_count: int
    key_type: HotKeyType
    mitigation_active: bool = False
class ProductionHotKeyHandler:
    """
    Production-ready hot key detection and mitigation system.

    Features:
    - Count-Min Sketch for memory-efficient detection
    - Automatic classification (read/write heavy)
    - Pluggable mitigation strategies
    - Graceful promotion/demotion
    - Comprehensive metrics
    """

    def __init__(
        self,
        hot_threshold: int = 1000,   # requests per window
        window_seconds: int = 60,
        cms_width: int = 10000,
        cms_depth: int = 7,
        on_hot_key: Optional[Callable[[HotKeyInfo], None]] = None,
        on_cool_key: Optional[Callable[[str], None]] = None
    ):
        self.hot_threshold = hot_threshold
        self.window_seconds = window_seconds
        self.on_hot_key = on_hot_key
        self.on_cool_key = on_cool_key

        # Count-Min Sketch for approximate counting
        self.cms = np.zeros((cms_depth, cms_width), dtype=np.int64)
        self.cms_depth = cms_depth
        self.cms_width = cms_width

        # Exact counting for suspected hot keys
        self.suspected_keys: Dict[str, list] = defaultdict(list)

        # Confirmed hot keys
        self.hot_keys: Dict[str, HotKeyInfo] = {}

        # Read/write tracking for classification
        self.read_counts: Dict[str, int] = defaultdict(int)
        self.write_counts: Dict[str, int] = defaultdict(int)

        # Thread safety
        self.lock = threading.RLock()

        # Metrics
        self.metrics = {
            'total_requests': 0,
            'hot_key_detections': 0,
            'false_positives': 0,
            'mitigations_activated': 0
        }

        # Background tasks (started last, so all state exists before threads run)
        self._start_background_tasks()
    def record_read(self, key: str) -> bool:
        """Record a read operation. Returns True if key is hot."""
        return self._record(key, is_write=False)

    def record_write(self, key: str) -> bool:
        """Record a write operation. Returns True if key is hot."""
        return self._record(key, is_write=True)

    def _record(self, key: str, is_write: bool) -> bool:
        with self.lock:
            self.metrics['total_requests'] += 1

            # Track read/write ratio
            if is_write:
                self.write_counts[key] += 1
            else:
                self.read_counts[key] += 1

            # Update Count-Min Sketch
            self._cms_increment(key)
            estimate = self._cms_estimate(key)

            # Already confirmed hot: keep exact timestamps flowing so that
            # _check_cooled_keys sees the real current rate.
            if key in self.hot_keys:
                self._track_suspected(key)
                return True

            # Exceeds the suspicion threshold: promote to exact counting
            if estimate >= self.hot_threshold * 0.8:
                self._track_suspected(key)

            return False

    def _cms_increment(self, key: str):
        for i in range(self.cms_depth):
            idx = mmh3.hash(key, i) % self.cms_width
            self.cms[i][idx] += 1

    def _cms_estimate(self, key: str) -> int:
        min_count = float('inf')
        for i in range(self.cms_depth):
            idx = mmh3.hash(key, i) % self.cms_width
            min_count = min(min_count, self.cms[i][idx])
        return int(min_count)
    def _track_suspected(self, key: str):
        """Track suspected hot key with exact timestamps."""
        now = time.time()
        self.suspected_keys[key].append(now)

        # Keep only timestamps within the current window
        cutoff = now - self.window_seconds
        recent = [t for t in self.suspected_keys[key] if t > cutoff]
        self.suspected_keys[key] = recent

        if len(recent) >= self.hot_threshold:
            self._confirm_hot(key, len(recent))

    def _confirm_hot(self, key: str, count: int):
        """Confirm a key as hot and trigger mitigation."""
        if key in self.hot_keys:
            return

        key_type = self._classify_key(key)
        info = HotKeyInfo(
            key=key,
            detected_at=time.time(),
            request_count=count,
            key_type=key_type
        )
        self.hot_keys[key] = info
        self.metrics['hot_key_detections'] += 1

        print(f"🔥 HOT KEY DETECTED: {key} (type: {key_type.value}, count: {count})")

        if self.on_hot_key:
            self.on_hot_key(info)

    def _classify_key(self, key: str) -> HotKeyType:
        """Classify hot key as read-heavy, write-heavy, or mixed."""
        reads = self.read_counts.get(key, 0)
        writes = self.write_counts.get(key, 0)
        total = reads + writes

        if total == 0:
            return HotKeyType.MIXED

        read_ratio = reads / total
        if read_ratio >= 0.9:
            return HotKeyType.READ_HEAVY
        elif read_ratio <= 0.1:
            return HotKeyType.WRITE_HEAVY
        else:
            return HotKeyType.MIXED
    def _check_cooled_keys(self):
        """Check if any hot keys have cooled down."""
        now = time.time()
        cooled = []

        with self.lock:
            for key, info in list(self.hot_keys.items()):
                # Check the current rate within the window
                recent = [t for t in self.suspected_keys.get(key, [])
                          if t > now - self.window_seconds]
                if len(recent) < self.hot_threshold * 0.5:  # 50% of threshold
                    cooled.append(key)

        for key in cooled:
            self._cool_key(key)

    def _cool_key(self, key: str):
        """Handle a key cooling down."""
        with self.lock:
            if key in self.hot_keys:
                del self.hot_keys[key]
                print(f"❄️ Key cooled: {key}")
                if self.on_cool_key:
                    self.on_cool_key(key)

    def _reset_cms(self):
        """Periodically reset CMS to avoid count accumulation."""
        with self.lock:
            self.cms.fill(0)
            self.read_counts.clear()
            self.write_counts.clear()

    def _start_background_tasks(self):
        """Start background monitoring tasks."""
        def check_loop():
            while True:
                time.sleep(5)  # Check every 5 seconds
                self._check_cooled_keys()

        def reset_loop():
            while True:
                time.sleep(self.window_seconds)
                self._reset_cms()

        threading.Thread(target=check_loop, daemon=True).start()
        threading.Thread(target=reset_loop, daemon=True).start()

    def is_hot(self, key: str) -> bool:
        """Check if a key is currently hot."""
        return key in self.hot_keys

    def get_hot_keys(self) -> Dict[str, HotKeyInfo]:
        """Get all currently hot keys."""
        return dict(self.hot_keys)

    def get_metrics(self) -> dict:
        """Get handler metrics."""
        return {
            **self.metrics,
            'current_hot_keys': len(self.hot_keys),
            'suspected_keys': len(self.suspected_keys)
        }
# Example mitigation strategies

class HotKeyMitigationStrategies:
    def __init__(self, cache, db):
        self.cache = cache
        self.db = db
        self.split_factor = 10
        self.split_keys: Set[str] = set()

    def handle_hot_key(self, info: HotKeyInfo):
        """Apply appropriate mitigation based on key type."""
        if info.key_type == HotKeyType.READ_HEAVY:
            self._mitigate_read_heavy(info.key)
        elif info.key_type == HotKeyType.WRITE_HEAVY:
            self._mitigate_write_heavy(info.key)
        else:
            self._mitigate_mixed(info.key)

    def _mitigate_read_heavy(self, key: str):
        """Mitigation for read-heavy hot keys."""
        print(f"Applying read-heavy mitigation for {key}")
        # 1. Extend cache TTL
        value = self.cache.get(key) or self.db.get(key)
        if value:
            self.cache.setex(key, 3600, value)  # 1 hour TTL
        # 2. Enable key splitting for cache reads
        self._enable_split_reads(key, value)

    def _mitigate_write_heavy(self, key: str):
        """Mitigation for write-heavy hot keys."""
        print(f"Applying write-heavy mitigation for {key}")
        # Enable sharded writes
        self.split_keys.add(key)

    def _mitigate_mixed(self, key: str):
        """Mitigation for mixed hot keys."""
        print(f"Applying mixed mitigation for {key}")
        # Apply both strategies
        value = self.cache.get(key) or self.db.get(key)
        self._enable_split_reads(key, value)
        self.split_keys.add(key)

    def _enable_split_reads(self, key: str, value):
        """Distribute value across split keys."""
        if value:
            for i in range(self.split_factor):
                self.cache.setex(f"{key}:{i}", 3600, value)
            self.split_keys.add(key)

    def handle_cool_key(self, key: str):
        """Clean up when key cools down."""
        print(f"Cleaning up mitigation for {key}")
        # Remove split keys
        for i in range(self.split_factor):
            self.cache.delete(f"{key}:{i}")
        self.split_keys.discard(key)

    def get(self, key: str):
        """Get with hot key awareness."""
        if key in self.split_keys:
            # Read from a random split to spread load
            split_key = f"{key}:{random.randint(0, self.split_factor - 1)}"
            return self.cache.get(split_key)
        return self.cache.get(key)

    def set(self, key: str, value: str):
        """Set with hot key awareness."""
        if key in self.split_keys:
            # Write to all splits so any split read sees the new value
            for i in range(self.split_factor):
                self.cache.setex(f"{key}:{i}", 3600, value)
        else:
            self.cache.setex(key, 300, value)
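A.2 Example Wiring
A minimal sketch of how the two classes above could be wired together. FakeCache and FakeDB are illustrative stand-ins for a real Redis client and database; only the calls the handler and strategies actually make are implemented.

class FakeCache(dict):
    """Stand-in cache exposing the Redis-like calls used above."""
    def setex(self, key, ttl, value):
        self[key] = value        # TTL ignored in this stand-in
    def delete(self, key):
        self.pop(key, None)

class FakeDB:
    """Stand-in database that returns a canned value for any key."""
    def get(self, key):
        return f"value-for-{key}"

if __name__ == "__main__":
    strategies = HotKeyMitigationStrategies(cache=FakeCache(), db=FakeDB())
    handler = ProductionHotKeyHandler(
        hot_threshold=100,
        window_seconds=10,
        on_hot_key=strategies.handle_hot_key,
        on_cool_key=strategies.handle_cool_key,
    )

    # Simulate a read-heavy burst against one key
    for _ in range(300):
        handler.record_read("user:celebrity")

    print(handler.get_metrics())
    print(handler.get_hot_keys())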
End of Day 4: Hot Keys and Skew
Tomorrow: Day 5 — Session Store Design. We'll tie together everything from this week by designing a complete session store for 10M concurrent users, including partitioning, replication, and handling datacenter failover.