Week 1 — Day 4: Hot Keys and Skew
System Design Mastery Series
Preface
Days 1-3 taught you how to partition, replicate, and protect your data. We made a convenient assumption throughout: traffic is evenly distributed. Today, we shatter that assumption.
In the real world, data access follows power laws:
- 0.01% of URLs get 90% of clicks
- A celebrity tweet gets 10 million views in an hour
- One product goes viral on TikTok and crashes your inventory system
- A single customer generates 50% of your API traffic
These hot keys break every carefully designed distributed system. A perfectly partitioned database becomes a single-node bottleneck when one partition receives all the traffic.
By the end of this session, you'll understand why hot keys happen, how to detect them before they cause outages, and multiple strategies for mitigating them—from caching to key splitting to architectural redesigns.
Let's begin.
Part I: Foundations
Chapter 1: Understanding Hot Keys and Skew
1.1 What Makes a Key "Hot"?
A key is "hot" when it receives disproportionately more operations than other keys. This creates a bottleneck because:
- With partitioning: All requests for that key go to one partition, overwhelming it while others sit idle
- With caching: a single hot key can outstrip what one cache node can serve, and its expiry triggers a stampede of simultaneous misses
- With rate limiting: The hot key may exhaust quotas intended for many keys
Quantifying "hot": A key is problematically hot when:
- It receives >1000x the average key's traffic
- It consumes >10% of a partition's capacity
- It causes measurable latency degradation for other keys on the same partition
1.2 The Mathematics of Skew: Zipf's Law
Traffic to most systems follows Zipf's Law: the nth most popular item receives approximately 1/n of the traffic of the most popular item.
Rank 1: 1000 requests/sec
Rank 2: 500 requests/sec (1/2)
Rank 3: 333 requests/sec (1/3)
Rank 4: 250 requests/sec (1/4)
...
Rank 100: 10 requests/sec (1/100)
This means:
- The top 1% of keys receive ~30% of traffic
- The top 10% of keys receive ~65% of traffic
- The top 20% of keys receive ~80% of traffic
In extreme cases (viral content, celebrity accounts):
- The top 0.01% of keys receive >90% of traffic
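The exact shares depend on the catalogue size and the Zipf exponent, so treat these percentages as rules of thumb rather than constants. A small sketch, assuming a pure power-law distribution, lets you explore the shares yourself (n_keys and s are assumptions to play with):

# Rough exploration of Zipf-distributed traffic: what share of requests do the
# top keys capture? n_keys and the exponent s are illustrative assumptions.
def zipf_top_share(n_keys: int, top_fraction: float, s: float = 1.0) -> float:
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    top_k = max(1, int(n_keys * top_fraction))
    return sum(weights[:top_k]) / sum(weights)

for frac in (0.0001, 0.01, 0.10, 0.20):
    share = zipf_top_share(100_000, frac)
    print(f"top {frac:.2%} of keys -> {share:.0%} of traffic")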
1.3 Types of Hot Key Scenarios
Scenario 1: Predictable Hot Keys
Some keys are always hot—you know in advance which ones:
- The homepage of a website
- Popular product categories
- Well-known user accounts
These are easier to handle because you can pre-optimize.
Scenario 2: Sudden Hot Keys (Viral Content)
A previously cold key suddenly becomes hot:
- A tweet goes viral
- A URL is shared on Reddit's front page
- A product is featured on a TV show
These are harder because you can't pre-optimize—you must detect and react.
Scenario 3: Temporal Hot Keys
Keys that are hot only during specific times:
- Flash sales (specific product IDs)
- Live events (event IDs during the event)
- Time-based aggregations (current hour's counters)
These are somewhat predictable but require dynamic handling.
Scenario 4: Hot Partitions (Not Just Keys)
Sometimes the problem isn't one key but the partition assignment:
- Range partitioning with auto-increment IDs: newest partition always hot
- Hash collision: multiple unrelated keys hash to same partition
- Correlated access: keys that are always accessed together land on same partition
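As a tiny illustration of the first case, the sketch below uses a made-up four-range layout to show why new auto-increment IDs pile onto the newest range while a hash spreads the same IDs out:

# Made-up four-range layout: with range partitioning on an auto-increment ID,
# every new row lands in the last (newest) range; hashing spreads the same IDs.
RANGES = [(0, 1_000), (1_000, 2_000), (2_000, 3_000), (3_000, 10**9)]

def range_partition(item_id: int) -> int:
    for idx, (lo, hi) in enumerate(RANGES):
        if lo <= item_id < hi:
            return idx
    raise ValueError("id out of range")

def hash_partition(item_id: int, num_partitions: int = 4) -> int:
    return hash(str(item_id)) % num_partitions

new_ids = range(5_000, 5_010)                    # the most recently inserted rows
print([range_partition(i) for i in new_ids])     # all land in partition 3 (the newest)
print([hash_partition(i) for i in new_ids])      # spread across partitions 0-3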
1.4 The Impact of Hot Keys
On latency:
Normal key: P50: 5ms, P99: 20ms
Hot key: P50: 50ms, P99: 500ms (10x worse)
Other keys on same partition: P50: 15ms, P99: 100ms (3-5x worse due to contention)
On availability:
- Hot partition becomes overloaded
- Health checks fail
- Partition marked unhealthy
- Traffic rerouted to replicas (which also become overloaded)
- Cascade failure
On cost:
- Over-provisioning to handle peak hot key traffic
- 10x infrastructure cost for 0.01% of keys
Chapter 2: Detecting Hot Keys
You can't fix what you can't see. Detection is the first step.
2.1 Real-Time Detection
Approach 1: Request Counting with Streaming
Count requests per key in a sliding window. Alert when any key exceeds a threshold.
from collections import defaultdict
import time
import threading
class HotKeyDetector:
def __init__(
self,
window_seconds: int = 60,
hot_threshold: int = 10000,
check_interval: float = 1.0
):
self.window_seconds = window_seconds
self.hot_threshold = hot_threshold
self.check_interval = check_interval
# key -> list of timestamps
self.request_times: dict[str, list[float]] = defaultdict(list)
self.lock = threading.Lock()
self.hot_keys: set[str] = set()
# Start background checker
self._start_checker()
def record(self, key: str) -> bool:
"""Record a request. Returns True if key is hot."""
now = time.time()
with self.lock:
self.request_times[key].append(now)
return key in self.hot_keys
def _check_hot_keys(self):
"""Periodic check for hot keys."""
now = time.time()
cutoff = now - self.window_seconds
with self.lock:
new_hot_keys = set()
keys_to_clean = []
for key, times in self.request_times.items():
# Filter to window
recent = [t for t in times if t > cutoff]
if len(recent) == 0:
keys_to_clean.append(key)
else:
self.request_times[key] = recent
if len(recent) >= self.hot_threshold:
new_hot_keys.add(key)
# Clean up cold keys
for key in keys_to_clean:
del self.request_times[key]
# Detect newly hot keys
newly_hot = new_hot_keys - self.hot_keys
no_longer_hot = self.hot_keys - new_hot_keys
if newly_hot:
self._alert_new_hot_keys(newly_hot)
if no_longer_hot:
self._alert_cooled_keys(no_longer_hot)
self.hot_keys = new_hot_keys
def _alert_new_hot_keys(self, keys: set[str]):
print(f"🔥 NEW HOT KEYS DETECTED: {keys}")
# Send to monitoring system, trigger mitigation
def _alert_cooled_keys(self, keys: set[str]):
print(f"❄️ Keys cooled down: {keys}")
def _start_checker(self):
def run():
while True:
time.sleep(self.check_interval)
self._check_hot_keys()
thread = threading.Thread(target=run, daemon=True)
thread.start()
Problem: This stores every timestamp, which is expensive for high-traffic keys.
Approach 2: Count-Min Sketch (Probabilistic)
Use a probabilistic data structure for approximate counting with fixed memory.
import mmh3 # MurmurHash
import numpy as np
class CountMinSketch:
"""
Probabilistic data structure for frequency estimation.
Uses fixed memory regardless of number of unique keys.
"""
def __init__(self, width: int = 10000, depth: int = 7):
self.width = width
self.depth = depth
self.table = np.zeros((depth, width), dtype=np.int64)
def _hash(self, key: str, seed: int) -> int:
return mmh3.hash(key, seed) % self.width
def increment(self, key: str, count: int = 1):
for i in range(self.depth):
idx = self._hash(key, i)
self.table[i][idx] += count
def estimate(self, key: str) -> int:
"""Returns estimated count (may overestimate, never underestimates)."""
min_count = float('inf')
for i in range(self.depth):
idx = self._hash(key, i)
min_count = min(min_count, self.table[i][idx])
return int(min_count)
def reset(self):
"""Reset all counters (call periodically for time-windowed counting)."""
self.table.fill(0)
class CMSHotKeyDetector:
"""Hot key detection using Count-Min Sketch."""
def __init__(self, hot_threshold: int = 10000, window_seconds: int = 60):
self.cms = CountMinSketch()
self.hot_threshold = hot_threshold
self.window_seconds = window_seconds
# Track potential hot keys for exact counting
self.potential_hot_keys: dict[str, int] = {}
self.confirmed_hot_keys: set[str] = set()
# Periodic reset
self._start_reset_timer()
def record(self, key: str) -> bool:
"""Record request. Returns True if key is confirmed hot."""
self.cms.increment(key)
estimate = self.cms.estimate(key)
# If estimate exceeds threshold, do exact counting
if estimate >= self.hot_threshold * 0.8: # 80% threshold for promotion
self.potential_hot_keys[key] = self.potential_hot_keys.get(key, 0) + 1
if self.potential_hot_keys[key] >= self.hot_threshold:
if key not in self.confirmed_hot_keys:
self.confirmed_hot_keys.add(key)
self._on_hot_key_detected(key)
return True
return key in self.confirmed_hot_keys
def _on_hot_key_detected(self, key: str):
print(f"🔥 HOT KEY: {key}")
def _start_reset_timer(self):
def reset_loop():
while True:
time.sleep(self.window_seconds)
self.cms.reset()
self.potential_hot_keys.clear()
# Keep confirmed_hot_keys until explicitly cleared
threading.Thread(target=reset_loop, daemon=True).start()
Advantages:
- Fixed memory: O(width × depth) regardless of key count
- Fast: O(depth) per operation
- Good enough for detection (some false positives, no false negatives)
2.2 Monitoring and Alerting
Set up dashboards and alerts for:
# Prometheus metrics for hot key monitoring
from prometheus_client import Counter, Histogram, Gauge
# Request count per key (use with care - high cardinality)
key_requests = Counter(
'requests_by_key_total',
'Total requests per key',
['key']
)
# Hot key gauge
hot_key_count = Gauge(
'hot_keys_count',
'Number of currently hot keys'
)
# Request latency by partition
partition_latency = Histogram(
'partition_request_latency_seconds',
'Request latency by partition',
['partition_id'],
buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
# Partition load imbalance
partition_load = Gauge(
'partition_requests_per_second',
'Requests per second per partition',
['partition_id']
)
Alert conditions:
# Prometheus alerting rules
groups:
- name: hot_keys
rules:
- alert: HotKeyDetected
expr: rate(requests_by_key_total[1m]) > 1000
for: 30s
labels:
severity: warning
annotations:
summary: "Hot key detected: {{ $labels.key }}"
- alert: PartitionImbalance
expr: |
max(partition_requests_per_second) /
avg(partition_requests_per_second) > 5
for: 1m
labels:
severity: critical
annotations:
summary: "Partition load imbalance detected"
2.3 Predictive Detection
Don't just react—predict hot keys before they cause problems.
Using Access Patterns
class PredictiveHotKeyDetector:
"""
Predict hot keys based on access pattern acceleration.
A key becoming hot shows rapid growth before hitting threshold.
"""
def __init__(self, growth_threshold: float = 10.0, window_seconds: int = 60):
self.growth_threshold = growth_threshold
self.window_seconds = window_seconds
# Track request rates over time
self.rate_history: dict[str, list[tuple[float, int]]] = defaultdict(list)
def record_rate(self, key: str, current_rate: int):
"""Record current rate for a key."""
now = time.time()
self.rate_history[key].append((now, current_rate))
# Keep only recent history
cutoff = now - self.window_seconds
self.rate_history[key] = [
(t, r) for t, r in self.rate_history[key] if t > cutoff
]
# Check for acceleration
if self._is_accelerating(key):
self._alert_predicted_hot_key(key, current_rate)
def _is_accelerating(self, key: str) -> bool:
history = self.rate_history[key]
if len(history) < 3:
return False
# Compare recent rate to older rate
recent = history[-1][1]
older = history[0][1]
if older == 0:
return recent > 100 # Sudden spike from zero
growth = recent / older
return growth >= self.growth_threshold
def _alert_predicted_hot_key(self, key: str, rate: int):
print(f"⚠️ PREDICTED HOT KEY: {key} (rate: {rate}, accelerating)")
External Signals
Monitor external sources that predict traffic:
- Social media mentions (Twitter API, Reddit)
- News mentions
- Marketing campaign schedules
- Scheduled events (product launches, sales)
from datetime import datetime

class ExternalSignalMonitor:
"""Monitor external signals that predict hot keys."""
def __init__(self):
self.known_upcoming_hot_keys: dict[str, dict] = {}
def register_marketing_campaign(self, key: str, start_time: datetime, expected_traffic: int):
"""Marketing knows they're about to drive traffic to a URL."""
self.known_upcoming_hot_keys[key] = {
'type': 'marketing_campaign',
'start_time': start_time,
'expected_traffic': expected_traffic
}
self._pre_warm_caches(key)
self._pre_scale_partition(key)
def check_social_media_trends(self):
"""Poll social media for URLs being shared."""
# Check Twitter, Reddit, etc. for your domain
# If a URL is trending, pre-warm it
pass
def _pre_warm_caches(self, key: str):
"""Load key into cache before traffic arrives."""
print(f"Pre-warming caches for {key}")
def _pre_scale_partition(self, key: str):
"""Add capacity to the partition owning this key."""
print(f"Pre-scaling partition for {key}")
Chapter 3: Mitigation Strategies
Once you've detected a hot key, how do you handle it?
3.1 Strategy 1: Caching
The most common and effective strategy. If you can serve from cache, the database never sees the hot key traffic.
Request → Cache → [HIT] → Return (99% of hot key requests)
→ [MISS] → Database → Cache → Return (1%)
For read-heavy hot keys (most common):
class HotKeyCachingStrategy:
def __init__(self, redis_client, db_client, hot_key_detector):
self.redis = redis_client
self.db = db_client
self.detector = hot_key_detector
# Different TTLs for hot vs normal keys
self.normal_ttl = 300 # 5 minutes
self.hot_key_ttl = 3600 # 1 hour (keep hot keys cached longer)
def get(self, key: str) -> Optional[str]:
# Try cache
value = self.redis.get(key)
if value:
return value
# Cache miss - get from DB
value = self.db.get(key)
if value:
# Use longer TTL for hot keys
ttl = self.hot_key_ttl if self.detector.is_hot(key) else self.normal_ttl
self.redis.setex(key, ttl, value)
return value
Problem: Cache stampede on hot key expiration
When a hot key expires, thousands of requests simultaneously hit the database.
Solution: Probabilistic early expiration
import random
import math
class StampedeProtectedCache:
"""
Implements probabilistic early expiration to prevent cache stampede.
Based on the XFetch algorithm.
"""
def __init__(self, redis_client, db_client, beta: float = 1.0):
self.redis = redis_client
self.db = db_client
self.beta = beta # Controls early recomputation probability
def get(self, key: str, ttl: int = 300) -> Optional[str]:
# Get value with metadata
data = self.redis.hgetall(f"cache:{key}")
if not data:
return self._recompute_and_cache(key, ttl)
value = data.get('value')
delta = float(data.get('delta', 1)) # Time to recompute
expiry = float(data.get('expiry', 0))
now = time.time()
# Probabilistic early expiration
# Higher probability of recompute as we approach expiry
time_until_expiry = expiry - now
if time_until_expiry <= 0:
# Actually expired
return self._recompute_and_cache(key, ttl)
        # XFetch: recompute early with a probability that rises as expiry nears;
        # the expected early-refresh window scales with recomputation time (delta) and beta
        if time_until_expiry < delta * self.beta * (-math.log(random.random())):
# Early recomputation (async, don't block)
threading.Thread(
target=self._recompute_and_cache,
args=(key, ttl),
daemon=True
).start()
return value
def _recompute_and_cache(self, key: str, ttl: int) -> Optional[str]:
start = time.time()
value = self.db.get(key)
delta = time.time() - start # How long recomputation took
if value:
expiry = time.time() + ttl
self.redis.hset(f"cache:{key}", mapping={
'value': value,
'delta': delta,
'expiry': expiry
})
self.redis.expire(f"cache:{key}", ttl + 60) # Extra buffer
return value
3.2 Strategy 2: Read Replicas for Hot Keys
Add read replicas specifically for hot partitions.
Normal traffic:
Request → Partition Primary
Hot key traffic:
Request → Load Balancer → [Hot Key Replica 1]
→ [Hot Key Replica 2]
→ [Hot Key Replica 3]
class HotKeyReplicaRouter:
def __init__(self, partition_map, hot_key_replicas: dict[str, list]):
self.partition_map = partition_map
self.hot_key_replicas = hot_key_replicas # key -> list of replica connections
def get_connection(self, key: str):
if key in self.hot_key_replicas:
# Round-robin across hot key replicas
replicas = self.hot_key_replicas[key]
return random.choice(replicas)
else:
# Normal routing
partition = self.partition_map.get_partition(key)
return partition.get_replica()
When to use:
- Hot keys are predictable (you know which ones)
- You can afford dedicated infrastructure
- Cache isn't sufficient (high write rate or consistency requirements)
3.3 Strategy 3: Key Splitting (Salting)
Split one logical hot key into multiple physical keys distributed across partitions.
Logical key: "viral_url_abc"
Physical keys:
"viral_url_abc:0" → Partition 0
"viral_url_abc:1" → Partition 1
"viral_url_abc:2" → Partition 2
...
"viral_url_abc:9" → Partition 9
Reads pick a random suffix; writes go to all suffixes.
class SplitKeyStrategy:
def __init__(self, split_factor: int = 10):
self.split_factor = split_factor
self.hot_keys: set[str] = set()
def mark_hot(self, key: str):
"""Mark a key as hot, enabling splitting."""
self.hot_keys.add(key)
def mark_cold(self, key: str):
"""Mark a key as no longer hot."""
self.hot_keys.discard(key)
def get_read_key(self, key: str) -> str:
"""Get the physical key for reading."""
if key in self.hot_keys:
suffix = random.randint(0, self.split_factor - 1)
return f"{key}:{suffix}"
return key
def get_write_keys(self, key: str) -> list[str]:
"""Get all physical keys for writing."""
if key in self.hot_keys:
return [f"{key}:{i}" for i in range(self.split_factor)]
return [key]
def read(self, key: str, cache) -> Optional[str]:
"""Read from a random split."""
physical_key = self.get_read_key(key)
return cache.get(physical_key)
def write(self, key: str, value: str, cache):
"""Write to all splits."""
physical_keys = self.get_write_keys(key)
for pk in physical_keys:
cache.set(pk, value)
Trade-offs:
- ✅ Spreads load across partitions
- ❌ Writes amplified by split factor
- ❌ Updates require coordinating all splits
- ❌ Consistency challenges (stale reads during updates)
Best for: Read-heavy hot keys with infrequent updates (like viral URLs).
3.4 Strategy 4: Local Caching
Cache hot keys in application server memory, not just Redis.
Request → Local Cache → [HIT] → Return (fastest)
→ [MISS] → Redis → [HIT] → Local Cache → Return
→ [MISS] → DB → Redis → Local Cache → Return
from typing import Optional
from cachetools import TTLCache
import threading
class TwoLevelCache:
def __init__(self, redis_client, db_client, local_max_size: int = 1000, local_ttl: int = 10):
self.redis = redis_client
self.db = db_client
# Local in-memory cache with TTL
self.local_cache = TTLCache(maxsize=local_max_size, ttl=local_ttl)
self.local_lock = threading.Lock()
# Track hot keys for local caching
self.hot_keys: set[str] = set()
def get(self, key: str) -> Optional[str]:
# Level 1: Local cache (only for hot keys)
if key in self.hot_keys:
with self.local_lock:
if key in self.local_cache:
return self.local_cache[key]
# Level 2: Redis
value = self.redis.get(key)
if value:
if key in self.hot_keys:
with self.local_lock:
self.local_cache[key] = value
return value
# Level 3: Database
value = self.db.get(key)
if value:
self.redis.setex(key, 300, value)
if key in self.hot_keys:
with self.local_lock:
self.local_cache[key] = value
return value
def mark_hot(self, key: str):
self.hot_keys.add(key)
def invalidate(self, key: str):
"""Invalidate across all levels."""
with self.local_lock:
self.local_cache.pop(key, None)
self.redis.delete(key)
Challenge: Consistency across application servers
Each server has its own local cache. When data changes, all caches must be invalidated.
Solution: Pub/Sub invalidation
class DistributedLocalCache:
def __init__(self, redis_client, channel: str = "cache_invalidation"):
self.redis = redis_client
self.channel = channel
self.local_cache = TTLCache(maxsize=1000, ttl=10)
# Subscribe to invalidation messages
self._start_subscriber()
def _start_subscriber(self):
def listen():
pubsub = self.redis.pubsub()
pubsub.subscribe(self.channel)
for message in pubsub.listen():
if message['type'] == 'message':
key = message['data']
self.local_cache.pop(key, None)
threading.Thread(target=listen, daemon=True).start()
def invalidate(self, key: str):
"""Invalidate locally and broadcast to other servers."""
self.local_cache.pop(key, None)
self.redis.publish(self.channel, key)
3.5 Strategy 5: Request Coalescing
Multiple concurrent requests for the same key get merged into one database query.
Time 0: Request A for key "hot" arrives → Start DB query
Time 1: Request B for key "hot" arrives → Wait for A's result
Time 2: Request C for key "hot" arrives → Wait for A's result
Time 3: DB query completes → Return to A, B, C simultaneously
import asyncio
from typing import Dict, Optional
from collections import defaultdict
class RequestCoalescer:
"""
Coalesce multiple concurrent requests for the same key
into a single database query.
"""
def __init__(self, db_client):
self.db = db_client
self.pending: Dict[str, asyncio.Future] = {}
self.lock = asyncio.Lock()
    async def get(self, key: str) -> Optional[str]:
        async with self.lock:
            existing = self.pending.get(key)
            if existing is None:
                # We're the first - create a future for others to wait on
                future = asyncio.Future()
                self.pending[key] = future
        if existing is not None:
            # Another request is already fetching this key.
            # Await its result outside the lock so waiters don't serialize.
            return await existing
        try:
            # Fetch from database
            value = await self.db.get(key)
            future.set_result(value)
            return value
        except Exception as e:
            future.set_exception(e)
            raise
        finally:
            async with self.lock:
                del self.pending[key]
# Synchronous version using threading
class SyncRequestCoalescer:
def __init__(self, db_client):
self.db = db_client
self.pending: Dict[str, threading.Event] = {}
        self.results: Dict[str, Optional[str]] = {}
self.lock = threading.Lock()
def get(self, key: str) -> Optional[str]:
with self.lock:
if key in self.pending:
event = self.pending[key]
else:
event = threading.Event()
self.pending[key] = event
# We're the leader - start the fetch
threading.Thread(
target=self._fetch,
args=(key, event),
daemon=True
).start()
# Wait for result
event.wait(timeout=5.0)
with self.lock:
return self.results.get(key)
def _fetch(self, key: str, event: threading.Event):
try:
value = self.db.get(key)
with self.lock:
self.results[key] = value
finally:
event.set()
# Cleanup after a delay to handle late arrivals
time.sleep(0.1)
with self.lock:
self.pending.pop(key, None)
self.results.pop(key, None)
3.6 Strategy Comparison
| Strategy | Best For | Complexity | Consistency | Write Impact |
|---|---|---|---|---|
| Caching | Read-heavy, tolerance for stale | Low | Eventual | None |
| Read Replicas | Predictable hot keys | Medium | Depends on replication | None |
| Key Splitting | Read-heavy, infrequent writes | Medium | Weak | High (fan-out) |
| Local Caching | Ultra-low latency needed | Medium | Weak | Medium (invalidation) |
| Request Coalescing | Cache misses, cold starts | Low | Strong | None |
Part II: The Design Challenge
Chapter 4: Redesigning the URL Shortener for Skewed Traffic
We return to our URL shortener from Days 1-3. New requirement:
0.01% of URLs receive 90% of traffic
With 1 billion URLs, that's 100,000 hot URLs receiving most of the 10,000 requests/second.
4.1 The Problem Illustrated
Current architecture:
URLs distributed by hash(short_code) across 10 partitions
Each partition: ~100M URLs, ~1000 req/sec average
With hot keys:
Partition 3: Has 10 hot URLs → 5000 req/sec (5x average)
Partition 7: Has 50 hot URLs → 7000 req/sec (7x average)
Other partitions: 100-300 req/sec
Partitions 3 and 7 are overloaded. Their latency spikes. Users experience timeouts.
4.2 Solution Architecture
We'll implement a multi-layer defense:
┌─────────────────────────────────────────────┐
│ Request Flow │
└─────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Layer 1: CDN Edge Cache │
│ - Cache static redirects at edge │
│ - ~80% of hot key traffic served here │
└────────────────────────────────────────┬────────────────────────────────────────────┘
│ Cache miss
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Layer 2: Application Server (Local Cache + Hot Key Detection) │
│ - Local in-memory cache for hot keys │
│ - Hot key detector marks trending URLs │
│ - ~15% of remaining traffic served here │
└────────────────────────────────────────┬────────────────────────────────────────────┘
│ Cache miss
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Layer 3: Redis Cluster (Distributed Cache) │
│ - Split keys for known hot URLs │
│ - Request coalescing for concurrent misses │
│ - ~4% of remaining traffic served here │
└────────────────────────────────────────┬────────────────────────────────────────────┘
│ Cache miss (~1%)
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ Layer 4: Database (Partitioned PostgreSQL) │
│ - Read replicas for hot partitions │
│ - Only cold keys and cache misses reach here │
└─────────────────────────────────────────────────────────────────────────────────────┘
4.3 Implementation
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional, Set
import random
import time
import threading
import redis
import psycopg2.pool
@dataclass
class URLMapping:
short_code: str
original_url: str
created_at: float
is_hot: bool = False
class SkewResistantURLShortener:
"""
URL shortener designed to handle highly skewed traffic.
0.01% of URLs can receive 90% of traffic without degradation.
"""
def __init__(
self,
redis_cluster: redis.RedisCluster,
db_pool: psycopg2.pool.ThreadedConnectionPool,
hot_key_split_factor: int = 10
):
self.redis = redis_cluster
self.db_pool = db_pool
self.split_factor = hot_key_split_factor
# Layer 2: Local cache for hot keys
self.local_cache: dict[str, tuple[str, float]] = {} # key -> (value, expiry)
self.local_cache_lock = threading.Lock()
self.local_cache_ttl = 5 # seconds
# Hot key tracking
self.hot_keys: Set[str] = set()
self.hot_key_detector = CMSHotKeyDetector(
hot_threshold=1000, # 1000 req/min = hot
window_seconds=60
)
        # Request coalescing: the coalescer calls .get() on the client it's given,
        # so hand it a thin adapter around our database fetch
        self.coalescer = SyncRequestCoalescer(SimpleNamespace(get=self._db_get))
# Metrics
self.stats = {
'local_hits': 0,
'redis_hits': 0,
'db_hits': 0,
'total': 0
}
def redirect(self, short_code: str) -> Optional[str]:
"""
Get the original URL for a short code.
Multi-layer caching with hot key optimization.
"""
self.stats['total'] += 1
# Track for hot key detection
is_hot = self.hot_key_detector.record(short_code)
if is_hot and short_code not in self.hot_keys:
self._promote_to_hot(short_code)
# Layer 2: Check local cache (hot keys only)
if short_code in self.hot_keys:
url = self._check_local_cache(short_code)
if url:
self.stats['local_hits'] += 1
return url
# Layer 3: Check Redis (with split keys for hot keys)
url = self._check_redis(short_code)
if url:
self.stats['redis_hits'] += 1
self._update_local_cache(short_code, url)
return url
# Layer 4: Database (with coalescing)
url = self.coalescer.get(short_code)
if url:
self.stats['db_hits'] += 1
self._update_redis(short_code, url)
self._update_local_cache(short_code, url)
return url
def _check_local_cache(self, key: str) -> Optional[str]:
"""Check local in-memory cache."""
with self.local_cache_lock:
if key in self.local_cache:
value, expiry = self.local_cache[key]
if time.time() < expiry:
return value
else:
del self.local_cache[key]
return None
def _update_local_cache(self, key: str, value: str):
"""Update local cache for hot keys."""
if key in self.hot_keys:
with self.local_cache_lock:
self.local_cache[key] = (value, time.time() + self.local_cache_ttl)
def _check_redis(self, key: str) -> Optional[str]:
"""Check Redis, using split keys for hot keys."""
if key in self.hot_keys:
# Read from random split
suffix = random.randint(0, self.split_factor - 1)
redis_key = f"url:{key}:{suffix}"
else:
redis_key = f"url:{key}"
return self.redis.get(redis_key)
def _update_redis(self, key: str, value: str):
"""Update Redis, writing to all splits for hot keys."""
if key in self.hot_keys:
# Write to all splits
pipe = self.redis.pipeline()
for i in range(self.split_factor):
pipe.setex(f"url:{key}:{i}", 3600, value) # 1 hour TTL
pipe.execute()
else:
self.redis.setex(f"url:{key}", 300, value) # 5 min TTL
def _db_get(self, key: str) -> Optional[str]:
"""Fetch from database."""
conn = self.db_pool.getconn()
try:
with conn.cursor() as cur:
cur.execute(
"SELECT original_url FROM urls WHERE short_code = %s",
(key,)
)
result = cur.fetchone()
return result[0] if result else None
finally:
self.db_pool.putconn(conn)
def _promote_to_hot(self, key: str):
"""Promote a key to hot status."""
print(f"🔥 Promoting {key} to hot key")
self.hot_keys.add(key)
# Pre-populate all Redis splits
url = self.redis.get(f"url:{key}") or self._db_get(key)
if url:
self._update_redis(key, url)
def _demote_from_hot(self, key: str):
"""Demote a key from hot status."""
print(f"❄️ Demoting {key} from hot key")
self.hot_keys.discard(key)
# Clean up local cache
with self.local_cache_lock:
self.local_cache.pop(key, None)
# Clean up split keys (keep one)
for i in range(1, self.split_factor):
self.redis.delete(f"url:{key}:{i}")
def get_stats(self) -> dict:
"""Return cache hit statistics."""
total = self.stats['total'] or 1
return {
'total_requests': total,
'local_hit_rate': self.stats['local_hits'] / total,
'redis_hit_rate': self.stats['redis_hits'] / total,
'db_hit_rate': self.stats['db_hits'] / total,
'hot_key_count': len(self.hot_keys)
}
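A minimal wiring sketch for the class above; the Redis Cluster node, Postgres DSN, and short code below are placeholders, not values from this design:

# Hypothetical wiring; hostnames, ports, and the DSN are placeholders.
from psycopg2 import pool as pg_pool
from redis.cluster import RedisCluster, ClusterNode

redis_cluster = RedisCluster(
    startup_nodes=[ClusterNode("localhost", 7000)],
    decode_responses=True
)
db_pool = pg_pool.ThreadedConnectionPool(1, 10, dsn="dbname=urls user=app")

shortener = SkewResistantURLShortener(redis_cluster, db_pool, hot_key_split_factor=10)
print(shortener.redirect("abc123"))   # None until 'abc123' exists in the urls table
print(shortener.get_stats())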
4.4 CDN Configuration for Hot URLs
Configure your CDN to cache redirects:
# Cloudflare Worker for URL shortener caching
WORKER_CODE = """
addEventListener('fetch', event => {
event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
const url = new URL(request.url)
const shortCode = url.pathname.slice(1) // Remove leading /
// Check edge cache
const cache = caches.default
const cacheKey = new Request(url.toString(), request)
let response = await cache.match(cacheKey)
if (response) {
// Cache hit - add header for debugging
response = new Response(response.body, response)
response.headers.set('X-Cache', 'HIT')
return response
}
// Cache miss - fetch from origin
response = await fetch(request)
// Cache successful redirects
if (response.status === 301 || response.status === 302) {
const cacheResponse = response.clone()
// Cache for 1 hour at edge
const headers = new Headers(cacheResponse.headers)
headers.set('Cache-Control', 'public, max-age=3600')
const cachedResponse = new Response(cacheResponse.body, {
status: cacheResponse.status,
statusText: cacheResponse.statusText,
headers: headers
})
    // Note: a deployed Worker would pass the event in and use event.waitUntil;
    // awaiting here keeps this sketch self-contained
    await cache.put(cacheKey, cachedResponse)
}
  // Clone into a mutable Response before adding the debug header
  response = new Response(response.body, response)
  response.headers.set('X-Cache', 'MISS')
return response
}
"""
4.5 Traffic Analysis
Let's analyze the traffic flow with our new architecture:
Total traffic: 10,000 req/sec
Without hot key handling:
All 10,000 req/sec hit Redis/DB
Hot partitions: 5,000-7,000 req/sec each
With hot key handling:
Layer 1 (CDN): 80% of hot key traffic = ~7,200 req/sec served at edge
Layer 2 (Local): 15% of remaining = ~420 req/sec served locally
Layer 3 (Redis): 4% of remaining = ~95 req/sec to Redis
Layer 4 (DB): 1% of remaining = ~25 req/sec to DB (cache misses)
Remaining traffic to process: ~2,800 req/sec (vs 10,000)
Distributed evenly across partitions: ~280 req/sec each
Latency improvement:
Before: P99 = 500ms (hot partition overloaded)
After: P99 = 20ms (load distributed)
Chapter 5: How Instagram Handles Hot Celebrity Posts
Let's study a real-world system dealing with extreme hot keys.
5.1 The Scale of the Problem
Instagram's challenges:
- Cristiano Ronaldo has 600M+ followers
- A post from him generates millions of interactions in minutes
- Showing his post in 600M feeds requires massive fan-out
- Like counts update millions of times per hour
5.2 Feed Generation: Hybrid Push/Pull
Instagram uses different strategies for different account sizes:
For regular users (< 10K followers): Push on write
User posts → For each follower: Add to their feed cache
- Pre-computed feeds
- Fast reads
- Writes fan out to thousands, not millions
For celebrities (> 10K followers): Pull on read
User opens app → Fetch posts from followed celebrities → Merge with pre-computed feed
- No massive write fan-out
- More work at read time
- Celebrity posts fetched on-demand
class InstagramFeedStrategy:
CELEBRITY_THRESHOLD = 10000
def __init__(self, feed_cache, post_store):
self.feed_cache = feed_cache
self.post_store = post_store
def on_new_post(self, author_id: str, post_id: str, follower_count: int):
if follower_count < self.CELEBRITY_THRESHOLD:
# Push to all followers' feeds
self._push_to_followers(author_id, post_id)
else:
# Just store the post - will be pulled on read
self.post_store.add_celebrity_post(author_id, post_id)
def get_feed(self, user_id: str, followed_celebrities: list[str]) -> list[str]:
# Get pre-computed feed
feed = self.feed_cache.get_feed(user_id)
# Merge celebrity posts
for celeb_id in followed_celebrities:
celeb_posts = self.post_store.get_recent_posts(celeb_id)
feed = self._merge_by_time(feed, celeb_posts)
return feed[:100] # Top 100 posts
5.3 Like Counters: Approximate Counting
Showing exact like counts for viral posts is expensive. Instagram uses approximation:
class ApproximateLikeCounter:
"""
Approximate counting for high-volume counters.
Shows "1.2M likes" instead of "1,234,567 likes".
"""
def __init__(self, redis_client):
self.redis = redis_client
# Exact counts for small numbers
self.exact_threshold = 10000
# Sampling rate for large counts
self.sample_rate = 0.01 # Count 1 in 100
def increment(self, post_id: str):
current = self.get_count(post_id)
if current < self.exact_threshold:
# Exact counting for small numbers
self.redis.incr(f"likes:{post_id}")
else:
# Probabilistic increment for large numbers
if random.random() < self.sample_rate:
self.redis.incrby(f"likes:{post_id}", int(1 / self.sample_rate))
def get_count(self, post_id: str) -> int:
return int(self.redis.get(f"likes:{post_id}") or 0)
def get_display_count(self, post_id: str) -> str:
count = self.get_count(post_id)
if count < 1000:
return str(count)
elif count < 1000000:
return f"{count // 1000}K"
else:
return f"{count / 1000000:.1f}M"
5.4 Sharding by Post ID with Hot Shard Detection
class PostShardManager:
"""
Manages post data across shards with hot shard mitigation.
"""
def __init__(self, num_shards: int = 100):
self.num_shards = num_shards
self.shard_load: dict[int, int] = defaultdict(int)
self.hot_posts: set[str] = set()
def get_shard(self, post_id: str) -> int:
return hash(post_id) % self.num_shards
def record_access(self, post_id: str):
shard = self.get_shard(post_id)
self.shard_load[shard] += 1
# Detect hot shards
avg_load = sum(self.shard_load.values()) / self.num_shards
if self.shard_load[shard] > avg_load * 5:
self._handle_hot_shard(shard)
def _handle_hot_shard(self, shard: int):
# Options:
# 1. Add read replicas for this shard
# 2. Migrate some data to less loaded shards
# 3. Cache aggressively for this shard
print(f"⚠️ Hot shard detected: {shard}")
5.5 Key Lessons from Instagram
- Differentiate by popularity: Different strategies for different scales
- Accept approximation: Users don't need exact counts
- Defer work: Pull on read for expensive fan-outs
- Layer caching: CDN → App cache → Redis → Database
- Monitor and adapt: Real-time hot key detection triggers different code paths
Part III: Advanced Topics
Chapter 6: Write-Heavy Hot Keys
So far we've focused on read-heavy hot keys. Write-heavy hot keys are harder.
6.1 The Challenge
Scenario: Live view counter for a viral video
- 1 million concurrent viewers
- Counter incremented every second per viewer
- 1M writes/second to one key
You can't cache writes. Every increment must be recorded.
6.2 Buffering and Batching
Buffer writes locally, flush periodically:
class BufferedCounter:
"""
Buffer increments locally, flush to central store periodically.
Trades some accuracy and latency for throughput.
"""
def __init__(self, redis_client, flush_interval: float = 1.0):
self.redis = redis_client
self.flush_interval = flush_interval
self.local_counts: dict[str, int] = defaultdict(int)
self.lock = threading.Lock()
self._start_flusher()
def increment(self, key: str, amount: int = 1):
with self.lock:
self.local_counts[key] += amount
def _flush(self):
with self.lock:
counts_to_flush = dict(self.local_counts)
self.local_counts.clear()
if counts_to_flush:
pipe = self.redis.pipeline()
for key, count in counts_to_flush.items():
pipe.incrby(f"counter:{key}", count)
pipe.execute()
def _start_flusher(self):
def flush_loop():
while True:
time.sleep(self.flush_interval)
self._flush()
threading.Thread(target=flush_loop, daemon=True).start()
Trade-off: Counter is eventually consistent. During a server crash, buffered counts are lost.
6.3 Distributed Counting with CRDTs
Use a G-Counter (Grow-only Counter) CRDT:
class GCounter:
"""
CRDT G-Counter: Each node maintains its own count.
Total = sum of all node counts.
"""
def __init__(self, node_id: str, redis_client):
self.node_id = node_id
self.redis = redis_client
def increment(self, key: str, amount: int = 1):
# Increment only this node's counter
self.redis.hincrby(f"gcounter:{key}", self.node_id, amount)
def get_total(self, key: str) -> int:
# Sum all nodes' counters
counts = self.redis.hgetall(f"gcounter:{key}")
return sum(int(v) for v in counts.values())
def merge(self, key: str, other_counts: dict[str, int]):
# Merge by taking max of each node's count
pipe = self.redis.pipeline()
for node_id, count in other_counts.items():
# HSET with max
current = int(self.redis.hget(f"gcounter:{key}", node_id) or 0)
if count > current:
pipe.hset(f"gcounter:{key}", node_id, count)
pipe.execute()
Each server only writes to its own hash field—no contention!
6.4 Sharded Counters
Split counter across multiple keys:
class ShardedCounter:
"""
Shard a counter across multiple keys to reduce contention.
"""
def __init__(self, redis_client, num_shards: int = 100):
self.redis = redis_client
self.num_shards = num_shards
def increment(self, key: str, amount: int = 1):
# Pick a random shard
shard = random.randint(0, self.num_shards - 1)
self.redis.incrby(f"counter:{key}:{shard}", amount)
def get_count(self, key: str) -> int:
# Sum all shards
pipe = self.redis.pipeline()
for i in range(self.num_shards):
pipe.get(f"counter:{key}:{i}")
results = pipe.execute()
return sum(int(r or 0) for r in results)
Trade-off: Reads are O(num_shards) instead of O(1). Use when writes >> reads.
Chapter 7: Hot Keys in Different Systems
7.1 Hot Keys in Kafka
Kafka partitions messages by key. Hot keys cause partition imbalance.
# Problem: All messages for user "viral_celebrity" go to one partition
producer.send('events', key='viral_celebrity', value=event)
# Solution 1: Add random suffix for known hot keys
if user_id in known_hot_users:
partition_key = f"{user_id}:{random.randint(0, 9)}"
else:
partition_key = user_id
producer.send('events', key=partition_key, value=event)
# Solution 2: Use null key for hot events (round-robin partitioning)
if is_hot_event(event):
producer.send('events', key=None, value=event)
else:
producer.send('events', key=event.user_id, value=event)
7.2 Hot Keys in DynamoDB
DynamoDB throttles partitions, not tables. Hot partition = throttled requests.
# Problem: All items for "global_config" on one partition
table.get_item(Key={'pk': 'global_config', 'sk': 'settings'})
# Solution 1: Write-sharding - duplicate the item across N partition keys,
# then read any one copy at random to spread the load
def get_global_config():
    shards = [f"global_config#{i}" for i in range(10)]
    shard = random.choice(shards)
    return table.get_item(Key={'pk': shard, 'sk': 'settings'})
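# Write side of Solution 1 (sketch): duplicate the config item to every shard
# so any one copy can serve a read; shard count and attribute names are illustrative
def put_global_config(settings: dict, num_shards: int = 10):
    for i in range(num_shards):
        table.put_item(Item={'pk': f'global_config#{i}', 'sk': 'settings', **settings})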
# Solution 2: DAX (DynamoDB Accelerator) for read caching
dax_client = amazondax.AmazonDaxClient.resource(endpoint_url='...')
dax_table = dax_client.Table('my-table')
# Hot reads served from cache
7.3 Hot Keys in Elasticsearch
One popular document can make its shard a bottleneck.
# Solution: Route hot documents to dedicated indices
def get_index_for_document(doc_id: str, is_hot: bool) -> str:
if is_hot:
return 'hot_documents' # Index with more shards/replicas
else:
return 'documents'
# Hot index configuration
{
"settings": {
"number_of_shards": 10, # More shards = more parallelism
"number_of_replicas": 5 # More replicas = handle more reads
}
}
Part IV: Discussion and Trade-offs
Chapter 8: The Hard Questions
8.1 "Redesign URL shortener assuming 0.01% of URLs get 90% of traffic"
Complete answer:
"The core insight is that we need different handling for hot vs cold URLs.
For the 99.99% of cold URLs, our existing architecture works fine—hash partition by short_code, standard caching.
For the 0.01% hot URLs, we implement:
- Detection: Count-Min Sketch tracks request frequency per URL with fixed memory. When a URL exceeds threshold (e.g., 1000 req/min), mark it hot.
- Multi-layer caching:
  - CDN edge: Cache redirects for 1 hour. 80%+ of hot traffic served here.
  - Local app cache: 5-second TTL for hot URLs. Fast, no network hop.
  - Redis with key splitting: Hot URLs split across 10 keys, spreading load.
- Request coalescing: When cache expires and multiple requests hit simultaneously, only one fetches from DB; others wait.
- Graceful promotion/demotion: URLs smoothly transition between hot and cold status. Promotion pre-warms caches; demotion cleans up split keys.
The result: hot URLs are served from memory with sub-millisecond latency. Database sees only cache misses—maybe 1% of total traffic.
Trade-offs I'm accepting:
- Slightly stale data for hot URLs (TTL-based caching)
- Memory overhead for tracking hot keys
- Complexity in the caching layer"
8.2 "How does Instagram handle hot celebrity posts?"
Complete answer:
"Instagram faces an extreme version of the hot key problem. A celebrity with 600M followers posting creates three challenges: feed distribution, interaction counting, and content serving.
Feed distribution: They use a hybrid push-pull model. For regular users (<10K followers), posts are pushed to followers' pre-computed feeds. For celebrities, posts are not fanned out—instead, they're pulled at read time and merged into the feed. This trades read latency for write scalability.
Interaction counting: Exact counts are unnecessary and expensive. They use approximate counting—probabilistically sample increments when counts exceed a threshold. Displaying '1.2M likes' doesn't need precision.
Content serving: Multi-layer caching with CDN at the edge. A viral photo is cached globally. The origin serves a tiny fraction of requests.
Key architectural insight: Celebrity accounts are fundamentally different entities in the system. They have different code paths, different storage strategies, even different SLAs. Instead of trying to make one system handle all scales, they explicitly bifurcate the architecture."
8.3 "How do you decide between caching, sharding, and replication for hot keys?"
Decision framework:
| Characteristic | Best Strategy |
|---|---|
| Read-heavy, tolerance for stale | Caching (simplest, most effective) |
| Read-heavy, needs freshness | Read replicas |
| Write-heavy | Sharded counters or batching |
| Predictable hot keys | Pre-provision dedicated resources |
| Unpredictable hot keys | Automatic detection + dynamic mitigation |
| Need strong consistency | Replication + careful invalidation |
"I'd start with caching—it solves 90% of hot key problems. Add replication for capacity. Use sharding only when writes are the bottleneck. The order matters because each adds complexity."
Chapter 9: Session Summary
What You Should Know Now
After this session, you should be able to:
- Explain why hot keys happen (Zipf's law, viral content, temporal patterns)
- Detect hot keys using streaming algorithms (Count-Min Sketch) and monitoring
- Apply multiple mitigation strategies (caching, splitting, replication, coalescing)
- Design systems for skewed workloads (URL shortener with 0.01% hot keys)
- Analyze real-world systems (Instagram's celebrity handling)
Key Trade-offs to Remember
| Decision | Trade-off |
|---|---|
| More caching | Lower latency vs Stale data |
| Key splitting | Distributed load vs Write amplification |
| Local caching | Fastest reads vs Consistency challenges |
| Approximate counting | Scalability vs Precision |
| Hybrid push/pull | Write scalability vs Read complexity |
Questions to Ask in Every Design
- What's the expected traffic distribution? (Uniform? Zipf?)
- Which keys might go viral? Can we predict them?
- What's acceptable staleness for hot keys?
- How do we detect when a key becomes hot?
- What's the fallback if our hot key mitigation fails?
Part V: Interview Questions and Answers
Chapter 10: Real-World Interview Scenarios
10.1 Conceptual Questions
Question 1: "What is a hot key and why is it a problem?"
Interviewer's Intent: Testing fundamental understanding.
Strong Answer:
"A hot key is a single key that receives a disproportionate share of traffic—often thousands of times more than the average key. It's a problem because in distributed systems, keys are typically partitioned across nodes. All requests for a hot key go to one node, creating a bottleneck while other nodes sit idle.
The impact is severe: the hot partition becomes overloaded, latency spikes, and if health checks fail, it can cascade to other nodes. I've seen a single viral URL take down an entire URL shortener because it overwhelmed its partition.
The underlying cause is usually Zipf's law—in most systems, a small fraction of items (1%) get most of the traffic (80%+). Viral content, celebrity accounts, and popular products are common examples."
Question 2: "Explain the Count-Min Sketch and why it's useful for hot key detection."
Interviewer's Intent: Testing knowledge of probabilistic data structures.
Strong Answer:
"Count-Min Sketch is a probabilistic data structure for frequency estimation with fixed memory. It uses a 2D array of counters and multiple hash functions.
To increment: hash the key with each function, increment the counter at each resulting position. To query: hash the key, return the minimum of all counters (hence 'min').
It has interesting properties:
- May overestimate counts (hash collisions add to counts)
- Never underestimates
- Memory is fixed regardless of unique key count
- Accuracy improves with more hash functions and wider array
For hot key detection, it's perfect because:
- We need approximate counts, not exact (just need to know if key is 'hot')
- We can't store timestamps for every request (would need unbounded memory)
- False positives are acceptable (treating a warm key as hot is fine)
- False negatives don't happen (the sketch never undercounts, so a truly hot key always crosses the threshold)
I'd use it as a first filter, then do exact counting only for keys that pass the threshold."
Question 3: "Compare different strategies for handling write-heavy hot keys."
Interviewer's Intent: Testing depth of knowledge.
Strong Answer:
"Write-heavy hot keys are harder than read-heavy because you can't cache writes. Three main strategies:
Buffering/Batching: Each app server buffers writes locally, flushes periodically to the database. A like counter might buffer for 1 second before sending an INCRBY for the sum. Trade-off: Potential data loss if server crashes, latency in seeing updates.
Sharded Counters: Split one counter into N counters across partitions. Writes go to a random shard; reads sum all shards. Trade-off: Reads become O(N) instead of O(1). Best when writes vastly exceed reads.
CRDTs (G-Counters): Each node maintains its own count. Total is the sum. Nodes can increment independently without coordination—perfect for distributed counting. Trade-off: More storage (one field per node), requires aggregation for reads.
For a live viewer count with 1M updates/second, I'd use sharded counters with maybe 1000 shards. That's 1000 updates/second per shard—manageable. Reads can be cached since they don't need to be perfectly real-time."
10.2 Design Questions
Question 4: "Design a system to handle trending hashtags on Twitter."
Interviewer's Intent: Testing end-to-end design for hot keys.
Strong Answer:
"Trending hashtags are the epitome of hot keys—previously unknown hashtags suddenly get millions of tweets.
Detection layer: Stream processing (Kafka + Flink) counts hashtag occurrences in sliding windows. Compare current rate to historical baseline. Hashtag with 10x normal rate → potentially trending.
Storage strategy: Two tiers:
- Cold hashtags: Standard partitioning by hashtag, normal caching
- Hot/trending hashtags: Dedicated 'trending' partition with extra replicas, aggressive caching, sharded counters for counts
Counting: Since we're displaying approximate counts ('10.2K tweets'), use probabilistic counting. Sample 1 in 100 tweets for very hot hashtags. HyperLogLog for unique user counts.
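A minimal sketch of the unique-user piece, assuming redis-py and Redis's built-in HyperLogLog commands (PFADD/PFCOUNT); the key naming is illustrative:

# Sketch: approximate unique users per hashtag with Redis HyperLogLog.
# Fixed memory (~12 KB per counter) with ~0.81% standard error.
import redis

r = redis.Redis()

def record_tweet(hashtag: str, user_id: str):
    r.pfadd(f"hashtag:uniques:{hashtag}", user_id)

def unique_users(hashtag: str) -> int:
    return r.pfcount(f"hashtag:uniques:{hashtag}")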
Serving: The 'trending' page itself is a hot key. Cache it at CDN edge with 30-second TTL. Different regions can show slightly different trending lists—eventual consistency is fine.
Automatic scaling: When a hashtag crosses threshold, automatically:
- Add it to local caches on all app servers
- Create sharded counters
- Route to dedicated infrastructure
- Alert ops team
When it cools down (rate drops below threshold for 1 hour), demote back to normal handling.
Key insight: Treat trending hashtags as a special class with dedicated infrastructure, not as regular hashtags that happen to be popular."
Question 5: "Design a view counter for YouTube videos."
Interviewer's Intent: Testing design for extreme write volumes.
Strong Answer:
"YouTube's challenge: Some videos get millions of views per hour. Every view needs counting.
Requirements clarification:
- Eventual consistency is fine (view count can lag)
- Approximate counts acceptable for display (1.2M not 1,234,567)
- Need accurate counts for monetization (but can be delayed)
Architecture:
View event → Kafka → Stream Processor → Real-time Counter
│
└→ Batch Processor → Accurate Counter (hourly)
Real-time path (for display):
- Events buffered per app server, flushed every second
- Sharded counters in Redis (1000 shards for hot videos)
- Read aggregates all shards, caches result for 10 seconds
Batch path (for monetization):
- All events stored in data lake
- Hourly job computes exact counts with deduplication
- These are the 'official' counts for revenue calculation
Hot video detection:
- Count-Min Sketch identifies videos exceeding threshold
- Hot videos get promoted to dedicated counter infrastructure
- Separate Redis cluster for just hot videos
Display optimization:
- Cache view counts at CDN
- Format as '1.2M views' (no need for precision)
- Update cached count every 30 seconds
Trade-offs:
- Real-time count may differ from batch count (eventual consistency)
- Display count is approximate (user expectation managed via formatting)
- Infrastructure cost scales with hot video count, not total video count"
10.3 Scenario-Based Questions
Question 6: "Your database partition is getting hot due to a viral product. What do you do?"
Interviewer's Intent: Testing incident response.
Strong Answer:
"Immediate triage:
- Confirm the issue: Check metrics—is it actually one product causing the load? Look at query patterns, partition load metrics.
- Immediate mitigation:
  - Add the product to application-level cache with longer TTL
  - If possible, route reads to replicas
  - Consider temporarily reducing cache TTL for other products (free up cache space)
- If still overloaded:
  - Add read replicas to the hot partition specifically
  - Implement request rate limiting for that product's API endpoints
  - Last resort: return slightly stale data (serve from cache even if expired, backfill async)
After immediate crisis:
- Longer-term fixes:
  - Implement hot key detection to catch this automatically next time
  - Add infrastructure for hot key handling (key splitting, dedicated caches)
  - Set up alerts for partition load imbalance
- Post-incident review:
  - Why didn't we detect this before it caused problems?
  - How can we predict viral products? (Marketing coordination, trending detection)
  - What's the cost of permanent hot-key infrastructure vs occasional incidents?
The key is layered response: immediate cache, then replicas, then architectural changes. Don't over-engineer for every possible hot key, but have a playbook ready."
Question 7: "How would you test your hot key handling?"
Interviewer's Intent: Testing quality and operational thinking.
Strong Answer:
"Testing hot keys requires both synthetic tests and production validation.
Unit tests: Test individual components
def test_count_min_sketch_detects_hot_key():
cms = CountMinSketch()
for _ in range(10000):
cms.increment('hot_key')
for _ in range(10):
cms.increment('cold_key')
assert cms.estimate('hot_key') > 9000
assert cms.estimate('cold_key') < 100
def test_key_splitting_distributes_load():
splitter = SplitKeyStrategy(split_factor=10)
splitter.mark_hot('hot_key')
# Verify reads distribute across splits
splits_hit = set()
for _ in range(1000):
key = splitter.get_read_key('hot_key')
splits_hit.add(key.split(':')[-1])
assert len(splits_hit) == 10 # All splits used
Load tests: Simulate hot key traffic
def test_system_handles_hot_key():
    # Generate traffic: 1000 req/sec to one key, 10 req/sec to others
    # Verify:
    # - Latency stays under SLA (p99 < 100ms)
    # - No errors
    # - Hot key detected within 60 seconds
    # - Mitigation activates automatically
    ...
Chaos tests: Inject failures during hot key handling
- Kill cache node holding hot key
- Introduce latency in hot key detection
- Verify graceful degradation
Production testing:
- Shadow mode: Log what would happen, don't actually activate
- Canary: Route 1% of traffic through hot key path
- Game day: Artificially create hot key (with rate limits) and verify response
Monitoring to validate:
- Hot key detection latency (time from hot to detected)
- Mitigation activation time
- False positive rate (keys marked hot that weren't)
- Cache hit rate during hot key events"
10.4 Deep-Dive Questions
Question 8: "How do different databases handle hot partitions?"
Interviewer's Intent: Testing breadth of knowledge.
Strong Answer:
"Different databases have different approaches:
DynamoDB: Adaptive capacity automatically shifts throughput to hot partitions. But there's still a per-partition limit (~3000 RCU). For sustained hot keys, you need application-level handling like write sharding.
Cassandra: Token-aware clients can route around hot nodes, but fundamentally one token range = one set of replicas. You can add replicas or split the token range. Cassandra 4.0+ has improved hot partition handling with better compaction.
MongoDB: With sharded clusters, you can split hot chunks. MongoDB Atlas has auto-scaling based on load. But the shard key choice determines how traffic distributes—a bad shard key can't be fixed by auto-scaling.
Redis Cluster: Each slot has a primary + replicas. Hot slots can have reads served from replicas. But for write-hot keys, you're limited to what one node can handle. Key splitting at the application level is necessary.
CockroachDB/Spanner: Automatic range splitting when a range gets too hot. This is one of their selling points—the database handles hot spots transparently. But there are still limits per range.
Common theme: All databases have some hot key handling, but application-level mitigation is usually necessary for extreme cases. The database can buy you time; it can't solve the fundamental problem of one logical key = one location."
Question 9: "Explain how you'd implement automatic hot key detection and mitigation."
Interviewer's Intent: Testing system design depth.
Strong Answer:
"I'd build it as a service that plugs into the request path:
Request → Hot Key Service → Application → Response
│
└→ Prometheus (metrics)
└→ Action Triggers
Detection pipeline:
- Streaming counter: Count-Min Sketch updated on every request. Fixed memory, O(1) updates.
- Threshold checker: Every second, scan for keys exceeding the threshold. Use an approximate top-k algorithm such as Space-Saving or Lossy Counting (a minimal sketch follows this list).
- Confirmation: Keys exceeding the CMS threshold get promoted to exact counting for 10 seconds. If the exact count confirms, the key is 'hot'.
- Classification: Determine hot key type:
  - Read-heavy: 90%+ reads → caching strategy
  - Write-heavy: 90%+ writes → sharding strategy
  - Mixed: hybrid approach
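A minimal Space-Saving sketch for the top-k tracking mentioned above; the capacity value is an arbitrary choice for illustration, not production tuning:

class SpaceSaving:
    """Track approximate top-k keys with at most `capacity` counters."""
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.counts: dict[str, int] = {}    # key -> estimated count
        self.errors: dict[str, int] = {}    # key -> maximum possible overcount

    def record(self, key: str):
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.capacity:
            self.counts[key] = 1
            self.errors[key] = 0
        else:
            # Replace the current minimum; the new key inherits its count,
            # which bounds how much we may be overestimating it
            victim = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(victim)
            self.errors.pop(victim)
            self.counts[key] = min_count + 1
            self.errors[key] = min_count

    def top_k(self, k: int = 10) -> list[tuple[str, int]]:
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:k]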
Mitigation actions:
class HotKeyMitigator:
def on_hot_key_detected(self, key: str, key_type: str):
if key_type == 'read_heavy':
self.enable_aggressive_caching(key)
self.enable_local_caching(key)
self.notify_cdn_to_cache(key)
elif key_type == 'write_heavy':
self.enable_key_splitting(key)
self.enable_write_buffering(key)
self.alert_ops_team(key)
def on_hot_key_cooled(self, key: str):
self.disable_aggressive_caching(key)
self.cleanup_split_keys(key)
self.alert_ops_team_resolved(key)
Observability:
- Dashboard: Current hot keys, their type, mitigation active
- Alerts: New hot key detected, mitigation failed, hot key causing latency impact
- Logs: Detailed timeline for debugging
Failure modes:
- Detection too slow: Increase sampling rate
- False positives: Tune threshold, add confirmation period
- Mitigation fails: Fallback to rate limiting to protect system
The system should be autonomous for most cases but page humans for anomalies."
Chapter 11: Interview Preparation Checklist
Before your interview, make sure you can:
Concepts
- Explain Zipf's law and why traffic is skewed
- Describe Count-Min Sketch and Space-Saving algorithms
- List five strategies for hot key mitigation with trade-offs
Detection
- Design a streaming hot key detector
- Set up monitoring and alerting for hot keys
- Explain predictive vs reactive detection
Mitigation
- Implement multi-layer caching
- Design key splitting strategy
- Handle write-heavy hot keys (batching, sharding)
Real Systems
- Explain how Instagram handles celebrity accounts
- Know how DynamoDB, Redis, Cassandra handle hot partitions
- Design for specific hot key scenarios (trending hashtags, viral videos)
Exercises
Exercise 1: Hot Key Detection
Implement a hot key detector that:
- Uses Count-Min Sketch for initial filtering
- Promotes suspected hot keys to exact counting
- Supports dynamic threshold adjustment
- Reports hot key statistics
Exercise 2: Cache Stampede Prevention
Implement a cache that prevents stampedes when hot keys expire:
- Probabilistic early expiration (a starting-point sketch follows this exercise)
- Request coalescing
- Background refresh for hot keys
- Measure and report hit rates
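As a starting point for the probabilistic-early-expiration bullet, here is a minimal sketch of the common "XFetch" formulation (the dict-based cache, function names, and parameters are illustrative): each reader may refresh slightly before the real expiry, with a probability that grows as expiry approaches, so one request refreshes the hot key instead of a stampede.

import math
import random
import time

def fetch_with_early_expiration(cache: dict, key, recompute, ttl=300, beta=1.0):
    """
    cache maps key -> (value, delta, expiry); recompute() returns a fresh value.
    delta is the observed recompute cost; larger delta or beta means earlier refresh.
    """
    entry = cache.get(key)
    now = time.time()
    if entry is not None:
        value, delta, expiry = entry
        # -log(random()) is positive, so this nudges the effective "now" forward;
        # the nudge is occasionally large enough to trigger an early refresh.
        if now - delta * beta * math.log(random.random()) < expiry:
            return value
    start = time.time()
    value = recompute()
    delta = time.time() - start
    cache[key] = (value, delta, now + ttl)
    return value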
Exercise 3: Instagram Feed Design
Design the feed generation system for a social network:
- 500M users
- Average user follows 200 accounts
- 0.01% of accounts have >1M followers
- Show the 100 most recent posts from followed accounts
- Handle both posting and reading efficiently (one common shape is sketched below)
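One common shape for this exercise (not the only valid answer) is hybrid fan-out: push posts from ordinary accounts at write time, pull from celebrity accounts at read time. The sketch below uses illustrative in-memory dicts as stand-ins for real stores; posts are dicts with a 'ts' timestamp.

def publish_post(author, post, followers, timelines, celebrity_posts,
                 celebrity_threshold=1_000_000):
    """Hybrid fan-out: push for normal accounts, store-only for celebrities."""
    if len(followers) >= celebrity_threshold:
        # Too many followers to push to: store once, merge at read time.
        celebrity_posts.setdefault(author, []).append(post)
    else:
        for follower in followers:
            timelines.setdefault(follower, []).append(post)

def read_feed(user, following, timelines, celebrity_posts, limit=100):
    """Merge the precomputed timeline with live pulls from followed celebrities."""
    posts = list(timelines.get(user, []))
    for account in following:
        posts.extend(celebrity_posts.get(account, []))
    return sorted(posts, key=lambda p: p["ts"], reverse=True)[:limit]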
Further Reading
- "Designing Data-Intensive Applications" Chapter 6: Partitioning and hot spots
- Instagram Engineering Blog: "Sharding & IDs at Instagram"
- Discord Blog: "How Discord Stores Billions of Messages"
- Twitter Blog: "How We Index 250 Million Tweets a Day"
- Amazon DynamoDB Paper: Adaptive capacity and hot partition handling
Appendix: Code Reference
A.1 Production Hot Key Handler
import threading
import time
import random
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set, Optional, Callable
from enum import Enum

import mmh3
import numpy as np


class HotKeyType(Enum):
    READ_HEAVY = "read_heavy"
    WRITE_HEAVY = "write_heavy"
    MIXED = "mixed"


@dataclass
class HotKeyInfo:
    key: str
    detected_at: float
    request_count: int
    key_type: HotKeyType
    mitigation_active: bool = False
class ProductionHotKeyHandler:
    """
    Production-ready hot key detection and mitigation system.

    Features:
    - Count-Min Sketch for memory-efficient detection
    - Automatic classification (read/write heavy)
    - Pluggable mitigation strategies
    - Graceful promotion/demotion
    - Comprehensive metrics
    """

    def __init__(
        self,
        hot_threshold: int = 1000,   # requests per window
        window_seconds: int = 60,
        cms_width: int = 10000,
        cms_depth: int = 7,
        on_hot_key: Optional[Callable[[HotKeyInfo], None]] = None,
        on_cool_key: Optional[Callable[[str], None]] = None
    ):
        self.hot_threshold = hot_threshold
        self.window_seconds = window_seconds
        self.on_hot_key = on_hot_key
        self.on_cool_key = on_cool_key

        # Count-Min Sketch for approximate counting
        self.cms = np.zeros((cms_depth, cms_width), dtype=np.int64)
        self.cms_depth = cms_depth
        self.cms_width = cms_width

        # Exact counting for suspected hot keys
        self.suspected_keys: Dict[str, list] = defaultdict(list)

        # Confirmed hot keys
        self.hot_keys: Dict[str, HotKeyInfo] = {}

        # Read/write tracking for classification
        self.read_counts: Dict[str, int] = defaultdict(int)
        self.write_counts: Dict[str, int] = defaultdict(int)

        # Thread safety
        self.lock = threading.RLock()

        # Metrics
        self.metrics = {
            'total_requests': 0,
            'hot_key_detections': 0,
            'false_positives': 0,
            'mitigations_activated': 0
        }

        # Background tasks (started last, so all state exists before threads run)
        self._start_background_tasks()
    def record_read(self, key: str) -> bool:
        """Record a read operation. Returns True if key is hot."""
        return self._record(key, is_write=False)

    def record_write(self, key: str) -> bool:
        """Record a write operation. Returns True if key is hot."""
        return self._record(key, is_write=True)

    def _record(self, key: str, is_write: bool) -> bool:
        with self.lock:
            self.metrics['total_requests'] += 1

            # Track read/write ratio
            if is_write:
                self.write_counts[key] += 1
            else:
                self.read_counts[key] += 1

            # Update Count-Min Sketch
            self._cms_increment(key)
            estimate = self._cms_estimate(key)

            # Already confirmed hot: keep exact timestamps flowing so that
            # _check_cooled_keys sees the real current rate.
            if key in self.hot_keys:
                self._track_suspected(key)
                return True

            # Exceeds the suspicion threshold: promote to exact counting
            if estimate >= self.hot_threshold * 0.8:
                self._track_suspected(key)

            return False

    def _cms_increment(self, key: str):
        for i in range(self.cms_depth):
            idx = mmh3.hash(key, i) % self.cms_width
            self.cms[i][idx] += 1

    def _cms_estimate(self, key: str) -> int:
        min_count = float('inf')
        for i in range(self.cms_depth):
            idx = mmh3.hash(key, i) % self.cms_width
            min_count = min(min_count, self.cms[i][idx])
        return int(min_count)
    def _track_suspected(self, key: str):
        """Track suspected hot key with exact timestamps."""
        now = time.time()
        self.suspected_keys[key].append(now)

        # Keep only timestamps within the current window
        cutoff = now - self.window_seconds
        recent = [t for t in self.suspected_keys[key] if t > cutoff]
        self.suspected_keys[key] = recent

        if len(recent) >= self.hot_threshold:
            self._confirm_hot(key, len(recent))

    def _confirm_hot(self, key: str, count: int):
        """Confirm a key as hot and trigger mitigation."""
        if key in self.hot_keys:
            return

        key_type = self._classify_key(key)
        info = HotKeyInfo(
            key=key,
            detected_at=time.time(),
            request_count=count,
            key_type=key_type
        )
        self.hot_keys[key] = info
        self.metrics['hot_key_detections'] += 1

        print(f"🔥 HOT KEY DETECTED: {key} (type: {key_type.value}, count: {count})")

        if self.on_hot_key:
            self.on_hot_key(info)

    def _classify_key(self, key: str) -> HotKeyType:
        """Classify hot key as read-heavy, write-heavy, or mixed."""
        reads = self.read_counts.get(key, 0)
        writes = self.write_counts.get(key, 0)
        total = reads + writes

        if total == 0:
            return HotKeyType.MIXED

        read_ratio = reads / total
        if read_ratio >= 0.9:
            return HotKeyType.READ_HEAVY
        elif read_ratio <= 0.1:
            return HotKeyType.WRITE_HEAVY
        else:
            return HotKeyType.MIXED
    def _check_cooled_keys(self):
        """Check if any hot keys have cooled down."""
        now = time.time()
        cooled = []

        with self.lock:
            for key, info in list(self.hot_keys.items()):
                # Check the current rate within the window
                recent = [t for t in self.suspected_keys.get(key, [])
                          if t > now - self.window_seconds]
                if len(recent) < self.hot_threshold * 0.5:  # 50% of threshold
                    cooled.append(key)

        for key in cooled:
            self._cool_key(key)

    def _cool_key(self, key: str):
        """Handle a key cooling down."""
        with self.lock:
            if key in self.hot_keys:
                del self.hot_keys[key]
                print(f"❄️ Key cooled: {key}")
                if self.on_cool_key:
                    self.on_cool_key(key)

    def _reset_cms(self):
        """Periodically reset CMS to avoid count accumulation."""
        with self.lock:
            self.cms.fill(0)
            self.read_counts.clear()
            self.write_counts.clear()

    def _start_background_tasks(self):
        """Start background monitoring tasks."""
        def check_loop():
            while True:
                time.sleep(5)  # Check every 5 seconds
                self._check_cooled_keys()

        def reset_loop():
            while True:
                time.sleep(self.window_seconds)
                self._reset_cms()

        threading.Thread(target=check_loop, daemon=True).start()
        threading.Thread(target=reset_loop, daemon=True).start()

    def is_hot(self, key: str) -> bool:
        """Check if a key is currently hot."""
        return key in self.hot_keys

    def get_hot_keys(self) -> Dict[str, HotKeyInfo]:
        """Get all currently hot keys."""
        return dict(self.hot_keys)

    def get_metrics(self) -> dict:
        """Get handler metrics."""
        return {
            **self.metrics,
            'current_hot_keys': len(self.hot_keys),
            'suspected_keys': len(self.suspected_keys)
        }
# Example mitigation strategies

class HotKeyMitigationStrategies:
    def __init__(self, cache, db):
        self.cache = cache
        self.db = db
        self.split_factor = 10
        self.split_keys: Set[str] = set()

    def handle_hot_key(self, info: HotKeyInfo):
        """Apply appropriate mitigation based on key type."""
        if info.key_type == HotKeyType.READ_HEAVY:
            self._mitigate_read_heavy(info.key)
        elif info.key_type == HotKeyType.WRITE_HEAVY:
            self._mitigate_write_heavy(info.key)
        else:
            self._mitigate_mixed(info.key)

    def _mitigate_read_heavy(self, key: str):
        """Mitigation for read-heavy hot keys."""
        print(f"Applying read-heavy mitigation for {key}")
        # 1. Extend cache TTL
        value = self.cache.get(key) or self.db.get(key)
        if value:
            self.cache.setex(key, 3600, value)  # 1 hour TTL
        # 2. Enable key splitting for cache reads
        self._enable_split_reads(key, value)

    def _mitigate_write_heavy(self, key: str):
        """Mitigation for write-heavy hot keys."""
        print(f"Applying write-heavy mitigation for {key}")
        # Enable sharded writes
        self.split_keys.add(key)

    def _mitigate_mixed(self, key: str):
        """Mitigation for mixed hot keys."""
        print(f"Applying mixed mitigation for {key}")
        # Apply both strategies
        value = self.cache.get(key) or self.db.get(key)
        self._enable_split_reads(key, value)
        self.split_keys.add(key)

    def _enable_split_reads(self, key: str, value):
        """Distribute value across split keys."""
        if value:
            for i in range(self.split_factor):
                self.cache.setex(f"{key}:{i}", 3600, value)
            self.split_keys.add(key)

    def handle_cool_key(self, key: str):
        """Clean up when key cools down."""
        print(f"Cleaning up mitigation for {key}")
        # Remove split keys
        for i in range(self.split_factor):
            self.cache.delete(f"{key}:{i}")
        self.split_keys.discard(key)

    def get(self, key: str):
        """Get with hot key awareness."""
        if key in self.split_keys:
            # Read from a random split to spread load
            split_key = f"{key}:{random.randint(0, self.split_factor - 1)}"
            return self.cache.get(split_key)
        return self.cache.get(key)

    def set(self, key: str, value: str):
        """Set with hot key awareness."""
        if key in self.split_keys:
            # Write to all splits so any split read sees the new value
            for i in range(self.split_factor):
                self.cache.setex(f"{key}:{i}", 3600, value)
        else:
            self.cache.setex(key, 300, value)
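A.2 Example Wiring
A minimal sketch of how the two classes above could be wired together. FakeCache and FakeDB are illustrative stand-ins for a real Redis client and database; only the calls the handler and strategies actually make are implemented.

class FakeCache(dict):
    """Stand-in cache exposing the Redis-like calls used above."""
    def setex(self, key, ttl, value):
        self[key] = value        # TTL ignored in this stand-in
    def delete(self, key):
        self.pop(key, None)

class FakeDB:
    """Stand-in database that returns a canned value for any key."""
    def get(self, key):
        return f"value-for-{key}"

if __name__ == "__main__":
    strategies = HotKeyMitigationStrategies(cache=FakeCache(), db=FakeDB())
    handler = ProductionHotKeyHandler(
        hot_threshold=100,
        window_seconds=10,
        on_hot_key=strategies.handle_hot_key,
        on_cool_key=strategies.handle_cool_key,
    )

    # Simulate a read-heavy burst against one key
    for _ in range(300):
        handler.record_read("user:celebrity")

    print(handler.get_metrics())
    print(handler.get_hot_keys())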
End of Day 4: Hot Keys and Skew
Tomorrow: Day 5 — Session Store Design. We'll tie together everything from this week by designing a complete session store for 10M concurrent users, including partitioning, replication, and handling datacenter failover.