Week 1 — Day 3: Rate Limiting at Scale
System Design Mastery Series
Preface
Days 1 and 2 taught you how to store and replicate data. Today, you learn how to protect those systems from being overwhelmed.
Rate limiting sounds simple: count requests, reject when over limit. But at scale, rate limiting becomes a distributed systems problem with no perfect solution. You'll face questions like:
- How do you count requests across 50 servers?
- What happens when your rate limit store fails?
- Is it better to accidentally allow too many requests, or accidentally block legitimate ones?
By the end of this session, you'll understand the algorithms, the distributed coordination challenges, and the failure modes that make rate limiting one of the most deceptively complex components in modern systems.
Let's begin.
Part I: Foundations
Chapter 1: Why Rate Limiting Matters
1.1 The Threats You're Defending Against
Rate limiting protects your system from several classes of problems:
Threat 1: Denial of Service (DoS)
A malicious actor floods your API with requests, consuming all available resources. Legitimate users can't get through.
Without rate limiting: One bad actor with a botnet can take down your entire service.
With rate limiting: Each IP/user is limited to N requests per second. The attacker is throttled; legitimate users continue normally.
Threat 2: Resource Exhaustion
A single customer accidentally runs a script that makes 1 million API calls in a loop. They didn't mean to, but they're consuming resources meant for all customers.
Without rate limiting: One runaway script degrades service for everyone.
With rate limiting: The script hits its limit, gets throttled, and the customer fixes their code.
Threat 3: Cost Control
Your service calls expensive downstream APIs (payment processors, AI models, third-party data). Each call costs money.
Without rate limiting: A bug or abuse could generate a massive bill.
With rate limiting: You cap the maximum spend per customer per time period.
Threat 4: Fair Resource Allocation
You have 100 customers sharing infrastructure. One customer is 10x larger than the others. Without limits, they'd consume most resources.
With rate limiting: Each customer gets a fair share (or a share proportional to their plan).
1.2 The Rate Limiting Contract
A rate limit is a contract with your users:
"You can make up to N requests per time period T. If you exceed this, we'll reject additional requests until the period resets."
Example contracts:
- 100 requests per minute per API key
- 1000 requests per hour per IP address
- 10 requests per second per user (with bursting up to 50)
The contract must be:
- Clear: Users know their limits before hitting them
- Predictable: Same behavior every time
- Fair: Applied consistently across all users
- Communicated: Response headers tell users their remaining quota
1.3 Where Rate Limiting Lives
Rate limiting can be implemented at multiple layers:
Internet
│
▼
┌─────────────────┐
│ CDN / Edge │ ← Layer 1: Geographic rate limiting, DDoS protection
└────────┬────────┘
│
▼
┌─────────────────┐
│ Load Balancer │ ← Layer 2: Connection limits, basic throttling
└────────┬────────┘
│
▼
┌─────────────────┐
│ API Gateway │ ← Layer 3: Per-API-key limits, quotas
└────────┬────────┘
│
▼
┌─────────────────┐
│ Application │ ← Layer 4: Business logic limits (e.g., 5 password attempts)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Database │ ← Layer 5: Connection pools, query limits
└─────────────────┘
Today, we focus on Layers 3 and 4: API-level rate limiting that requires distributed coordination.
Chapter 2: Rate Limiting Algorithms
There are five major algorithms for rate limiting. Each has different characteristics for burstiness, memory usage, and accuracy.
2.1 Fixed Window Counter
How It Works
Divide time into fixed windows (e.g., 1-minute intervals). Count requests in each window. Reset the counter when the window changes.
Window 1 (00:00-00:59): [||||||||||] 10 requests → Allow
Window 2 (01:00-01:59): [||||||||||||||||||||] 20 requests → Allow 10, reject 10
Window 3 (02:00-02:59): [|||] 3 requests → Allow
Implementation:
import time

class FixedWindowLimiter:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.counters: dict[str, tuple[int, int]] = {} # key -> (count, window_start)
def is_allowed(self, key: str) -> bool:
now = int(time.time())
window_start = now - (now % self.window_seconds)
count, stored_window = self.counters.get(key, (0, window_start))
# New window - reset counter
if stored_window != window_start:
count = 0
stored_window = window_start
if count < self.limit:
self.counters[key] = (count + 1, stored_window)
return True
return False
Strengths
- Simple: Easy to understand and implement
- Memory efficient: One counter per key per window
- Fast: O(1) lookup and update
Weaknesses
- Boundary problem: A user can make 2x the limit at the boundary between windows
Limit: 10 requests per minute
Window 1 (00:00-00:59): User makes 10 requests at 00:55
Window 2 (01:00-01:59): User makes 10 requests at 01:05
Result: 20 requests in 10 seconds (00:55 to 01:05)
But each window only saw 10 → both allowed!
This "burst at the boundary" can double the effective rate.
2.2 Sliding Window Log
How It Works
Store the timestamp of each request. To check if a new request is allowed, count requests in the past window (sliding).
Limit: 10 requests per 60 seconds
Current time: 12:05:30
Request log for user: [12:04:35, 12:04:40, 12:05:00, 12:05:10, 12:05:15]
↑ older than 60s ago, ignored
Requests in window (12:04:30 to 12:05:30): 4
New request allowed: Yes (4 < 10)
Implementation:
class SlidingWindowLogLimiter:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.logs: dict[str, list[float]] = {}
def is_allowed(self, key: str) -> bool:
now = time.time()
window_start = now - self.window_seconds
# Get existing timestamps, filter out old ones
timestamps = self.logs.get(key, [])
timestamps = [t for t in timestamps if t > window_start]
if len(timestamps) < self.limit:
timestamps.append(now)
self.logs[key] = timestamps
return True
return False
Strengths
- Accurate: No boundary problem—the window truly slides
- Precise: Exact count of requests in any rolling window
Weaknesses
- Memory intensive: Must store timestamp of every request
- Cleanup overhead: Need to periodically remove old timestamps
- Slow at scale: Counting requests requires scanning the log
For high-volume APIs (thousands of requests per second per key), this becomes expensive.
2.3 Sliding Window Counter
How It Works
A hybrid approach: use fixed windows but weight them based on how much of the current sliding window they cover.
Limit: 10 requests per 60 seconds
Current time: 12:01:15 (15 seconds into the new minute)
Previous window (12:00:00-12:00:59): 8 requests
Current window (12:01:00-12:01:59): 3 requests (so far)
Weighted count = (previous × overlap%) + current
= (8 × 75%) + 3
= 6 + 3
= 9
9 < 10 → Request allowed
The "overlap%" is how much of the previous window is still within our 60-second sliding window:
- At 12:01:15, we're 15 seconds into the new window
- So the previous window contributes 75% (45 seconds of overlap / 60 seconds)
Implementation:
class SlidingWindowCounterLimiter:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.counters: dict[str, dict] = {}
def is_allowed(self, key: str) -> bool:
now = time.time()
current_window = int(now // self.window_seconds)
window_position = now % self.window_seconds
data = self.counters.get(key, {
'prev_count': 0,
'prev_window': current_window - 1,
'curr_count': 0,
'curr_window': current_window
})
# Handle window transitions
if data['curr_window'] < current_window:
data['prev_count'] = data['curr_count'] if data['curr_window'] == current_window - 1 else 0
data['prev_window'] = data['curr_window']
data['curr_count'] = 0
data['curr_window'] = current_window
# Calculate weighted count
prev_weight = (self.window_seconds - window_position) / self.window_seconds
weighted_count = (data['prev_count'] * prev_weight) + data['curr_count']
if weighted_count < self.limit:
data['curr_count'] += 1
self.counters[key] = data
return True
return False
Strengths
- Accurate enough: Smooths the boundary problem without storing every timestamp
- Memory efficient: Only two counters per key
- Fast: O(1) operations
Weaknesses
- Approximate: Not perfectly accurate (weighted estimate, not exact count)
- Slightly more complex: Two counters and a window transition to manage, versus one counter for fixed window
This is the sweet spot for most production systems.
2.4 Token Bucket
How It Works
Imagine a bucket that holds tokens. The bucket fills at a constant rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected.
The bucket has a maximum capacity, allowing for controlled bursting.
Configuration: 10 tokens/second rate, 50 token capacity
Time 0: Bucket has 50 tokens (full)
30 requests arrive → 30 allowed, 20 tokens remain
Time 1: Bucket refills 10 tokens → 30 tokens
5 requests arrive → 5 allowed, 25 tokens remain
Time 2: Bucket refills 10 tokens → 35 tokens
0 requests → 35 tokens remain
Time 3: Bucket refills 10 tokens → 45 tokens
60 requests → 45 allowed, 15 rejected, 0 tokens remain
Implementation:
class TokenBucketLimiter:
def __init__(self, rate: float, capacity: int):
self.rate = rate # tokens per second
self.capacity = capacity
self.buckets: dict[str, tuple[float, float]] = {} # key -> (tokens, last_update)
def is_allowed(self, key: str, tokens_requested: int = 1) -> bool:
now = time.time()
current_tokens, last_update = self.buckets.get(key, (self.capacity, now))
# Add tokens based on time elapsed
elapsed = now - last_update
current_tokens = min(self.capacity, current_tokens + (elapsed * self.rate))
if current_tokens >= tokens_requested:
current_tokens -= tokens_requested
self.buckets[key] = (current_tokens, now)
return True
else:
self.buckets[key] = (current_tokens, now)
return False
Strengths
- Allows bursting: Users can burst up to bucket capacity
- Smooth rate over time: Long-term rate converges to the configured rate
- Flexible: Can charge different costs for different operations
Weaknesses
- Bursting can be problematic: If all users burst simultaneously, you might exceed capacity
- Slightly more state: Need to track tokens and timestamp per key
2.5 Leaky Bucket
How It Works
Requests enter a bucket (queue). The bucket "leaks" at a constant rate—requests are processed at a fixed rate regardless of arrival pattern.
Configuration: Process 10 requests/second, queue capacity 50
Requests arrive in burst: [30 requests at once]
Queue: [30 requests waiting]
Processing: 10/second
After 1 second: 10 processed, 20 in queue
After 2 seconds: 20 processed, 10 in queue
After 3 seconds: 30 processed, queue empty
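Implementation (sketch): unlike the previous sections, there is no reference implementation here, so this is a minimal sketch of the "leaky bucket as meter" variant, which accepts or rejects immediately based on a virtual queue level rather than actually delaying requests:

import time

class LeakyBucketLimiter:
    def __init__(self, leak_rate: float, capacity: int):
        self.leak_rate = leak_rate    # requests drained per second
        self.capacity = capacity      # maximum queue depth
        self.buckets: dict[str, tuple[float, float]] = {}  # key -> (level, last_update)

    def is_allowed(self, key: str) -> bool:
        now = time.time()
        level, last_update = self.buckets.get(key, (0.0, now))
        # Drain the virtual queue at the configured rate since the last check
        level = max(0.0, level - (now - last_update) * self.leak_rate)
        if level < self.capacity:
            self.buckets[key] = (level + 1, now)
            return True
        self.buckets[key] = (level, now)
        return False

A true queueing leaky bucket would instead hold the request and release it on the drain schedule, which is where the added latency noted below comes from.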
Strengths
- Smooth output rate: Downstream systems see a constant load
- Prevents bursting: Output is always at the configured rate
Weaknesses
- Adds latency: Requests wait in queue
- Queue management: Need to handle full queue (reject new requests)
- Not suitable for real-time: Users wait even when system has capacity
Leaky bucket is common for smoothing traffic to downstream services, not for user-facing rate limits.
2.6 Algorithm Comparison
| Algorithm | Accuracy | Memory | Burstiness | Best For |
|---|---|---|---|---|
| Fixed Window | Low | Very Low | High (2x at boundary) | Simple use cases |
| Sliding Window Log | Exact | High | None | When accuracy is critical |
| Sliding Window Counter | Good | Low | Low | Most API rate limits |
| Token Bucket | Configurable | Low | Controlled | Allowing bursts, varied costs |
| Leaky Bucket | N/A | Medium | None | Smoothing to downstream |
Recommendation: Start with sliding window counter for API rate limiting. Use token bucket if you need controlled bursting.
Chapter 3: Distributed Rate Limiting
Everything so far assumed a single server. In reality, you have 50 servers, and requests from the same user can hit any of them.
3.1 The Coordination Problem
User makes 100 requests
Limit: 10 requests/minute
Without coordination:
Server 1: Sees 2 requests → Allows all
Server 2: Sees 3 requests → Allows all
Server 3: Sees 2 requests → Allows all
...
Server 50: Sees 2 requests → Allows all
Total allowed: 100 (should have been 10!)
Each server only sees a fraction of the traffic. Without coordination, users can exceed limits by spreading requests across servers.
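To see the problem in miniature, reuse the single-server FixedWindowLimiter from Section 2.1 as if each of five servers had its own copy (a toy sketch with smaller numbers than the 50-server scenario above):

servers = [FixedWindowLimiter(limit=10, window_seconds=60) for _ in range(5)]

allowed = 0
for i in range(50):
    server = servers[i % len(servers)]   # the load balancer spreads requests evenly
    if server.is_allowed("user1"):
        allowed += 1

print(allowed)  # 50 — each server saw only 10 requests, so every request was allowed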
3.2 Solution 1: Centralized Counter (Redis)
Store counters in a central store that all servers access.
┌───────────────────────────────────────────┐
│ Application Servers │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ S1 │ │ S2 │ │ S3 │ ... │ S50 │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
└─────┼──────┼──────┼─────────────┼────────┘
│ │ │ │
└──────┴──────┴─────────────┘
│
▼
┌─────────────────┐
│ Redis Cluster │
│ (Counters) │
└─────────────────┘
Implementation with Redis:
import redis
import time
class RedisRateLimiter:
def __init__(self, redis_client: redis.Redis, limit: int, window_seconds: int):
self.redis = redis_client
self.limit = limit
self.window_seconds = window_seconds
def is_allowed(self, key: str) -> tuple[bool, dict]:
"""
Returns (allowed, info) where info contains remaining quota and reset time.
Uses the sliding window log algorithm (a sorted set of request timestamps).
"""
now = time.time()
window_key = f"ratelimit:{key}"
# Use Redis pipeline for atomic operations
pipe = self.redis.pipeline()
# Remove old entries (outside window)
window_start = now - self.window_seconds
pipe.zremrangebyscore(window_key, 0, window_start)
# Count current entries
pipe.zcard(window_key)
# Add current request (optimistically)
pipe.zadd(window_key, {f"{now}:{id(now)}": now})
# Set expiry on the key
pipe.expire(window_key, self.window_seconds)
results = pipe.execute()
current_count = results[1]
if current_count < self.limit:
remaining = self.limit - current_count - 1
return True, {
'allowed': True,
'remaining': max(0, remaining),
'reset': int(now + self.window_seconds),
'limit': self.limit
}
else:
# Over limit - remove the optimistically added entry
self.redis.zrem(window_key, f"{now}:{id(now)}")
return False, {
'allowed': False,
'remaining': 0,
'reset': int(now + self.window_seconds),
'limit': self.limit
}
Optimized Token Bucket in Redis (Lua script):
-- Token bucket rate limiter as a Lua script (atomic execution)
-- KEYS[1] = rate limit key
-- ARGV[1] = rate (tokens per second)
-- ARGV[2] = capacity (max tokens)
-- ARGV[3] = current timestamp
-- ARGV[4] = tokens requested
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
-- Get current state
local data = redis.call('HMGET', key, 'tokens', 'last_update')
local tokens = tonumber(data[1]) or capacity
local last_update = tonumber(data[2]) or now
-- Calculate token refill
local elapsed = now - last_update
local new_tokens = math.min(capacity, tokens + (elapsed * rate))
-- Check if request can be fulfilled
local allowed = 0
if new_tokens >= requested then
new_tokens = new_tokens - requested
allowed = 1
end
-- Update state
redis.call('HMSET', key, 'tokens', new_tokens, 'last_update', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
return {allowed, new_tokens, capacity}
class RedisTokenBucket:
LUA_SCRIPT = """...""" # The Lua script above
def __init__(self, redis_client: redis.Redis, rate: float, capacity: int):
self.redis = redis_client
self.rate = rate
self.capacity = capacity
self.script = self.redis.register_script(self.LUA_SCRIPT)
def is_allowed(self, key: str, tokens: int = 1) -> tuple[bool, float]:
now = time.time()
result = self.script(
keys=[f"tokenbucket:{key}"],
args=[self.rate, self.capacity, now, tokens]
)
return bool(result[0]), result[1] # (allowed, remaining_tokens)
Strengths
- Accurate: All servers share the same counter
- Simple model: Easy to reason about
Weaknesses
- Single point of failure: If Redis goes down, what happens?
- Latency: Every request requires a Redis round-trip
- Scalability ceiling: Redis cluster has limits
3.3 Solution 2: Local Counters with Synchronization
Each server maintains local counters, periodically syncing with the central store.
Server 1: Server 2:
Local count: 5 Local count: 3
Last sync: 10 sec ago Last sync: 5 sec ago
│ │
└────────────┬─────────────────────┘
│ Periodic sync
▼
┌───────────────┐
│ Redis │
│ Global count │
└───────────────┘
Implementation:
import threading
import time
class LocalSyncRateLimiter:
def __init__(
self,
redis_client: redis.Redis,
local_limit: int,
global_limit: int,
window_seconds: int,
sync_interval: float = 1.0
):
self.redis = redis_client
self.local_limit = local_limit # Per-server allowance
self.global_limit = global_limit
self.window_seconds = window_seconds
self.sync_interval = sync_interval
self.local_counts: dict[str, int] = {}
self.lock = threading.Lock()
# Start sync thread
self.sync_thread = threading.Thread(target=self._sync_loop, daemon=True)
self.sync_thread.start()
def is_allowed(self, key: str) -> bool:
with self.lock:
count = self.local_counts.get(key, 0)
if count < self.local_limit:
self.local_counts[key] = count + 1
return True
return False
def _sync_loop(self):
while True:
time.sleep(self.sync_interval)
self._sync_to_redis()
def _sync_to_redis(self):
with self.lock:
counts_to_sync = self.local_counts.copy()
self.local_counts.clear()
pipe = self.redis.pipeline()
for key, count in counts_to_sync.items():
redis_key = f"ratelimit:{key}"
pipe.incrby(redis_key, count)
pipe.expire(redis_key, self.window_seconds)
pipe.execute()
Strengths
- Lower latency: Local checks are fast
- Resilient: Can continue (briefly) if Redis is unavailable
Weaknesses
- Approximate: Can exceed global limits between syncs
- More complex: Need to tune local vs global limits
- Coordination: Server count affects local limits
3.4 Solution 3: Sticky Sessions
Route requests from the same user to the same server. Each server handles a subset of users.
Load Balancer (sticky sessions by user_id)
│
┌─────────┼─────────┐
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│Server1│ │Server2│ │Server3│
│User A │ │User B │ │User C │
│User D │ │User E │ │User F │
└───────┘ └───────┘ └───────┘
Each server tracks limits only for its assigned users—no coordination needed!
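The routing itself is usually done by the load balancer; conceptually it is just a stable hash from user to server. A toy sketch (server_for_user and the server names are hypothetical):

import hashlib

def server_for_user(user_id: str, servers: list[str]) -> str:
    # Stable assignment: the same user always hashes to the same server,
    # so that server's local counters are authoritative for them.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["server-1", "server-2", "server-3"]
print(server_for_user("user-42", servers))  # always the same server for user-42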
Strengths
- No coordination: Each server is authoritative for its users
- Fast: No remote calls for rate limiting
- Simple: Local data structures only
Weaknesses
- Uneven load: Some users are heavier than others
- Failover complexity: What happens when a server dies?
- Not always possible: Some architectures can't do sticky sessions
3.5 Comparing Distributed Approaches
| Approach | Accuracy | Latency | Failure Mode | Complexity |
|---|---|---|---|---|
| Centralized Redis | High | +1-2ms | Total failure if Redis down | Low |
| Local + Sync | Approximate | Very low | Continues with local limits | Medium |
| Sticky Sessions | High (per server) | Very low | Failover disruption | Low |
| Hybrid | High | Low | Graceful degradation | High |
Chapter 4: Failure Modes and Graceful Degradation
The most important question in distributed rate limiting:
When your rate limit store fails, do you over-allow (let requests through) or under-allow (reject requests)?
4.1 The Over-Allow vs Under-Allow Trade-off
Over-allow (fail open):
- If Redis is down, allow all requests
- Pro: Legitimate users aren't blocked
- Con: No protection during outage; abuse possible
Under-allow (fail closed):
- If Redis is down, reject all requests
- Pro: System protected from overload
- Con: Legitimate users blocked during outage
Neither is universally correct. The choice depends on what you're protecting.
4.2 Decision Framework
| Scenario | Recommendation | Reasoning |
|---|---|---|
| DDoS protection | Under-allow | Better to block everyone than let attack through |
| API monetization | Over-allow | Don't lose paying customers due to your bug |
| Resource protection | Under-allow | Protect backend systems from overload |
| User-facing rate limits | Over-allow | User experience > strict enforcement |
| Cost control (expensive APIs) | Under-allow | Financial risk of over-allowing is high |
4.3 Implementing Graceful Degradation
Hybrid approach: Local limits as fallback when centralized store fails.
class ResilientRateLimiter:
def __init__(
self,
redis_client: redis.Redis,
global_limit: int,
local_fallback_limit: int,
window_seconds: int,
failure_mode: str = 'over_allow' # or 'under_allow'
):
self.redis = redis_client
self.global_limit = global_limit
self.local_fallback_limit = local_fallback_limit
self.window_seconds = window_seconds
self.failure_mode = failure_mode
self.local_limiter = SlidingWindowCounterLimiter(
local_fallback_limit, window_seconds
)
self.redis_available = True
self.last_redis_check = 0
self.redis_check_interval = 5 # seconds
def is_allowed(self, key: str) -> tuple[bool, str]:
"""Returns (allowed, mode) where mode is 'redis', 'local', or 'failed'."""
# Try Redis first
if self._redis_is_available():
try:
return self._check_redis(key), 'redis'
except redis.RedisError:
self.redis_available = False
# Redis unavailable - use fallback strategy
if self.failure_mode == 'over_allow':
# Use local limiter as soft limit
allowed = self.local_limiter.is_allowed(key)
return allowed, 'local'
else:
# Fail closed - but maybe allow some traffic
if self.local_fallback_limit > 0:
# Very conservative local limit
allowed = self.local_limiter.is_allowed(key)
return allowed, 'local_strict'
return False, 'failed'
def _redis_is_available(self) -> bool:
now = time.time()
if not self.redis_available:
# Periodically retry Redis
if now - self.last_redis_check > self.redis_check_interval:
self.last_redis_check = now
try:
self.redis.ping()
self.redis_available = True
except redis.RedisError:
pass
return self.redis_available
def _check_redis(self, key: str) -> bool:
# Implement your Redis rate limit check here
...
4.4 Monitoring Rate Limiter Health
You need to know when your rate limiter is degraded:
from prometheus_client import Counter, Histogram, Gauge
# Metrics
rate_limit_decisions = Counter(
'rate_limit_decisions_total',
'Rate limit decisions',
['result', 'mode'] # result=allowed/denied, mode=redis/local/failed
)
rate_limit_latency = Histogram(
'rate_limit_check_seconds',
'Time to check rate limit',
buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1]
)
redis_available = Gauge(
'rate_limiter_redis_available',
'Whether Redis is available for rate limiting'
)
class InstrumentedRateLimiter:
def __init__(self, limiter: ResilientRateLimiter):
self.limiter = limiter
def is_allowed(self, key: str) -> bool:
with rate_limit_latency.time():
allowed, mode = self.limiter.is_allowed(key)
result = 'allowed' if allowed else 'denied'
rate_limit_decisions.labels(result=result, mode=mode).inc()
redis_available.set(1 if self.limiter.redis_available else 0)
return allowed
Key metrics to alert on:
- Redis availability dropping
- Rate limit latency increasing
- Unusual spike in denials
- Mode switching from 'redis' to 'local'
Part II: The Design Challenge
Chapter 5: Designing a Distributed Rate Limiter
You're designing a rate limiting service for your platform. Requirements:
- Scale: 10,000 requests per second across 50 application servers
- Limits: Per-user limits (100 req/min), per-API-key limits (1000 req/min), global limits
- Latency: Rate limit check must add < 5ms to request latency (p99)
- Accuracy: Within 10% of configured limits
- Resilience: Continue operating if Redis has a brief outage
5.1 Architecture Overview
┌────────────────────────────────────────────┐
│ Rate Limit Service │
│ │
┌─────────────┐ │ ┌────────────────────────────────────┐ │
│ Request │ │ │ Configuration Store │ │
│ from │───────────────┼──▶│ (Limit rules, per-tier configs) │ │
│ Client │ │ └────────────────────────────────────┘ │
└─────────────┘ │ │ │
│ ▼ │
│ ┌────────────────────────────────────┐ │
│ │ Rate Limit Logic │ │
│ │ ┌──────────┐ ┌──────────────┐ │ │
│ │ │ Local │ │ Redis │ │ │
│ │ │ Cache │◀─┼──▶ Cluster │ │ │
│ │ └──────────┘ └──────────────┘ │ │
│ └────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Decision │ │
│ │ Allow / Deny │ │
│ └───────────────┘ │
└────────────────────────────────────────────┘
5.2 Data Model
Rate limit rules (stored in configuration):
from dataclasses import dataclass
from typing import Optional

@dataclass
class RateLimitRule:
id: str
name: str
key_pattern: str # e.g., "user:{user_id}", "api_key:{api_key}"
limit: int
window_seconds: int
algorithm: str # 'sliding_window' or 'token_bucket'
burst_capacity: Optional[int] # For token bucket
scope: str # 'per_user', 'per_api_key', 'global'
tier: Optional[str] # 'free', 'pro', 'enterprise'
# Example rules
rules = [
RateLimitRule(
id="user_basic",
name="Basic user limit",
key_pattern="user:{user_id}",
limit=100,
window_seconds=60,
algorithm="sliding_window",
scope="per_user",
tier="free"
),
RateLimitRule(
id="user_pro",
name="Pro user limit",
key_pattern="user:{user_id}",
limit=1000,
window_seconds=60,
algorithm="sliding_window",
scope="per_user",
tier="pro"
),
RateLimitRule(
id="api_key",
name="API key limit",
key_pattern="apikey:{api_key}",
limit=10000,
window_seconds=3600,
algorithm="token_bucket",
burst_capacity=500,
scope="per_api_key"
),
RateLimitRule(
id="global_writes",
name="Global write limit",
key_pattern="global:writes",
limit=5000,
window_seconds=1,
algorithm="sliding_window",
scope="global"
)
]
5.3 Multi-Level Rate Limiting
A single request might be subject to multiple limits:
class MultiLevelRateLimiter:
def __init__(self, redis_client: redis.Redis, rules: list[RateLimitRule]):
self.redis = redis_client
self.rules = rules
self.limiters = self._create_limiters()
def _create_limiters(self) -> dict[str, RateLimiter]:
limiters = {}
for rule in self.rules:
if rule.algorithm == 'sliding_window':
limiters[rule.id] = RedisSlidingWindowLimiter(
self.redis, rule.limit, rule.window_seconds
)
elif rule.algorithm == 'token_bucket':
limiters[rule.id] = RedisTokenBucket(
self.redis,
rate=rule.limit / rule.window_seconds,
capacity=rule.burst_capacity or rule.limit
)
return limiters
def check_request(self, context: RequestContext) -> RateLimitResult:
"""
Check all applicable rules for this request.
Returns the most restrictive result.
"""
applicable_rules = self._get_applicable_rules(context)
results = []
for rule in applicable_rules:
key = self._build_key(rule, context)
limiter = self.limiters[rule.id]
allowed, info = limiter.is_allowed(key)
results.append(RateLimitResult(
rule_id=rule.id,
allowed=allowed,
limit=rule.limit,
remaining=info.get('remaining', 0),
reset=info.get('reset', 0)
))
# If any rule denies, the request is denied
denied_results = [r for r in results if not r.allowed]
if denied_results:
# Return the result with the longest wait time
return min(denied_results, key=lambda r: r.remaining)
# All rules allow - return the one with lowest remaining quota
return min(results, key=lambda r: r.remaining)
def _get_applicable_rules(self, context: RequestContext) -> list[RateLimitRule]:
applicable = []
for rule in self.rules:
# Check tier match
if rule.tier and rule.tier != context.user_tier:
continue
# Check scope
if rule.scope == 'per_user' and not context.user_id:
continue
if rule.scope == 'per_api_key' and not context.api_key:
continue
applicable.append(rule)
return applicable
def _build_key(self, rule: RateLimitRule, context: RequestContext) -> str:
key = rule.key_pattern
key = key.replace('{user_id}', context.user_id or 'anonymous')
key = key.replace('{api_key}', context.api_key or 'none')
key = key.replace('{ip}', context.ip_address)
return key
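The RequestContext and RateLimitResult types used above aren't defined in this chapter; a minimal sketch of the fields the MultiLevelRateLimiter assumes:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestContext:
    user_id: Optional[str]
    api_key: Optional[str]
    ip_address: str
    user_tier: str = 'free'
    endpoint: str = ''
    method: str = 'GET'

@dataclass
class RateLimitResult:
    rule_id: str
    allowed: bool
    limit: int
    remaining: int
    reset: int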
5.4 Response Headers
Communicate rate limit status to clients:
def add_rate_limit_headers(response: Response, result: RateLimitResult):
response.headers['X-RateLimit-Limit'] = str(result.limit)
response.headers['X-RateLimit-Remaining'] = str(result.remaining)
response.headers['X-RateLimit-Reset'] = str(result.reset)
if not result.allowed:
response.headers['Retry-After'] = str(result.reset - int(time.time()))
Standard headers:
- X-RateLimit-Limit: The rate limit ceiling
- X-RateLimit-Remaining: Requests remaining in current window
- X-RateLimit-Reset: Unix timestamp when the limit resets
- Retry-After: Seconds until client should retry (on 429)
5.5 The Challenge Question: Redis Node Dies
Scenario: You have a 3-node Redis cluster. One node dies. What happens to rate limiting?
Analysis:
1. Keys on the dead node are temporarily inaccessible
   - In Redis Cluster, keys are distributed across nodes
   - ~33% of keys are on the dead node
   - Those rate limit counters are unavailable
2. Cluster failover takes time
   - Redis Cluster automatic failover: 10-30 seconds
   - During this time, reads/writes to affected keys fail
3. Options during the outage:

Option A: Fail open for affected keys

def is_allowed(self, key: str) -> bool:
    try:
        return self._check_redis(key)
    except redis.ClusterDownError:
        # Redis node down - allow the request
        return True

Option B: Fail closed for affected keys

def is_allowed(self, key: str) -> bool:
    try:
        return self._check_redis(key)
    except redis.ClusterDownError:
        # Redis node down - deny the request
        return False

Option C: Fall back to local rate limiting

def is_allowed(self, key: str) -> bool:
    try:
        return self._check_redis(key)
    except redis.ClusterDownError:
        # Use conservative local limit
        return self.local_limiter.is_allowed(key)

4. After failover completes:
   - If replica was in sync, no data loss
   - If replica was behind, some recent increments lost
   - Rate limits might be slightly "reset" for affected keys
My recommendation: Option C (local fallback) with conservative limits. During the ~30 second outage:
- 66% of keys (other nodes) continue normally
- 33% of keys (affected node) use local fallback
- No users are completely blocked
- Some users might slightly exceed limits
- System remains available
5.6 Complete Implementation
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
import redis
from typing import Optional
import time
app = FastAPI()
class RateLimitMiddleware:
def __init__(self, app: FastAPI, limiter: MultiLevelRateLimiter):
self.app = app
self.limiter = limiter
async def __call__(self, request: Request, call_next):
# Extract context from request
context = RequestContext(
user_id=request.headers.get('X-User-ID'),
api_key=request.headers.get('X-API-Key'),
ip_address=request.client.host,
user_tier=request.headers.get('X-User-Tier', 'free'),
endpoint=request.url.path,
method=request.method
)
# Check rate limits
result = self.limiter.check_request(context)
if not result.allowed:
response = JSONResponse(
status_code=429,
content={
'error': 'rate_limit_exceeded',
'message': f'Rate limit exceeded. Retry after {result.reset - int(time.time())} seconds.',
'limit': result.limit,
'reset': result.reset
}
)
add_rate_limit_headers(response, result)
return response
# Proceed with request
response = await call_next(request)
add_rate_limit_headers(response, result)
return response
# Setup
redis_client = redis.RedisCluster(
host='redis-cluster.example.com',
port=6379,
decode_responses=True
)
limiter = MultiLevelRateLimiter(redis_client, rules)
app.middleware('http')(RateLimitMiddleware(app, limiter))
Part III: Advanced Topics
Chapter 6: Rate Limiting Patterns
6.1 Tiered Rate Limits
Different limits for different user tiers:
TIER_LIMITS = {
'free': {
'requests_per_minute': 60,
'requests_per_day': 1000,
'burst': 10
},
'pro': {
'requests_per_minute': 600,
'requests_per_day': 50000,
'burst': 100
},
'enterprise': {
'requests_per_minute': 6000,
'requests_per_day': None, # Unlimited
'burst': 1000
}
}
def get_limits_for_user(user: User) -> dict:
return TIER_LIMITS.get(user.tier, TIER_LIMITS['free'])
6.2 Endpoint-Specific Limits
Different endpoints have different costs:
ENDPOINT_COSTS = {
'GET /api/users': 1,
'POST /api/users': 5,
'GET /api/reports': 10, # Expensive query
'POST /api/ai/generate': 100, # Very expensive
}
class CostAwareRateLimiter:
def __init__(self, limiter: TokenBucketLimiter):
self.limiter = limiter
def check_request(self, user_id: str, endpoint: str) -> bool:
cost = ENDPOINT_COSTS.get(endpoint, 1)
return self.limiter.is_allowed(user_id, tokens=cost)
6.3 Adaptive Rate Limiting
Adjust limits based on system load:
class AdaptiveRateLimiter:
def __init__(self, base_limiter: RateLimiter, load_monitor: LoadMonitor):
self.limiter = base_limiter
self.load_monitor = load_monitor
def get_effective_limit(self, base_limit: int) -> int:
load = self.load_monitor.get_current_load() # 0.0 to 1.0
if load < 0.5:
# Low load - allow 20% more
return int(base_limit * 1.2)
elif load < 0.8:
# Normal load - use base limit
return base_limit
elif load < 0.95:
# High load - reduce by 20%
return int(base_limit * 0.8)
else:
# Critical load - reduce by 50%
return int(base_limit * 0.5)
6.4 Rate Limit Exemptions
Some requests should bypass rate limits:
EXEMPT_PATTERNS = [
'/health',
'/metrics',
'/api/webhooks/stripe', # Stripe webhook callbacks
]
EXEMPT_API_KEYS = {
'internal-service-key-123',
'monitoring-key-456',
}
def should_exempt(request: Request) -> bool:
# Check path exemptions
for pattern in EXEMPT_PATTERNS:
if request.url.path.startswith(pattern):
return True
# Check API key exemptions
api_key = request.headers.get('X-API-Key')
if api_key in EXEMPT_API_KEYS:
return True
return False
Chapter 7: Rate Limiting at the Edge
7.1 CDN-Level Rate Limiting
For DDoS protection, rate limit at the edge:
Internet
│
▼
┌─────────────────────────────────────────────────┐
│ CDN / Edge Network │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Edge │ │ Edge │ │ Edge │ ... │
│ │ POP 1 │ │ POP 2 │ │ POP 3 │ │
│ │(Tokyo) │ │(London) │ │(NYC) │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────────────┼────────────┘ │
│ │ │
│ Rate Limit Check │
│ (per IP, per region, global) │
└────────────────────┼────────────────────────────┘
│
▼
Origin Servers
Cloudflare rate limiting rules (example):
Rule: Block if > 100 requests in 10 seconds from same IP
Action: Block for 1 hour
Scope: All endpoints
Rule: Challenge if > 1000 requests in 1 minute from same /24 subnet
Action: JavaScript challenge
Scope: Login endpoints
7.2 API Gateway Rate Limiting
AWS API Gateway, Kong, or similar:
# Kong rate limiting configuration
plugins:
- name: rate-limiting
config:
minute: 100
hour: 1000
policy: redis
redis_host: redis.example.com
redis_port: 6379
redis_database: 0
fault_tolerant: true # Continue if Redis fails
hide_client_headers: false
Benefits:
- No application code changes
- Centralized configuration
- Can rate limit before hitting your servers
Limitations:
- Less flexibility than application-level
- Can't easily do user-tier-based limits without custom integration
Part IV: Discussion and Trade-offs
Chapter 8: The Hard Questions
8.1 "One Redis node dies—what's the behavior?"
Complete answer:
"When a Redis node in a cluster dies, keys hashed to that node are temporarily unavailable. Here's my approach:
First, understand the impact: with a 3-node cluster, ~33% of keys are affected. Rate limit checks for those keys will fail.
For the ~30-second failover window, I'd use a fallback strategy:
try:
return redis_limiter.is_allowed(key)
except redis.ClusterDownError:
return local_fallback_limiter.is_allowed(key)
The local fallback uses a conservative limit—maybe 10% of the global limit per server. With 50 servers, that's 500% of the global limit in total, so we might over-allow 5x during the outage. That's acceptable for 30 seconds.
After failover, the new primary takes over. If it was slightly behind the failed primary, we might have lost some recent increments—effectively a mini-reset of rate limits for those keys. I'd accept this rather than blocking all requests.
The key is: monitor this scenario. Alert when Redis cluster is degraded. Log when fallback mode activates. Review after every incident."
8.2 "Would you over-allow or under-allow during failures?"
Complete answer:
"It depends on what the rate limit is protecting.
I'd over-allow for:
- User-facing API limits (don't block paying customers)
- General abuse prevention (attackers can wait 30 seconds anyway)
- Non-critical endpoints
I'd under-allow for:
- Cost control on expensive downstream APIs (financial risk)
- Resource protection when the system is already stressed
- Security-sensitive endpoints (login, password reset)
For our platform, I'd configure per-rule behavior:
class RateLimitRule:
...
failure_mode: str # 'open' or 'closed'
Most rules would be 'open'. But the rule limiting calls to our AI model (which costs $0.01 per call) would be 'closed'—I'd rather block users than accidentally run up a huge bill.
The hybrid approach—local conservative limits during outage—gives a middle ground. Users aren't completely blocked, but we're not completely unprotected either."
8.3 "How do you prevent abuse of the rate limit system itself?"
Potential attacks:
- Key enumeration: Attacker tries to fill up rate limit storage with millions of unique keys
- Clock manipulation: If using client timestamps, attacker sends fake times
- Distributed attack: Attacker uses many IPs to stay under per-IP limits
Defenses:
# Defense against key enumeration
class SecureRateLimiter:
def __init__(self, max_unique_keys: int = 1_000_000):
self.max_unique_keys = max_unique_keys
def is_allowed(self, key: str) -> bool:
# Use server time, not client time
now = time.time()
# Limit total keys tracked (with LRU eviction)
if self.key_count() > self.max_unique_keys:
self.evict_oldest_keys()
# For anonymous users, use IP + user agent hash
# This makes it harder to generate unique keys
...
# Defense against distributed attacks
class IPSubnetLimiter:
def is_allowed(self, ip: str) -> bool:
# Limit per /24 subnet, not just per IP
subnet = self.get_subnet(ip, prefix_length=24)
return self.limiter.is_allowed(f"subnet:{subnet}")
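The get_subnet helper above isn't shown; a minimal sketch using the standard-library ipaddress module:

import ipaddress

def get_subnet(ip: str, prefix_length: int = 24) -> str:
    # Collapse an address to its containing network so nearby IPs share one counter
    return str(ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False))

print(get_subnet("203.0.113.10"))   # "203.0.113.0/24"
print(get_subnet("203.0.113.200"))  # "203.0.113.0/24" — same rate-limit key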
Chapter 9: Session Summary
What You Should Know Now
After this session, you should be able to:
- Explain the four main algorithms (fixed window, sliding window log, sliding window counter, token bucket) with trade-offs
- Design distributed rate limiting using Redis or local+sync approaches
- Handle failure modes with explicit over-allow or under-allow decisions
- Implement multi-level rate limits (per-user, per-API-key, per-endpoint)
- Monitor and debug rate limiting systems
Key Trade-offs to Remember
| Decision | Trade-off |
|---|---|
| Fixed vs Sliding window | Simplicity vs Accuracy at boundaries |
| Token bucket vs Sliding window | Allows bursting vs Strict limits |
| Centralized vs Local counters | Accuracy vs Latency/Resilience |
| Over-allow vs Under-allow | User experience vs Protection |
| Per-IP vs Per-user limits | Coverage vs Fairness to shared IPs |
Questions to Ask in Every Design
- What are we protecting? (Resources? Cost? Fairness?)
- What's the acceptable accuracy? (Exact? Within 10%?)
- What happens during Redis failure?
- How do we communicate limits to users?
- How do we handle different user tiers?
Part V: Interview Questions and Answers
Chapter 10: Real-World Interview Scenarios
10.1 Conceptual Questions
Question 1: "Explain the token bucket algorithm."
Interviewer's Intent: Testing fundamental algorithm knowledge.
Strong Answer:
"Token bucket is a rate limiting algorithm that allows controlled bursting. Imagine a bucket that holds tokens—it fills at a constant rate (say, 10 tokens per second) up to a maximum capacity (say, 100 tokens).
Each request consumes tokens from the bucket. If there are enough tokens, the request is allowed and tokens are deducted. If not, the request is rejected.
The key properties:
- Long-term rate is bounded by the fill rate (10/sec)
- Short-term bursts are allowed up to bucket capacity (100)
- After a burst, you have to wait for the bucket to refill
Implementation-wise, you don't actually simulate filling—you calculate tokens based on time elapsed:
tokens = min(capacity, tokens + (elapsed_time * rate))
I'd use token bucket when I want to allow bursting, like a user who's been idle for a while making several rapid requests. For strict rate enforcement without bursting, I'd use sliding window counter instead."
Question 2: "What's the boundary problem with fixed window rate limiting?"
Interviewer's Intent: Testing understanding of algorithm edge cases.
Strong Answer:
"Fixed window rate limiting divides time into fixed intervals—say, minute boundaries. You count requests in each minute and reset at the boundary.
The problem: a user can make their full quota at the end of one window and again at the start of the next window, effectively doubling their rate.
Example with 100 requests/minute limit:
- 99 requests at 12:00:59 (Window 1)
- 99 requests at 12:01:01 (Window 2)
That's 198 requests in 2 seconds, way over the intended rate.
Sliding window counter fixes this by weighting the previous window's count based on how much of it overlaps with the current sliding window. If we're 30 seconds into the new window, we count 50% of the previous window's requests plus all of the current window's requests.
In practice, I'd always use sliding window counter for user-facing rate limits. Fixed window is only acceptable for very rough limiting where 2x burst at boundaries is tolerable."
Question 3: "How would you rate limit across multiple servers?"
Interviewer's Intent: Testing distributed systems thinking.
Strong Answer:
"The challenge is that without coordination, each server only sees a fraction of requests. A user could exceed limits by spreading requests across servers.
Three approaches:
Centralized store (Redis): All servers check/update a shared counter. Accurate but adds latency and creates a dependency.
Local + periodic sync: Each server has local counters, periodically syncing to Redis. Lower latency, but accuracy suffers between syncs.
Sticky sessions: Route each user to a consistent server. No coordination needed, but uneven load distribution and failover complexity.
I'd typically use Redis with the sliding window counter algorithm. The latency hit (~1-2ms) is acceptable for most APIs. For ultra-low-latency requirements, I'd use local counters with sync, accepting that we might over-allow by up to (number of servers × local limit) during the sync interval.
The key is building in graceful degradation: if Redis fails, fall back to local limits rather than failing completely open or closed."
10.2 Design Questions
Question 4: "Design a rate limiter for a multi-tenant SaaS API."
Interviewer's Intent: Testing end-to-end system design.
Strong Answer:
"Let me understand the requirements. Multi-tenant means different customers with potentially different limits. I'll assume we have tiers: free (100 req/min), pro (1000 req/min), and enterprise (custom).
Architecture:
Request → Identify Tenant → Look Up Limits → Check Counter → Allow/Deny
Tenant identification: From API key in header. Each API key maps to a tenant and tier.
Limit storage: I'd store limit configurations in PostgreSQL with caching:
@dataclass
class TenantConfig:
tenant_id: str
tier: str
custom_limits: dict # For enterprise overrides
Counter storage: Redis with sliding window counter. Keys like ratelimit:{tenant_id}:{endpoint_group}.
Multi-level limits:
- Per-tenant overall (1000/min)
- Per-tenant per-endpoint (e.g., 100/min for writes)
- Global system protection (100K/min total across all tenants)
Response: 429 Too Many Requests with headers showing limit, remaining, and reset time.
Failure handling: If Redis is down, fall back to local counters. Free tier gets strict local limits (fail closed to protect resources). Paid tiers get lenient local limits (fail open to avoid blocking paying customers).
Monitoring: Track rate limit hits per tenant, alert on high rejection rates, dashboard showing top consumers."
Question 5: "Design rate limiting for a real-time chat application."
Interviewer's Intent: Testing adaptation to specific use case.
Strong Answer:
"Chat has different requirements than typical APIs. Users expect instant message delivery, and messages come in bursts (typing a conversation).
Key considerations:
- Users chat in bursts—strict per-second limits would feel terrible
- Different limits for different actions (messages, media uploads, connection attempts)
- Abuse vectors: spam, flooding a room, repeated connection attempts
My approach:
Token bucket for messages: Fill rate 1 token/second, capacity 20 tokens. Users can burst 20 messages quickly (normal conversation), but sustained spam is blocked.
message_limits = {
'direct_message': TokenBucket(rate=1, capacity=20),
'group_message': TokenBucket(rate=0.5, capacity=10), # Stricter for groups
'media_upload': TokenBucket(rate=0.1, capacity=3), # Much stricter
}
Fixed window for connections: Limit reconnection attempts to prevent connection-spam attacks:
connection_limit = FixedWindow(limit=10, window_minutes=1)
Per-room limits: In addition to per-user limits, limit the total message rate per room to prevent coordinated spam:
room_limit = SlidingWindow(limit=100, window_seconds=10)
Implementation: Local limits on each chat server (which maintains WebSocket connections), with async sync to Redis for cross-server visibility. The local-first approach is crucial for low latency—we can't add a Redis round trip to every message.
User experience: Instead of hard rejection, I'd queue messages that exceed soft limits and deliver them slowly. Only hard reject for severe abuse."
10.3 Scenario-Based Questions
Question 6: "Your rate limiter is rejecting legitimate users during traffic spikes. What do you do?"
Interviewer's Intent: Testing operational problem-solving.
Strong Answer:
"First, triage: Is this affecting all users or specific ones? Are limits correctly configured? Is it actually the rate limiter or something else?
Investigation steps:
1. Check rate limiter metrics: What's the rejection rate? Which rules are triggering?
2. Look at specific users: Are they actually exceeding limits, or is the rate limiter malfunctioning?
3. Check Redis health: High latency or failures could cause incorrect counts.
Immediate mitigations (in order of risk):
1. Increase limits temporarily: If legitimate traffic grew, raise limits to accommodate:
   redis-cli SET ratelimit:config:user_basic:limit 200  # Double the limit
2. Add more headroom per tier: Maybe our limits were too conservative.
3. Switch to local-only limiting temporarily: If Redis is the bottleneck, local limits might be less restrictive.
4. Exempt specific users/endpoints: If one critical customer is blocked, add them to exemption list while we fix things.
Post-incident:
- Review whether limits match actual usage patterns
- Consider adaptive rate limiting that adjusts to traffic patterns
- Add alerting for rejection rate spikes
- Maybe implement graduated responses: soft limit (warn) before hard limit (block)"
Question 7: "How would you test your rate limiter?"
Interviewer's Intent: Testing quality engineering mindset.
Strong Answer:
"Testing rate limiters requires validating both correctness and behavior under failure.
Unit tests for algorithms:
def test_sliding_window_allows_under_limit():
limiter = SlidingWindowLimiter(limit=10, window_seconds=60)
for _ in range(10):
assert limiter.is_allowed('user1') == True
assert limiter.is_allowed('user1') == False
def test_sliding_window_resets():
limiter = SlidingWindowLimiter(limit=10, window_seconds=1)
for _ in range(10):
limiter.is_allowed('user1')
time.sleep(1)
assert limiter.is_allowed('user1') == True
def test_fixed_window_boundary_burst():
# Verify we understand the boundary problem
limiter = FixedWindowLimiter(limit=10, window_seconds=60)
# Simulate requests at window boundary
...
Integration tests with Redis:
def test_distributed_counting():
# Simulate multiple servers
limiter1 = RedisRateLimiter(redis_client, limit=10, window_seconds=60)
limiter2 = RedisRateLimiter(redis_client, limit=10, window_seconds=60)
for _ in range(5):
limiter1.is_allowed('user1')
limiter2.is_allowed('user1')
# Both should see the same count
assert limiter1.is_allowed('user1') == False
Load tests:
# Hit the rate limiter with 10K requests/second
# Verify:
# - Latency stays under 5ms p99
# - Correct rejection rate (within 10% of expected)
# - No memory leaks
# - Redis connection pool is healthy
Chaos tests:
def test_redis_failure_recovery():
# Make requests (should use Redis)
# Kill Redis
# Make requests (should use fallback)
# Verify fallback behavior matches config (open/closed)
# Restart Redis
# Make requests (should use Redis again)
Production testing:
- Shadow mode: Log what would be rejected without actually rejecting
- Canary: Route 1% of traffic to new limiter version
- Gradual rollout with kill switch"
10.4 Deep-Dive Questions
Question 8: "Compare rate limiting implementations in Redis vs in-memory."
Interviewer's Intent: Testing infrastructure trade-off knowledge.
Strong Answer:
"Redis-based:
Pros:
- Shared state across all servers (accurate global limits)
- Persistent (survives server restarts)
- Atomic operations (Lua scripts for complex logic)
- Scales independently of application servers
Cons:
- Network latency (1-2ms per check)
- Single point of failure (even clustered)
- Operational overhead
- Cost at scale
In-memory:
Pros:
- Near-zero latency (~microseconds)
- No external dependencies
- Simple deployment
- No network failure modes
Cons:
- Per-server limits only (inaccurate global limits)
- State lost on restart
- Scales with application servers (might not want that)
- Each server uses memory
When to use each:
Redis: User-facing API limits where accuracy matters, multi-tenant systems, when you need persistence.
In-memory: Ultra-low-latency requirements, per-connection limits (like WebSocket messages), local caching layer before Redis.
Hybrid: In-memory first pass (reject obvious violations), Redis for border cases. Or in-memory with periodic Redis sync.
For most web APIs, Redis with proper fallback is the right choice. The 1-2ms overhead is negligible compared to typical API latencies."
Question 9: "How do services like Cloudflare and AWS do rate limiting at scale?"
Interviewer's Intent: Testing knowledge of real-world systems.
Strong Answer:
"Cloudflare:
They rate limit at the edge—in their 300+ data centers worldwide. Each edge PoP makes local decisions with eventual consistency across PoPs.
For per-IP limits, this works well—most requests from one IP hit the same PoP. For per-user limits (requiring API key lookup), they need to route to origin or use distributed state.
They use a combination of in-memory counting at the edge and centralized aggregation for global limits. The key insight is that for DDoS protection, exact accuracy isn't needed—blocking 90% of an attack is almost as good as blocking 100%.
AWS API Gateway:
Uses token bucket algorithm. Regional limits are enforced within each AWS region. For global limits across regions, you'd need to build your own coordination.
They integrate with AWS WAF for additional IP-based limiting. The rate limiting state is stored in their internal distributed systems, not exposed to users.
Stripe:
Uses a custom sliding window implementation. They're notable for excellent rate limit headers and detailed error messages. They have per-resource limits (different limits for creating customers vs creating charges).
Common patterns:
- Edge limiting for DDoS (approximate, fast)
- Application limiting for quotas (accurate, slower)
- Hierarchical limits (IP < User < Account < Global)
- Graceful degradation during partial failures
The theme is: different types of limits at different layers, accepting that distributed rate limiting is inherently approximate."
Chapter 11: Interview Preparation Checklist
Before your interview, make sure you can:
Algorithms
- Explain fixed window, sliding window log, sliding window counter, token bucket
- Describe the boundary problem and how sliding window solves it
- Know when to use token bucket vs sliding window
Distributed Systems
- Design rate limiting with Redis
- Explain the coordination problem across multiple servers
- Implement fallback strategies for Redis failure
Failure Modes
- Articulate over-allow vs under-allow trade-off
- Design graceful degradation
- Know what to monitor and alert on
Real-World Knowledge
- Know how CDNs/API Gateways do rate limiting
- Understand multi-tenant rate limiting
- Cite real examples (Stripe, Cloudflare)
Exercises
Exercise 1: Algorithm Implementation
Implement a sliding window counter rate limiter that:
- Supports configurable window size and limit
- Returns remaining quota and reset time
- Handles concurrent access safely
Exercise 2: Distributed Design
Design a rate limiting system for a video streaming service:
- 1M concurrent users
- Limits on: video starts (5/min), API calls (100/min), bandwidth (10GB/day)
- Global and per-user limits
- Handle CDN edge servers and origin servers
Exercise 3: Failure Analysis
Your rate limiter uses a 3-node Redis cluster. Analyze what happens in each scenario:
- One node becomes slow (100ms latency)
- One node dies and fails over (30 second outage)
- Network partition isolates your app servers from Redis
- Redis runs out of memory
For each, describe the user experience and your recommended handling.
Further Reading
- "Rate Limiting" chapter in Designing Data-Intensive Applications
- Stripe's API Rate Limiting: Excellent documentation and implementation
- Cloudflare Blog: Posts on edge rate limiting and DDoS protection
- Redis Rate Limiting Patterns: Official Redis documentation
- Kong Rate Limiting Plugin: Open-source API gateway approach
Appendix: Code Reference
A.1 Complete Redis Rate Limiter
import redis
import time
from dataclasses import dataclass
from typing import Optional, Tuple
from enum import Enum
class RateLimitAlgorithm(Enum):
SLIDING_WINDOW = "sliding_window"
TOKEN_BUCKET = "token_bucket"
@dataclass
class RateLimitResult:
allowed: bool
limit: int
remaining: int
reset: int
retry_after: Optional[int] = None
class ProductionRateLimiter:
"""
Production-ready rate limiter with:
- Multiple algorithm support
- Graceful degradation
- Comprehensive metrics
- Multi-key support
"""
SLIDING_WINDOW_SCRIPT = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local window_start = now - window
redis.call('ZREMRANGEBYSCORE', key, 0, window_start)
local count = redis.call('ZCARD', key)
if count < limit then
redis.call('ZADD', key, now, now .. ':' .. math.random())
redis.call('EXPIRE', key, window)
return {1, limit - count - 1, math.ceil(now + window)}
else
local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
local reset = oldest[2] and (oldest[2] + window) or (now + window)
return {0, 0, math.ceil(reset)}
end
"""
TOKEN_BUCKET_SCRIPT = """
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_update')
local tokens = tonumber(data[1]) or capacity
local last_update = tonumber(data[2]) or now
local elapsed = now - last_update
tokens = math.min(capacity, tokens + (elapsed * rate))
local allowed = 0
if tokens >= requested then
tokens = tokens - requested
allowed = 1
end
redis.call('HMSET', key, 'tokens', tokens, 'last_update', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
local reset = now + ((capacity - tokens) / rate)
return {allowed, math.floor(tokens), math.ceil(reset)}
"""
def __init__(
self,
redis_client: redis.Redis,
default_limit: int = 100,
default_window: int = 60,
algorithm: RateLimitAlgorithm = RateLimitAlgorithm.SLIDING_WINDOW,
failure_mode: str = 'open',
local_fallback_limit: int = 10
):
self.redis = redis_client
self.default_limit = default_limit
self.default_window = default_window
self.algorithm = algorithm
self.failure_mode = failure_mode
self.local_fallback_limit = local_fallback_limit
# Register Lua scripts
self.sliding_window_script = self.redis.register_script(self.SLIDING_WINDOW_SCRIPT)
self.token_bucket_script = self.redis.register_script(self.TOKEN_BUCKET_SCRIPT)
# Local fallback
self.local_counts: dict[str, list[float]] = {}
self.redis_healthy = True
def is_allowed(
self,
key: str,
limit: Optional[int] = None,
window: Optional[int] = None,
cost: int = 1
) -> RateLimitResult:
"""
Check if request is allowed under rate limit.
"""
limit = limit or self.default_limit
window = window or self.default_window
try:
if self.algorithm == RateLimitAlgorithm.SLIDING_WINDOW:
result = self._sliding_window_check(key, limit, window)
else:
result = self._token_bucket_check(key, limit, window, cost)
self.redis_healthy = True
return result
except redis.RedisError as e:
self.redis_healthy = False
return self._fallback_check(key, limit, window)
def _sliding_window_check(
self,
key: str,
limit: int,
window: int
) -> RateLimitResult:
now = time.time()
result = self.sliding_window_script(
keys=[f"ratelimit:sw:{key}"],
args=[now, window, limit]
)
return RateLimitResult(
allowed=bool(result[0]),
limit=limit,
remaining=result[1],
reset=result[2],
retry_after=None if result[0] else max(0, result[2] - int(now))
)
def _token_bucket_check(
self,
key: str,
limit: int,
window: int,
cost: int
) -> RateLimitResult:
now = time.time()
rate = limit / window
capacity = limit
result = self.token_bucket_script(
keys=[f"ratelimit:tb:{key}"],
args=[rate, capacity, now, cost]
)
return RateLimitResult(
allowed=bool(result[0]),
limit=limit,
remaining=result[1],
reset=result[2],
retry_after=None if result[0] else max(0, result[2] - int(now))
)
def _fallback_check(self, key: str, limit: int, window: int) -> RateLimitResult:
"""
Local fallback when Redis is unavailable.
"""
if self.failure_mode == 'closed':
return RateLimitResult(
allowed=False,
limit=limit,
remaining=0,
reset=int(time.time()) + window,
retry_after=window
)
# Fail open with local limiting
now = time.time()
window_start = now - window
if key not in self.local_counts:
self.local_counts[key] = []
# Remove old entries
self.local_counts[key] = [t for t in self.local_counts[key] if t > window_start]
count = len(self.local_counts[key])
local_limit = min(self.local_fallback_limit, limit)
if count < local_limit:
self.local_counts[key].append(now)
return RateLimitResult(
allowed=True,
limit=local_limit,
remaining=local_limit - count - 1,
reset=int(now + window)
)
else:
return RateLimitResult(
allowed=False,
limit=local_limit,
remaining=0,
reset=int(now + window),
retry_after=window
)
# FastAPI integration
from fastapi import Request, HTTPException
from fastapi.responses import JSONResponse
def rate_limit_middleware(limiter: ProductionRateLimiter):
async def middleware(request: Request, call_next):
# Extract key from request
key = request.headers.get('X-API-Key') or request.client.host
result = limiter.is_allowed(key)
if not result.allowed:
return JSONResponse(
status_code=429,
content={
'error': 'rate_limit_exceeded',
'retry_after': result.retry_after
},
headers={
'X-RateLimit-Limit': str(result.limit),
'X-RateLimit-Remaining': str(result.remaining),
'X-RateLimit-Reset': str(result.reset),
'Retry-After': str(result.retry_after)
}
)
response = await call_next(request)
response.headers['X-RateLimit-Limit'] = str(result.limit)
response.headers['X-RateLimit-Remaining'] = str(result.remaining)
response.headers['X-RateLimit-Reset'] = str(result.reset)
return response
return middleware
A.2 Rate Limiter Testing Suite
import pytest
import time
import redis
import fakeredis
from unittest.mock import patch, MagicMock
class TestSlidingWindowLimiter:
@pytest.fixture
def limiter(self):
redis_client = fakeredis.FakeRedis()
return ProductionRateLimiter(
redis_client=redis_client,
default_limit=10,
default_window=60,
algorithm=RateLimitAlgorithm.SLIDING_WINDOW
)
def test_allows_under_limit(self, limiter):
for i in range(10):
result = limiter.is_allowed('user1')
assert result.allowed, f"Request {i+1} should be allowed"
assert result.remaining == 10 - i - 1
def test_blocks_over_limit(self, limiter):
for _ in range(10):
limiter.is_allowed('user1')
result = limiter.is_allowed('user1')
assert not result.allowed
assert result.remaining == 0
assert result.retry_after > 0
def test_separate_keys(self, limiter):
for _ in range(10):
limiter.is_allowed('user1')
# Different key should still be allowed
result = limiter.is_allowed('user2')
assert result.allowed
def test_window_reset(self, limiter):
limiter.default_window = 1 # 1 second window
for _ in range(10):
limiter.is_allowed('user1')
result = limiter.is_allowed('user1')
assert not result.allowed
time.sleep(1.1)
result = limiter.is_allowed('user1')
assert result.allowed
class TestFallbackBehavior:
def test_fail_open(self):
redis_client = MagicMock()
redis_client.register_script.return_value = MagicMock(
side_effect=redis.RedisError("Connection refused")
)
limiter = ProductionRateLimiter(
redis_client=redis_client,
failure_mode='open',
local_fallback_limit=5
)
# Should use local fallback
for i in range(5):
result = limiter.is_allowed('user1')
assert result.allowed
# Local limit exceeded
result = limiter.is_allowed('user1')
assert not result.allowed
def test_fail_closed(self):
redis_client = MagicMock()
redis_client.register_script.return_value = MagicMock(
side_effect=redis.RedisError("Connection refused")
)
limiter = ProductionRateLimiter(
redis_client=redis_client,
failure_mode='closed'
)
result = limiter.is_allowed('user1')
assert not result.allowed
End of Day 3: Rate Limiting at Scale
Tomorrow: Day 4 — Hot Keys and Skew. We'll redesign the URL shortener assuming 0.01% of URLs get 90% of traffic, and explore how companies like Instagram handle celebrity posts that get millions of interactions.