Himanshu Kukreja

Week 1 — Day 3: Rate Limiting at Scale

System Design Mastery Series


Preface

Days 1 and 2 taught you how to store and replicate data. Today, you learn how to protect those systems from being overwhelmed.

Rate limiting sounds simple: count requests, reject when over limit. But at scale, rate limiting becomes a distributed systems problem with no perfect solution. You'll face questions like:

  • How do you count requests across 50 servers?
  • What happens when your rate limit store fails?
  • Is it better to accidentally allow too many requests, or accidentally block legitimate ones?

By the end of this session, you'll understand the algorithms, the distributed coordination challenges, and the failure modes that make rate limiting one of the most deceptively complex components in modern systems.

Let's begin.


Part I: Foundations

Chapter 1: Why Rate Limiting Matters

1.1 The Threats You're Defending Against

Rate limiting protects your system from several classes of problems:

Threat 1: Denial of Service (DoS)

A malicious actor floods your API with requests, consuming all available resources. Legitimate users can't get through.

Without rate limiting: One bad actor with a botnet can take down your entire service.

With rate limiting: Each IP/user is limited to N requests per second. The attacker is throttled; legitimate users continue normally.

Threat 2: Resource Exhaustion

A single customer accidentally runs a script that makes 1 million API calls in a loop. They didn't mean to, but they're consuming resources meant for all customers.

Without rate limiting: One runaway script degrades service for everyone.

With rate limiting: The script hits its limit, gets throttled, and the customer fixes their code.

Threat 3: Cost Control

Your service calls expensive downstream APIs (payment processors, AI models, third-party data). Each call costs money.

Without rate limiting: A bug or abuse could generate a massive bill.

With rate limiting: You cap the maximum spend per customer per time period.

Threat 4: Fair Resource Allocation

You have 100 customers sharing infrastructure. One customer is 10x larger than the others. Without limits, they'd consume most resources.

With rate limiting: Each customer gets a fair share (or a share proportional to their plan).

1.2 The Rate Limiting Contract

A rate limit is a contract with your users:

"You can make up to N requests per time period T. If you exceed this, we'll reject additional requests until the period resets."

Example contracts:

  • 100 requests per minute per API key
  • 1000 requests per hour per IP address
  • 10 requests per second per user (with bursting up to 50)

The contract must be:

  • Clear: Users know their limits before hitting them
  • Predictable: Same behavior every time
  • Fair: Applied consistently across all users
  • Communicated: Response headers tell users their remaining quota

1.3 Where Rate Limiting Lives

Rate limiting can be implemented at multiple layers:

Internet
    │
    ▼
┌─────────────────┐
│   CDN / Edge    │  ← Layer 1: Geographic rate limiting, DDoS protection
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Load Balancer  │  ← Layer 2: Connection limits, basic throttling
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   API Gateway   │  ← Layer 3: Per-API-key limits, quotas
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Application   │  ← Layer 4: Business logic limits (e.g., 5 password attempts)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Database     │  ← Layer 5: Connection pools, query limits
└─────────────────┘

Today, we focus on Layers 3 and 4: API-level rate limiting that requires distributed coordination.


Chapter 2: Rate Limiting Algorithms

There are four major algorithms for rate limiting, plus the closely related leaky bucket used for traffic shaping. Each has different characteristics for burstiness, memory usage, and accuracy.

2.1 Fixed Window Counter

How It Works

Divide time into fixed windows (e.g., 1-minute intervals). Count requests in each window. Reset the counter when the window changes.

Window 1 (00:00-00:59): [||||||||||] 10 requests → Allow
Window 2 (01:00-01:59): [||||||||||||||||||||] 20 requests → Allow 10, reject 10
Window 3 (02:00-02:59): [|||] 3 requests → Allow

Implementation:

import time

class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters: dict[str, tuple[int, int]] = {}  # key -> (count, window_start)
    
    def is_allowed(self, key: str) -> bool:
        now = int(time.time())
        window_start = now - (now % self.window_seconds)
        
        count, stored_window = self.counters.get(key, (0, window_start))
        
        # New window - reset counter
        if stored_window != window_start:
            count = 0
            stored_window = window_start
        
        if count < self.limit:
            self.counters[key] = (count + 1, stored_window)
            return True
        return False

Strengths

  • Simple: Easy to understand and implement
  • Memory efficient: One counter per key per window
  • Fast: O(1) lookup and update

Weaknesses

  • Boundary problem: A user can make 2x the limit at the boundary between windows

Limit: 10 requests per minute

Window 1 (00:00-00:59): User makes 10 requests at 00:55
Window 2 (01:00-01:59): User makes 10 requests at 01:05

Result: 20 requests in 10 seconds (00:55 to 01:05)
        But each window only saw 10 → both allowed!

This "burst at the boundary" can double the effective rate.

2.2 Sliding Window Log

How It Works

Store the timestamp of each request. To check if a new request is allowed, count requests in the past window (sliding).

Limit: 10 requests per 60 seconds
Current time: 12:05:30

Request log for user: [12:04:25, 12:04:40, 12:05:00, 12:05:10, 12:05:15]
                       ↑ older than 60 seconds ago, ignored
                       
Requests in window (12:04:30 to 12:05:30): 4
New request allowed: Yes (4 < 10)

Implementation:

import time

class SlidingWindowLogLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs: dict[str, list[float]] = {}
    
    def is_allowed(self, key: str) -> bool:
        now = time.time()
        window_start = now - self.window_seconds
        
        # Get existing timestamps, filter out old ones
        timestamps = self.logs.get(key, [])
        timestamps = [t for t in timestamps if t > window_start]
        
        if len(timestamps) < self.limit:
            timestamps.append(now)
            self.logs[key] = timestamps
            return True
        return False

Strengths

  • Accurate: No boundary problem—the window truly slides
  • Precise: Exact count of requests in any rolling window

Weaknesses

  • Memory intensive: Must store timestamp of every request
  • Cleanup overhead: Need to periodically remove old timestamps
  • Slow at scale: Counting requests requires scanning the log

For high-volume APIs (thousands of requests per second per key), this becomes expensive.

2.3 Sliding Window Counter

How It Works

A hybrid approach: use fixed windows but weight them based on how much of the current sliding window they cover.

Limit: 10 requests per 60 seconds
Current time: 12:01:15 (15 seconds into the new minute)

Previous window (12:00:00-12:00:59): 8 requests
Current window (12:01:00-12:01:59): 3 requests (so far)

Weighted count = (previous × overlap%) + current
               = (8 × 75%) + 3
               = 6 + 3
               = 9

9 < 10 → Request allowed

The "overlap%" is how much of the previous window is still within our 60-second sliding window:

  • At 12:01:15, we're 15 seconds into the new window
  • So the previous window contributes 75% (45 seconds of overlap / 60 seconds)

Implementation:

import time

class SlidingWindowCounterLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters: dict[str, dict] = {}
    
    def is_allowed(self, key: str) -> bool:
        now = time.time()
        current_window = int(now // self.window_seconds)
        window_position = now % self.window_seconds
        
        data = self.counters.get(key, {
            'prev_count': 0,
            'prev_window': current_window - 1,
            'curr_count': 0,
            'curr_window': current_window
        })
        
        # Handle window transitions
        if data['curr_window'] < current_window:
            data['prev_count'] = data['curr_count'] if data['curr_window'] == current_window - 1 else 0
            data['prev_window'] = data['curr_window']
            data['curr_count'] = 0
            data['curr_window'] = current_window
        
        # Calculate weighted count
        prev_weight = (self.window_seconds - window_position) / self.window_seconds
        weighted_count = (data['prev_count'] * prev_weight) + data['curr_count']
        
        if weighted_count < self.limit:
            data['curr_count'] += 1
            self.counters[key] = data
            return True
        return False

Strengths

  • Accurate enough: Smooths the boundary problem without storing every timestamp
  • Memory efficient: Only two counters per key
  • Fast: O(1) operations

Weaknesses

  • Approximate: Not perfectly accurate (weighted estimate, not exact count)
  • Slightly more complex: Than fixed window

This is the sweet spot for most production systems.

2.4 Token Bucket

How It Works

Imagine a bucket that holds tokens. The bucket fills at a constant rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected.

The bucket has a maximum capacity, allowing for controlled bursting.

Configuration: 10 tokens/second rate, 50 token capacity

Time 0: Bucket has 50 tokens (full)
        30 requests arrive → 30 allowed, 20 tokens remain

Time 1: Bucket refills 10 tokens → 30 tokens
        5 requests arrive → 5 allowed, 25 tokens remain

Time 2: Bucket refills 10 tokens → 35 tokens
        0 requests → 35 tokens remain

Time 3: Bucket refills 10 tokens → 45 tokens
        60 requests → 45 allowed, 15 rejected, 0 tokens remain

Implementation:

import time

class TokenBucketLimiter:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate  # tokens per second
        self.capacity = capacity
        self.buckets: dict[str, tuple[float, float]] = {}  # key -> (tokens, last_update)
    
    def is_allowed(self, key: str, tokens_requested: int = 1) -> bool:
        now = time.time()
        
        current_tokens, last_update = self.buckets.get(key, (self.capacity, now))
        
        # Add tokens based on time elapsed
        elapsed = now - last_update
        current_tokens = min(self.capacity, current_tokens + (elapsed * self.rate))
        
        if current_tokens >= tokens_requested:
            current_tokens -= tokens_requested
            self.buckets[key] = (current_tokens, now)
            return True
        else:
            self.buckets[key] = (current_tokens, now)
            return False

Strengths

  • Allows bursting: Users can burst up to bucket capacity
  • Smooth rate over time: Long-term rate converges to the configured rate
  • Flexible: Can charge different costs for different operations

Weaknesses

  • Bursting can be problematic: If all users burst simultaneously, you might exceed capacity
  • Slightly more state: Need to track tokens and timestamp per key

2.5 Leaky Bucket

How It Works

Requests enter a bucket (queue). The bucket "leaks" at a constant rate—requests are processed at a fixed rate regardless of arrival pattern.

Configuration: Process 10 requests/second, queue capacity 50

Requests arrive in burst: [30 requests at once]
Queue: [30 requests waiting]
Processing: 10/second

After 1 second: 10 processed, 20 in queue
After 2 seconds: 20 processed, 10 in queue
After 3 seconds: 30 processed, queue empty
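
The sections above each included an implementation, so for parity here is a minimal leaky bucket sketch. It models the queue with a deque and drains it lazily at the configured rate; a real implementation would also process the queued requests asynchronously rather than merely admitting them:

import time
from collections import deque

class LeakyBucketLimiter:
    def __init__(self, rate: float, queue_capacity: int):
        self.rate = rate                      # requests drained per second
        self.queue_capacity = queue_capacity  # max queued requests per key
        self.queues: dict[str, deque] = {}
        self.last_leak: dict[str, float] = {}

    def try_enqueue(self, key: str) -> bool:
        """Return True if the request fits in the queue, False if it must be rejected."""
        now = time.time()
        queue = self.queues.setdefault(key, deque())
        last = self.last_leak.setdefault(key, now)

        # Drain ("leak") the whole requests that would have been processed since the last check
        leaked = int((now - last) * self.rate)
        if leaked > 0:
            for _ in range(min(leaked, len(queue))):
                queue.popleft()
            # Advance the leak clock only by the time covering whole drained slots,
            # so fractional progress isn't lost
            self.last_leak[key] = last + leaked / self.rate

        if len(queue) < self.queue_capacity:
            queue.append(now)
            return True
        return False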

Strengths

  • Smooth output rate: Downstream systems see a constant load
  • Prevents bursting: Output is always at the configured rate

Weaknesses

  • Adds latency: Requests wait in queue
  • Queue management: Need to handle full queue (reject new requests)
  • Not suitable for real-time: Users wait even when system has capacity

Leaky bucket is common for smoothing traffic to downstream services, not for user-facing rate limits.

2.6 Algorithm Comparison

Algorithm                Accuracy       Memory     Burstiness              Best For
Fixed Window             Low            Very Low   High (2x at boundary)   Simple use cases
Sliding Window Log       Exact          High       None                    When accuracy is critical
Sliding Window Counter   Good           Low        Low                     Most API rate limits
Token Bucket             Configurable   Low        Controlled              Allowing bursts, varied costs
Leaky Bucket             N/A            Medium     None                    Smoothing to downstream

Recommendation: Start with sliding window counter for API rate limiting. Use token bucket if you need controlled bursting.


Chapter 3: Distributed Rate Limiting

Everything so far assumed a single server. In reality, you have 50 servers, and requests from the same user can hit any of them.

3.1 The Coordination Problem

User makes 100 requests
Limit: 10 requests/minute

Without coordination:
  Server 1: Sees 2 requests → Allows all
  Server 2: Sees 3 requests → Allows all
  Server 3: Sees 2 requests → Allows all
  ...
  Server 50: Sees 2 requests → Allows all
  
  Total allowed: 100 (should have been 10!)

Each server only sees a fraction of the traffic. Without coordination, users can exceed limits by spreading requests across servers.

3.2 Solution 1: Centralized Counter (Redis)

Store counters in a central store that all servers access.

                    ┌───────────────────────────────────────────┐
                    │              Application Servers           │
                    │  ┌─────┐ ┌─────┐ ┌─────┐       ┌─────┐   │
                    │  │ S1  │ │ S2  │ │ S3  │  ...  │ S50 │   │
                    │  └──┬──┘ └──┬──┘ └──┬──┘       └──┬──┘   │
                    └─────┼──────┼──────┼─────────────┼────────┘
                          │      │      │             │
                          └──────┴──────┴─────────────┘
                                        │
                                        ▼
                              ┌─────────────────┐
                              │   Redis Cluster │
                              │   (Counters)    │
                              └─────────────────┘

Implementation with Redis:

import redis
import time

class RedisRateLimiter:
    def __init__(self, redis_client: redis.Redis, limit: int, window_seconds: int):
        self.redis = redis_client
        self.limit = limit
        self.window_seconds = window_seconds
    
    def is_allowed(self, key: str) -> tuple[bool, dict]:
        """
        Returns (allowed, info) where info contains remaining quota and reset time.
        Uses a sliding window log stored as a Redis sorted set of timestamps.
        """
        now = time.time()
        window_key = f"ratelimit:{key}"
        
        # Use Redis pipeline for atomic operations
        pipe = self.redis.pipeline()
        
        # Remove old entries (outside window)
        window_start = now - self.window_seconds
        pipe.zremrangebyscore(window_key, 0, window_start)
        
        # Count current entries
        pipe.zcard(window_key)
        
        # Add current request (optimistically)
        pipe.zadd(window_key, {f"{now}:{id(now)}": now})
        
        # Set expiry on the key
        pipe.expire(window_key, self.window_seconds)
        
        results = pipe.execute()
        current_count = results[1]
        
        if current_count < self.limit:
            remaining = self.limit - current_count - 1
            return True, {
                'allowed': True,
                'remaining': max(0, remaining),
                'reset': int(now + self.window_seconds),
                'limit': self.limit
            }
        else:
            # Over limit - remove the optimistically added entry
            self.redis.zrem(window_key, f"{now}:{id(now)}")
            return False, {
                'allowed': False,
                'remaining': 0,
                'reset': int(now + self.window_seconds),
                'limit': self.limit
            }

Optimized Token Bucket in Redis (Lua script):

-- Token bucket rate limiter as a Lua script (atomic execution)
-- KEYS[1] = rate limit key
-- ARGV[1] = rate (tokens per second)
-- ARGV[2] = capacity (max tokens)
-- ARGV[3] = current timestamp
-- ARGV[4] = tokens requested

local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

-- Get current state
local data = redis.call('HMGET', key, 'tokens', 'last_update')
local tokens = tonumber(data[1]) or capacity
local last_update = tonumber(data[2]) or now

-- Calculate token refill
local elapsed = now - last_update
local new_tokens = math.min(capacity, tokens + (elapsed * rate))

-- Check if request can be fulfilled
local allowed = 0
if new_tokens >= requested then
    new_tokens = new_tokens - requested
    allowed = 1
end

-- Update state
redis.call('HMSET', key, 'tokens', new_tokens, 'last_update', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)

return {allowed, new_tokens, capacity}

The Python wrapper that registers and calls this script:

class RedisTokenBucket:
    LUA_SCRIPT = """..."""  # The Lua script above
    
    def __init__(self, redis_client: redis.Redis, rate: float, capacity: int):
        self.redis = redis_client
        self.rate = rate
        self.capacity = capacity
        self.script = self.redis.register_script(self.LUA_SCRIPT)
    
    def is_allowed(self, key: str, tokens: int = 1) -> tuple[bool, float]:
        now = time.time()
        result = self.script(
            keys=[f"tokenbucket:{key}"],
            args=[self.rate, self.capacity, now, tokens]
        )
        return bool(result[0]), result[1]  # (allowed, remaining_tokens)
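
A quick usage sketch, assuming a Redis instance reachable on localhost (the tokenbucket: key prefix comes from the wrapper above):

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
bucket = RedisTokenBucket(r, rate=10.0, capacity=50)

allowed, remaining = bucket.is_allowed("user:42")
if not allowed:
    print(f"Throttled; roughly {remaining} tokens left in the bucket")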

Strengths

  • Accurate: All servers share the same counter
  • Simple model: Easy to reason about

Weaknesses

  • Single point of failure: If Redis goes down, what happens?
  • Latency: Every request requires a Redis round-trip
  • Scalability ceiling: Redis cluster has limits

3.3 Solution 2: Local Counters with Synchronization

Each server maintains local counters, periodically syncing with the central store.

Server 1:                          Server 2:
  Local count: 5                     Local count: 3
  Last sync: 10 sec ago              Last sync: 5 sec ago
         │                                  │
         └────────────┬─────────────────────┘
                      │ Periodic sync
                      ▼
              ┌───────────────┐
              │    Redis      │
              │ Global count  │
              └───────────────┘

Implementation:

import threading
import time

class LocalSyncRateLimiter:
    def __init__(
        self,
        redis_client: redis.Redis,
        local_limit: int,
        global_limit: int,
        window_seconds: int,
        sync_interval: float = 1.0
    ):
        self.redis = redis_client
        self.local_limit = local_limit  # Per-server allowance
        self.global_limit = global_limit
        self.window_seconds = window_seconds
        self.sync_interval = sync_interval
        
        self.local_counts: dict[str, int] = {}
        self.lock = threading.Lock()
        
        # Start sync thread
        self.sync_thread = threading.Thread(target=self._sync_loop, daemon=True)
        self.sync_thread.start()
    
    def is_allowed(self, key: str) -> bool:
        with self.lock:
            count = self.local_counts.get(key, 0)
            if count < self.local_limit:
                self.local_counts[key] = count + 1
                return True
            return False
    
    def _sync_loop(self):
        while True:
            time.sleep(self.sync_interval)
            self._sync_to_redis()
    
    def _sync_to_redis(self):
        with self.lock:
            counts_to_sync = self.local_counts.copy()
            self.local_counts.clear()
        
        pipe = self.redis.pipeline()
        for key, count in counts_to_sync.items():
            redis_key = f"ratelimit:{key}"
            pipe.incrby(redis_key, count)
            pipe.expire(redis_key, self.window_seconds)
        pipe.execute()

Strengths

  • Lower latency: Local checks are fast
  • Resilient: Can continue (briefly) if Redis is unavailable

Weaknesses

  • Approximate: Can exceed global limits between syncs
  • More complex: Need to tune local vs global limits
  • Coordination: Server count affects how local limits should be sized (see the sizing sketch below)
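
One common way to size the per-server allowance (a rough sketch, assuming traffic spreads roughly evenly across servers) is to split the global limit across servers and apply a safety factor, accepting that skewed traffic can still exceed the global limit between syncs:

def per_server_limit(global_limit: int, server_count: int, safety: float = 0.8) -> int:
    """Split the global budget evenly across servers, scaled down by a safety factor."""
    return max(1, int(global_limit / server_count * safety))

# Example: a 1000 req/min global limit across 50 servers -> 16 req/min per server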

3.4 Solution 3: Sticky Sessions

Route requests from the same user to the same server. Each server handles a subset of users.

Load Balancer (sticky sessions by user_id)
              │
    ┌─────────┼─────────┐
    ▼         ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐
│Server1│ │Server2│ │Server3│
│User A │ │User B │ │User C │
│User D │ │User E │ │User F │
└───────┘ └───────┘ └───────┘

Each server tracks limits only for its assigned users—no coordination needed!
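
Routing is typically done by hashing the user id (or with consistent hashing) at the load balancer. A minimal sketch of the idea, with placeholder server names:

import hashlib

SERVERS = ["server1", "server2", "server3"]

def pick_server(user_id: str) -> str:
    # The same user id always hashes to the same server, so that server
    # alone can enforce the user's limits with purely local state.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# In production you'd prefer consistent hashing so that adding or removing
# a server only remaps a fraction of users (and their counters).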

Strengths

  • No coordination: Each server is authoritative for its users
  • Fast: No remote calls for rate limiting
  • Simple: Local data structures only

Weaknesses

  • Uneven load: Some users are heavier than others
  • Failover complexity: What happens when a server dies?
  • Not always possible: Some architectures can't do sticky sessions

3.5 Comparing Distributed Approaches

Approach            Accuracy            Latency    Failure Mode                  Complexity
Centralized Redis   High                +1-2ms     Total failure if Redis down   Low
Local + Sync        Approximate         Very low   Continues with local limits   Medium
Sticky Sessions     High (per server)   Very low   Failover disruption           Low
Hybrid              High                Low        Graceful degradation          High

Chapter 4: Failure Modes and Graceful Degradation

The most important question in distributed rate limiting:

When your rate limit store fails, do you over-allow (let requests through) or under-allow (reject requests)?

4.1 The Over-Allow vs Under-Allow Trade-off

Over-allow (fail open):

  • If Redis is down, allow all requests
  • Pro: Legitimate users aren't blocked
  • Con: No protection during outage; abuse possible

Under-allow (fail closed):

  • If Redis is down, reject all requests
  • Pro: System protected from overload
  • Con: Legitimate users blocked during outage

Neither is universally correct. The choice depends on what you're protecting.

4.2 Decision Framework

Scenario                        Recommendation   Reasoning
DDoS protection                 Under-allow      Better to block everyone than let the attack through
API monetization                Over-allow       Don't lose paying customers due to your bug
Resource protection             Under-allow      Protect backend systems from overload
User-facing rate limits         Over-allow       User experience > strict enforcement
Cost control (expensive APIs)   Under-allow      Financial risk of over-allowing is high

4.3 Implementing Graceful Degradation

Hybrid approach: Local limits as fallback when centralized store fails.

class ResilientRateLimiter:
    def __init__(
        self,
        redis_client: redis.Redis,
        global_limit: int,
        local_fallback_limit: int,
        window_seconds: int,
        failure_mode: str = 'over_allow'  # or 'under_allow'
    ):
        self.redis = redis_client
        self.global_limit = global_limit
        self.local_fallback_limit = local_fallback_limit
        self.window_seconds = window_seconds
        self.failure_mode = failure_mode
        
        self.local_limiter = SlidingWindowCounterLimiter(
            local_fallback_limit, window_seconds
        )
        self.redis_available = True
        self.last_redis_check = 0
        self.redis_check_interval = 5  # seconds
    
    def is_allowed(self, key: str) -> tuple[bool, str]:
        """Returns (allowed, mode) where mode is 'redis', 'local', or 'failed'."""
        
        # Try Redis first
        if self._redis_is_available():
            try:
                return self._check_redis(key), 'redis'
            except redis.RedisError:
                self.redis_available = False
        
        # Redis unavailable - use fallback strategy
        if self.failure_mode == 'over_allow':
            # Use local limiter as soft limit
            allowed = self.local_limiter.is_allowed(key)
            return allowed, 'local'
        else:
            # Fail closed - but maybe allow some traffic
            if self.local_fallback_limit > 0:
                # Very conservative local limit
                allowed = self.local_limiter.is_allowed(key)
                return allowed, 'local_strict'
            return False, 'failed'
    
    def _redis_is_available(self) -> bool:
        now = time.time()
        if not self.redis_available:
            # Periodically retry Redis
            if now - self.last_redis_check > self.redis_check_interval:
                self.last_redis_check = now
                try:
                    self.redis.ping()
                    self.redis_available = True
                except redis.RedisError:
                    pass
        return self.redis_available
    
    def _check_redis(self, key: str) -> bool:
        # Implement your Redis rate limit check here
        ...

4.4 Monitoring Rate Limiter Health

You need to know when your rate limiter is degraded:

from prometheus_client import Counter, Histogram, Gauge

# Metrics
rate_limit_decisions = Counter(
    'rate_limit_decisions_total',
    'Rate limit decisions',
    ['result', 'mode']  # result=allowed/denied, mode=redis/local/failed
)

rate_limit_latency = Histogram(
    'rate_limit_check_seconds',
    'Time to check rate limit',
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1]
)

redis_available = Gauge(
    'rate_limiter_redis_available',
    'Whether Redis is available for rate limiting'
)

class InstrumentedRateLimiter:
    def __init__(self, limiter: ResilientRateLimiter):
        self.limiter = limiter
    
    def is_allowed(self, key: str) -> bool:
        with rate_limit_latency.time():
            allowed, mode = self.limiter.is_allowed(key)
        
        result = 'allowed' if allowed else 'denied'
        rate_limit_decisions.labels(result=result, mode=mode).inc()
        redis_available.set(1 if self.limiter.redis_available else 0)
        
        return allowed

Key metrics to alert on:

  • Redis availability dropping
  • Rate limit latency increasing
  • Unusual spike in denials
  • Mode switching from 'redis' to 'local'

Part II: The Design Challenge

Chapter 5: Designing a Distributed Rate Limiter

You're designing a rate limiting service for your platform. Requirements:

  • Scale: 10,000 requests per second across 50 application servers
  • Limits: Per-user limits (100 req/min), per-API-key limits (1000 req/min), global limits
  • Latency: Rate limit check must add < 5ms to request latency (p99)
  • Accuracy: Within 10% of configured limits
  • Resilience: Continue operating if Redis has a brief outage

5.1 Architecture Overview

                              ┌────────────────────────────────────────────┐
                              │              Rate Limit Service             │
                              │                                            │
┌─────────────┐               │   ┌────────────────────────────────────┐   │
│   Request   │               │   │          Configuration Store        │   │
│   from      │───────────────┼──▶│   (Limit rules, per-tier configs)  │   │
│   Client    │               │   └────────────────────────────────────┘   │
└─────────────┘               │                    │                       │
                              │                    ▼                       │
                              │   ┌────────────────────────────────────┐   │
                              │   │         Rate Limit Logic            │   │
                              │   │   ┌──────────┐  ┌──────────────┐   │   │
                              │   │   │  Local   │  │    Redis     │   │   │
                              │   │   │  Cache   │◀─┼──▶  Cluster   │   │   │
                              │   │   └──────────┘  └──────────────┘   │   │
                              │   └────────────────────────────────────┘   │
                              │                    │                       │
                              │                    ▼                       │
                              │            ┌───────────────┐               │
                              │            │   Decision    │               │
                              │            │ Allow / Deny  │               │
                              │            └───────────────┘               │
                              └────────────────────────────────────────────┘

5.2 Data Model

Rate limit rules (stored in configuration):

from dataclasses import dataclass
from typing import Optional

@dataclass
class RateLimitRule:
    id: str
    name: str
    key_pattern: str  # e.g., "user:{user_id}", "api_key:{api_key}"
    limit: int
    window_seconds: int
    algorithm: str  # 'sliding_window' or 'token_bucket'
    scope: str  # 'per_user', 'per_api_key', 'global'
    burst_capacity: Optional[int] = None  # For token bucket
    tier: Optional[str] = None  # 'free', 'pro', 'enterprise'
    
# Example rules
rules = [
    RateLimitRule(
        id="user_basic",
        name="Basic user limit",
        key_pattern="user:{user_id}",
        limit=100,
        window_seconds=60,
        algorithm="sliding_window",
        scope="per_user",
        tier="free"
    ),
    RateLimitRule(
        id="user_pro",
        name="Pro user limit",
        key_pattern="user:{user_id}",
        limit=1000,
        window_seconds=60,
        algorithm="sliding_window",
        scope="per_user",
        tier="pro"
    ),
    RateLimitRule(
        id="api_key",
        name="API key limit",
        key_pattern="apikey:{api_key}",
        limit=10000,
        window_seconds=3600,
        algorithm="token_bucket",
        burst_capacity=500,
        scope="per_api_key"
    ),
    RateLimitRule(
        id="global_writes",
        name="Global write limit",
        key_pattern="global:writes",
        limit=5000,
        window_seconds=1,
        algorithm="sliding_window",
        scope="global"
    )
]
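
The multi-level limiter in the next section also assumes a request context and a result type that aren't defined elsewhere in this article; one plausible sketch of both:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestContext:
    user_id: Optional[str]
    api_key: Optional[str]
    ip_address: str
    user_tier: str = 'free'
    endpoint: str = ''
    method: str = 'GET'

@dataclass
class RateLimitResult:
    rule_id: str
    allowed: bool
    limit: int
    remaining: int
    reset: int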

5.3 Multi-Level Rate Limiting

A single request might be subject to multiple limits:

class MultiLevelRateLimiter:
    def __init__(self, redis_client: redis.Redis, rules: list[RateLimitRule]):
        self.redis = redis_client
        self.rules = rules
        self.limiters = self._create_limiters()
    
    def _create_limiters(self) -> dict:
        limiters = {}
        for rule in self.rules:
            if rule.algorithm == 'sliding_window':
                # RedisRateLimiter is the Redis-backed sliding-window limiter from Chapter 3
                limiters[rule.id] = RedisRateLimiter(
                    self.redis, rule.limit, rule.window_seconds
                )
            elif rule.algorithm == 'token_bucket':
                limiters[rule.id] = RedisTokenBucket(
                    self.redis, 
                    rate=rule.limit / rule.window_seconds,
                    capacity=rule.burst_capacity or rule.limit
                )
        return limiters
    
    def check_request(self, context: RequestContext) -> RateLimitResult:
        """
        Check all applicable rules for this request.
        Returns the most restrictive result.
        """
        applicable_rules = self._get_applicable_rules(context)
        
        results = []
        for rule in applicable_rules:
            key = self._build_key(rule, context)
            limiter = self.limiters[rule.id]
            
            allowed, info = limiter.is_allowed(key)
            results.append(RateLimitResult(
                rule_id=rule.id,
                allowed=allowed,
                limit=rule.limit,
                remaining=info.get('remaining', 0),
                reset=info.get('reset', 0)
            ))
        
        # If any rule denies, the request is denied
        denied_results = [r for r in results if not r.allowed]
        if denied_results:
            # Return the denial with the latest reset time (the longest wait)
            return max(denied_results, key=lambda r: r.reset)
        
        # All rules allow - return the one with lowest remaining quota
        return min(results, key=lambda r: r.remaining)
    
    def _get_applicable_rules(self, context: RequestContext) -> list[RateLimitRule]:
        applicable = []
        for rule in self.rules:
            # Check tier match
            if rule.tier and rule.tier != context.user_tier:
                continue
            # Check scope
            if rule.scope == 'per_user' and not context.user_id:
                continue
            if rule.scope == 'per_api_key' and not context.api_key:
                continue
            applicable.append(rule)
        return applicable
    
    def _build_key(self, rule: RateLimitRule, context: RequestContext) -> str:
        key = rule.key_pattern
        key = key.replace('{user_id}', context.user_id or 'anonymous')
        key = key.replace('{api_key}', context.api_key or 'none')
        key = key.replace('{ip}', context.ip_address)
        return key

5.4 Response Headers

Communicate rate limit status to clients:

def add_rate_limit_headers(response: Response, result: RateLimitResult):
    response.headers['X-RateLimit-Limit'] = str(result.limit)
    response.headers['X-RateLimit-Remaining'] = str(result.remaining)
    response.headers['X-RateLimit-Reset'] = str(result.reset)
    
    if not result.allowed:
        response.headers['Retry-After'] = str(result.reset - int(time.time()))

Standard headers:

  • X-RateLimit-Limit: The rate limit ceiling
  • X-RateLimit-Remaining: Requests remaining in current window
  • X-RateLimit-Reset: Unix timestamp when the limit resets
  • Retry-After: Seconds until client should retry (on 429)

5.5 The Challenge Question: Redis Node Dies

Scenario: You have a 3-node Redis cluster. One node dies. What happens to rate limiting?

Analysis:

  1. Keys on the dead node are temporarily inaccessible

    • In Redis Cluster, keys are distributed across nodes
    • ~33% of keys are on the dead node
    • Those rate limit counters are unavailable
  2. Cluster failover takes time

    • Redis Cluster automatic failover: 10-30 seconds
    • During this time, reads/writes to affected keys fail
  3. Options during the outage:

    Option A: Fail open for affected keys

    def is_allowed(self, key: str) -> bool:
        try:
            return self._check_redis(key)
        except redis.ClusterDownError:
            # Redis node down - allow the request
            return True
    

    Option B: Fail closed for affected keys

    def is_allowed(self, key: str) -> bool:
        try:
            return self._check_redis(key)
        except redis.ClusterDownError:
            # Redis node down - deny the request
            return False
    

    Option C: Fall back to local rate limiting

    def is_allowed(self, key: str) -> bool:
        try:
            return self._check_redis(key)
        except redis.ClusterDownError:
            # Use conservative local limit
            return self.local_limiter.is_allowed(key)
    
  4. After failover completes:

    • If replica was in sync, no data loss
    • If replica was behind, some recent increments lost
    • Rate limits might be slightly "reset" for affected keys

My recommendation: Option C (local fallback) with conservative limits. During the ~30 second outage:

  • 66% of keys (other nodes) continue normally
  • 33% of keys (affected node) use local fallback
  • No users are completely blocked
  • Some users might slightly exceed limits
  • System remains available

5.6 Complete Implementation

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis
import time

app = FastAPI()

class RateLimitMiddleware:
    def __init__(self, app: FastAPI, limiter: MultiLevelRateLimiter):
        self.app = app
        self.limiter = limiter
    
    async def __call__(self, request: Request, call_next):
        # Extract context from request
        context = RequestContext(
            user_id=request.headers.get('X-User-ID'),
            api_key=request.headers.get('X-API-Key'),
            ip_address=request.client.host,
            user_tier=request.headers.get('X-User-Tier', 'free'),
            endpoint=request.url.path,
            method=request.method
        )
        
        # Check rate limits
        result = self.limiter.check_request(context)
        
        if not result.allowed:
            response = JSONResponse(
                status_code=429,
                content={
                    'error': 'rate_limit_exceeded',
                    'message': f'Rate limit exceeded. Retry after {result.reset - int(time.time())} seconds.',
                    'limit': result.limit,
                    'reset': result.reset
                }
            )
            add_rate_limit_headers(response, result)
            return response
        
        # Proceed with request
        response = await call_next(request)
        add_rate_limit_headers(response, result)
        return response


# Setup
redis_client = redis.RedisCluster(
    host='redis-cluster.example.com',
    port=6379,
    decode_responses=True
)

limiter = MultiLevelRateLimiter(redis_client, rules)
app.middleware('http')(RateLimitMiddleware(app, limiter))

Part III: Advanced Topics

Chapter 6: Rate Limiting Patterns

6.1 Tiered Rate Limits

Different limits for different user tiers:

TIER_LIMITS = {
    'free': {
        'requests_per_minute': 60,
        'requests_per_day': 1000,
        'burst': 10
    },
    'pro': {
        'requests_per_minute': 600,
        'requests_per_day': 50000,
        'burst': 100
    },
    'enterprise': {
        'requests_per_minute': 6000,
        'requests_per_day': None,  # Unlimited
        'burst': 1000
    }
}

def get_limits_for_user(user: User) -> dict:
    return TIER_LIMITS.get(user.tier, TIER_LIMITS['free'])

6.2 Endpoint-Specific Limits

Different endpoints have different costs:

ENDPOINT_COSTS = {
    'GET /api/users': 1,
    'POST /api/users': 5,
    'GET /api/reports': 10,  # Expensive query
    'POST /api/ai/generate': 100,  # Very expensive
}

class CostAwareRateLimiter:
    def __init__(self, limiter: TokenBucketLimiter):
        self.limiter = limiter
    
    def check_request(self, user_id: str, endpoint: str) -> bool:
        cost = ENDPOINT_COSTS.get(endpoint, 1)
        return self.limiter.is_allowed(user_id, tokens=cost)

6.3 Adaptive Rate Limiting

Adjust limits based on system load:

class AdaptiveRateLimiter:
    def __init__(self, base_limiter: RateLimiter, load_monitor: LoadMonitor):
        self.limiter = base_limiter
        self.load_monitor = load_monitor
    
    def get_effective_limit(self, base_limit: int) -> int:
        load = self.load_monitor.get_current_load()  # 0.0 to 1.0
        
        if load < 0.5:
            # Low load - allow 20% more
            return int(base_limit * 1.2)
        elif load < 0.8:
            # Normal load - use base limit
            return base_limit
        elif load < 0.95:
            # High load - reduce by 20%
            return int(base_limit * 0.8)
        else:
            # Critical load - reduce by 50%
            return int(base_limit * 0.5)
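
The LoadMonitor here is assumed rather than defined; a minimal sketch might keep an exponentially weighted moving average of request latency and normalize it against a saturation target:

class LoadMonitor:
    """Tracks a smoothed load signal in the range 0.0 (idle) to 1.0 (saturated)."""

    def __init__(self, saturation_latency_ms: float = 200.0, alpha: float = 0.1):
        self.saturation_latency_ms = saturation_latency_ms  # latency treated as "fully loaded"
        self.alpha = alpha                                  # EWMA smoothing factor
        self._ewma_ms = 0.0

    def record_latency(self, latency_ms: float) -> None:
        # Exponentially weighted moving average of observed request latency
        self._ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self._ewma_ms

    def get_current_load(self) -> float:
        return min(1.0, self._ewma_ms / self.saturation_latency_ms)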

6.4 Rate Limit Exemptions

Some requests should bypass rate limits:

EXEMPT_PATTERNS = [
    '/health',
    '/metrics',
    '/api/webhooks/stripe',  # Stripe webhook callbacks
]

EXEMPT_API_KEYS = {
    'internal-service-key-123',
    'monitoring-key-456',
}

def should_exempt(request: Request) -> bool:
    # Check path exemptions
    for pattern in EXEMPT_PATTERNS:
        if request.url.path.startswith(pattern):
            return True
    
    # Check API key exemptions
    api_key = request.headers.get('X-API-Key')
    if api_key in EXEMPT_API_KEYS:
        return True
    
    return False

Chapter 7: Rate Limiting at the Edge

7.1 CDN-Level Rate Limiting

For DDoS protection, rate limit at the edge:

Internet
    │
    ▼
┌─────────────────────────────────────────────────┐
│              CDN / Edge Network                  │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │  Edge   │  │  Edge   │  │  Edge   │  ...    │
│  │  POP 1  │  │  POP 2  │  │  POP 3  │         │
│  │(Tokyo)  │  │(London) │  │(NYC)    │         │
│  └────┬────┘  └────┬────┘  └────┬────┘         │
│       │            │            │               │
│       └────────────┼────────────┘               │
│                    │                            │
│              Rate Limit Check                   │
│         (per IP, per region, global)           │
└────────────────────┼────────────────────────────┘
                     │
                     ▼
              Origin Servers

Cloudflare rate limiting rules (example):

Rule: Block if > 100 requests in 10 seconds from same IP
Action: Block for 1 hour
Scope: All endpoints

Rule: Challenge if > 1000 requests in 1 minute from same /24 subnet
Action: JavaScript challenge
Scope: Login endpoints

7.2 API Gateway Rate Limiting

AWS API Gateway, Kong, or similar:

# Kong rate limiting configuration
plugins:
  - name: rate-limiting
    config:
      minute: 100
      hour: 1000
      policy: redis
      redis_host: redis.example.com
      redis_port: 6379
      redis_database: 0
      fault_tolerant: true  # Continue if Redis fails
      hide_client_headers: false

Benefits:

  • No application code changes
  • Centralized configuration
  • Can rate limit before hitting your servers

Limitations:

  • Less flexibility than application-level
  • Can't easily do user-tier-based limits without custom integration

Part IV: Discussion and Trade-offs

Chapter 8: The Hard Questions

8.1 "One Redis node dies—what's the behavior?"

Complete answer:

"When a Redis node in a cluster dies, keys hashed to that node are temporarily unavailable. Here's my approach:

First, understand the impact: with a 3-node cluster, ~33% of keys are affected. Rate limit checks for those keys will fail.

For the ~30-second failover window, I'd use a fallback strategy:

try:
    return redis_limiter.is_allowed(key)
except redis.ClusterDownError:
    return local_fallback_limiter.is_allowed(key)

The local fallback uses a conservative limit—maybe 10% of the global limit per server. With 50 servers, that's 500% of the global limit in total, so we might over-allow 5x during the outage. That's acceptable for 30 seconds.

After failover, the new primary takes over. If it was slightly behind the failed primary, we might have lost some recent increments—effectively a mini-reset of rate limits for those keys. I'd accept this rather than blocking all requests.

The key is: monitor this scenario. Alert when Redis cluster is degraded. Log when fallback mode activates. Review after every incident."

8.2 "Would you over-allow or under-allow during failures?"

Complete answer:

"It depends on what the rate limit is protecting.

I'd over-allow for:

  • User-facing API limits (don't block paying customers)
  • General abuse prevention (attackers can wait 30 seconds anyway)
  • Non-critical endpoints

I'd under-allow for:

  • Cost control on expensive downstream APIs (financial risk)
  • Resource protection when the system is already stressed
  • Security-sensitive endpoints (login, password reset)

For our platform, I'd configure per-rule behavior:

class RateLimitRule:
    ...
    failure_mode: str  # 'open' or 'closed'

Most rules would be 'open'. But the rule limiting calls to our AI model (which costs $0.01 per call) would be 'closed'—I'd rather block users than accidentally run up a huge bill.

The hybrid approach—local conservative limits during outage—gives a middle ground. Users aren't completely blocked, but we're not completely unprotected either."

8.3 "How do you prevent abuse of the rate limit system itself?"

Potential attacks:

  1. Key enumeration: Attacker tries to fill up rate limit storage with millions of unique keys
  2. Clock manipulation: If using client timestamps, attacker sends fake times
  3. Distributed attack: Attacker uses many IPs to stay under per-IP limits

Defenses:

# Defense against key enumeration
class SecureRateLimiter:
    def __init__(self, max_unique_keys: int = 1_000_000):
        self.max_unique_keys = max_unique_keys
    
    def is_allowed(self, key: str) -> bool:
        # Use server time, not client time
        now = time.time()
        
        # Limit total keys tracked (with LRU eviction)
        if self.key_count() > self.max_unique_keys:
            self.evict_oldest_keys()
        
        # For anonymous users, use IP + user agent hash
        # This makes it harder to generate unique keys
        ...

# Defense against distributed attacks
class IPSubnetLimiter:
    def is_allowed(self, ip: str) -> bool:
        # Limit per /24 subnet, not just per IP
        subnet = self.get_subnet(ip, prefix_length=24)
        return self.limiter.is_allowed(f"subnet:{subnet}")

Chapter 9: Session Summary

What You Should Know Now

After this session, you should be able to:

  1. Explain the four main algorithms (fixed window, sliding window log, sliding window counter, token bucket) with trade-offs
  2. Design distributed rate limiting using Redis or local+sync approaches
  3. Handle failure modes with explicit over-allow or under-allow decisions
  4. Implement multi-level rate limits (per-user, per-API-key, per-endpoint)
  5. Monitor and debug rate limiting systems

Key Trade-offs to Remember

Decision                         Trade-off
Fixed vs Sliding window          Simplicity vs Accuracy at boundaries
Token bucket vs Sliding window   Allows bursting vs Strict limits
Centralized vs Local counters    Accuracy vs Latency/Resilience
Over-allow vs Under-allow        User experience vs Protection
Per-IP vs Per-user limits        Coverage vs Fairness to shared IPs

Questions to Ask in Every Design

  1. What are we protecting? (Resources? Cost? Fairness?)
  2. What's the acceptable accuracy? (Exact? Within 10%?)
  3. What happens during Redis failure?
  4. How do we communicate limits to users?
  5. How do we handle different user tiers?

Part V: Interview Questions and Answers

Chapter 10: Real-World Interview Scenarios

10.1 Conceptual Questions

Question 1: "Explain the token bucket algorithm."

Interviewer's Intent: Testing fundamental algorithm knowledge.

Strong Answer:

"Token bucket is a rate limiting algorithm that allows controlled bursting. Imagine a bucket that holds tokens—it fills at a constant rate (say, 10 tokens per second) up to a maximum capacity (say, 100 tokens).

Each request consumes tokens from the bucket. If there are enough tokens, the request is allowed and tokens are deducted. If not, the request is rejected.

The key properties:

  • Long-term rate is bounded by the fill rate (10/sec)
  • Short-term bursts are allowed up to bucket capacity (100)
  • After a burst, you have to wait for the bucket to refill

Implementation-wise, you don't actually simulate filling—you calculate tokens based on time elapsed:

tokens = min(capacity, tokens + (elapsed_time * rate))

I'd use token bucket when I want to allow bursting, like a user who's been idle for a while making several rapid requests. For strict rate enforcement without bursting, I'd use sliding window counter instead."


Question 2: "What's the boundary problem with fixed window rate limiting?"

Interviewer's Intent: Testing understanding of algorithm edge cases.

Strong Answer:

"Fixed window rate limiting divides time into fixed intervals—say, minute boundaries. You count requests in each minute and reset at the boundary.

The problem: a user can make their full quota at the end of one window and again at the start of the next window, effectively doubling their rate.

Example with 100 requests/minute limit:

  • 99 requests at 12:00:59 (Window 1)
  • 99 requests at 12:01:01 (Window 2)

That's 198 requests in 2 seconds, way over the intended rate.

Sliding window counter fixes this by weighting the previous window's count based on how much of it overlaps with the current sliding window. If we're 30 seconds into the new window, we count 50% of the previous window's requests plus all of the current window's requests.

In practice, I'd always use sliding window counter for user-facing rate limits. Fixed window is only acceptable for very rough limiting where 2x burst at boundaries is tolerable."


Question 3: "How would you rate limit across multiple servers?"

Interviewer's Intent: Testing distributed systems thinking.

Strong Answer:

"The challenge is that without coordination, each server only sees a fraction of requests. A user could exceed limits by spreading requests across servers.

Three approaches:

Centralized store (Redis): All servers check/update a shared counter. Accurate but adds latency and creates a dependency.

Local + periodic sync: Each server has local counters, periodically syncing to Redis. Lower latency, but accuracy suffers between syncs.

Sticky sessions: Route each user to a consistent server. No coordination needed, but uneven load distribution and failover complexity.

I'd typically use Redis with the sliding window counter algorithm. The latency hit (~1-2ms) is acceptable for most APIs. For ultra-low-latency requirements, I'd use local counters with sync, accepting that we might over-allow by up to (number of servers × local limit) during the sync interval.

The key is building in graceful degradation: if Redis fails, fall back to local limits rather than failing completely open or closed."


10.2 Design Questions

Question 4: "Design a rate limiter for a multi-tenant SaaS API."

Interviewer's Intent: Testing end-to-end system design.

Strong Answer:

"Let me understand the requirements. Multi-tenant means different customers with potentially different limits. I'll assume we have tiers: free (100 req/min), pro (1000 req/min), and enterprise (custom).

Architecture:

Request → Identify Tenant → Look Up Limits → Check Counter → Allow/Deny

Tenant identification: From API key in header. Each API key maps to a tenant and tier.

Limit storage: I'd store limit configurations in PostgreSQL with caching:

class TenantConfig:
    tenant_id: str
    tier: str
    custom_limits: dict  # For enterprise overrides

Counter storage: Redis with sliding window counter. Keys like ratelimit:{tenant_id}:{endpoint_group}.

Multi-level limits:

  • Per-tenant overall (1000/min)
  • Per-tenant per-endpoint (e.g., 100/min for writes)
  • Global system protection (100K/min total across all tenants)

Response: 429 Too Many Requests with headers showing limit, remaining, and reset time.

Failure handling: If Redis is down, fall back to local counters. Free tier gets strict local limits (fail closed to protect resources). Paid tiers get lenient local limits (fail open to avoid blocking paying customers).

Monitoring: Track rate limit hits per tenant, alert on high rejection rates, dashboard showing top consumers."


Question 5: "Design rate limiting for a real-time chat application."

Interviewer's Intent: Testing adaptation to specific use case.

Strong Answer:

"Chat has different requirements than typical APIs. Users expect instant message delivery, and messages come in bursts (typing a conversation).

Key considerations:

  • Users chat in bursts—strict per-second limits would feel terrible
  • Different limits for different actions (messages, media uploads, connection attempts)
  • Abuse vectors: spam, flooding a room, repeated connection attempts

My approach:

Token bucket for messages: Fill rate 1 token/second, capacity 20 tokens. Users can burst 20 messages quickly (normal conversation), but sustained spam is blocked.

message_limits = {
    'direct_message': TokenBucket(rate=1, capacity=20),
    'group_message': TokenBucket(rate=0.5, capacity=10),  # Stricter for groups
    'media_upload': TokenBucket(rate=0.1, capacity=3),  # Much stricter
}

Fixed window for connections: Limit reconnection attempts to prevent connection-spam attacks:

connection_limit = FixedWindow(limit=10, window_minutes=1)

Per-room limits: In addition to per-user limits, limit the total message rate per room to prevent coordinated spam:

room_limit = SlidingWindow(limit=100, window_seconds=10)

Implementation: Local limits on each chat server (which maintains WebSocket connections), with async sync to Redis for cross-server visibility. The local-first approach is crucial for low latency—we can't add a Redis round trip to every message.

User experience: Instead of hard rejection, I'd queue messages that exceed soft limits and deliver them slowly. Only hard reject for severe abuse."


10.3 Scenario-Based Questions

Question 6: "Your rate limiter is rejecting legitimate users during traffic spikes. What do you do?"

Interviewer's Intent: Testing operational problem-solving.

Strong Answer:

"First, triage: Is this affecting all users or specific ones? Are limits correctly configured? Is it actually the rate limiter or something else?

Investigation steps:

  1. Check rate limiter metrics: What's the rejection rate? Which rules are triggering?

  2. Look at specific users: Are they actually exceeding limits, or is the rate limiter malfunctioning?

  3. Check Redis health: High latency or failures could cause incorrect counts.

Immediate mitigations (in order of risk):

  1. Increase limits temporarily: If legitimate traffic grew, raise limits to accommodate:

    redis-cli SET ratelimit:config:user_basic:limit 200  # Double the limit
    
  2. Add more headroom per tier: Maybe our limits were too conservative.

  3. Switch to local-only limiting temporarily: If Redis is the bottleneck, local limits might be less restrictive.

  4. Exempt specific users/endpoints: If one critical customer is blocked, add them to exemption list while we fix things.

Post-incident:

  1. Review whether limits match actual usage patterns
  2. Consider adaptive rate limiting that adjusts to traffic patterns
  3. Add alerting for rejection rate spikes
  4. Maybe implement graduated responses: soft limit (warn) before hard limit (block)"

Question 7: "How would you test your rate limiter?"

Interviewer's Intent: Testing quality engineering mindset.

Strong Answer:

"Testing rate limiters requires validating both correctness and behavior under failure.

Unit tests for algorithms:

def test_sliding_window_allows_under_limit():
    limiter = SlidingWindowLimiter(limit=10, window_seconds=60)
    for _ in range(10):
        assert limiter.is_allowed('user1') == True
    assert limiter.is_allowed('user1') == False

def test_sliding_window_resets():
    limiter = SlidingWindowLimiter(limit=10, window_seconds=1)
    for _ in range(10):
        limiter.is_allowed('user1')
    time.sleep(1)
    assert limiter.is_allowed('user1') == True

def test_fixed_window_boundary_burst():
    # Verify we understand the boundary problem
    limiter = FixedWindowLimiter(limit=10, window_seconds=60)
    # Simulate requests at window boundary
    ...

Integration tests with Redis:

def test_distributed_counting():
    # Simulate multiple servers
    limiter1 = RedisRateLimiter(redis_client, limit=10, window=60)
    limiter2 = RedisRateLimiter(redis_client, limit=10, window=60)
    
    for _ in range(5):
        limiter1.is_allowed('user1')
        limiter2.is_allowed('user1')
    
    # Both should see the same count
    assert limiter1.is_allowed('user1') == False

Load tests:

# Hit the rate limiter with 10K requests/second
# Verify:
# - Latency stays under 5ms p99
# - Correct rejection rate (within 10% of expected)
# - No memory leaks
# - Redis connection pool is healthy

Chaos tests:

def test_redis_failure_recovery():
    # Make requests (should use Redis)
    # Kill Redis
    # Make requests (should use fallback)
    # Verify fallback behavior matches config (open/closed)
    # Restart Redis
    # Make requests (should use Redis again)

Production testing:

  • Shadow mode: Log what would be rejected without actually rejecting
  • Canary: Route 1% of traffic to new limiter version
  • Gradual rollout with kill switch"

10.4 Deep-Dive Questions

Question 8: "Compare rate limiting implementations in Redis vs in-memory."

Interviewer's Intent: Testing infrastructure trade-off knowledge.

Strong Answer:

"Redis-based:

Pros:

  • Shared state across all servers (accurate global limits)
  • Persistent (survives server restarts)
  • Atomic operations (Lua scripts for complex logic)
  • Scales independently of application servers

Cons:

  • Network latency (1-2ms per check)
  • Single point of failure (even clustered)
  • Operational overhead
  • Cost at scale

In-memory:

Pros:

  • Near-zero latency (~microseconds)
  • No external dependencies
  • Simple deployment
  • No network failure modes

Cons:

  • Per-server limits only (inaccurate global limits)
  • State lost on restart
  • The effective global limit scales with the number of application servers (N servers × per-server limit), which may not be what you want
  • Each server uses memory

When to use each:

Redis: User-facing API limits where accuracy matters, multi-tenant systems, when you need persistence.

In-memory: Ultra-low-latency requirements, per-connection limits (like WebSocket messages), local caching layer before Redis.

Hybrid: an in-memory first pass to reject obvious violations, with Redis for borderline cases; or in-memory counting with periodic Redis sync (see the sketch after this answer).

For most web APIs, Redis with proper fallback is the right choice. The 1-2ms overhead is negligible compared to typical API latencies."
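A sketch of the hybrid first-pass idea, assuming both limiters expose the same is_allowed interface returning a result object with an allowed flag (the class name and wiring are illustrative):

class HybridLimiter:
    """Cheap in-memory pre-check, then the authoritative Redis-backed check."""

    def __init__(self, local_limiter, redis_limiter):
        # local_limiter is assumed to be configured more permissively than
        # redis_limiter, so it only rejects obvious floods without a network hop.
        self.local = local_limiter
        self.redis = redis_limiter

    def is_allowed(self, key: str):
        local_result = self.local.is_allowed(key)
        if not local_result.allowed:
            return local_result            # obvious violation: skip the round trip
        return self.redis.is_allowed(key)  # accurate, globally shared decision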


Question 9: "How do services like Cloudflare and AWS do rate limiting at scale?"

Interviewer's Intent: Testing knowledge of real-world systems.

Strong Answer:

"Cloudflare:

They rate limit at the edge—in their 300+ data centers worldwide. Each edge PoP makes local decisions with eventual consistency across PoPs.

For per-IP limits, this works well—most requests from one IP hit the same PoP. For per-user limits (requiring API key lookup), they need to route to origin or use distributed state.

They use a combination of in-memory counting at the edge and centralized aggregation for global limits. The key insight is that for DDoS protection, exact accuracy isn't needed—blocking 90% of an attack is almost as good as blocking 100%.

AWS API Gateway:

Uses a token bucket algorithm. Limits are enforced per AWS region; for global limits across regions, you'd need to build your own coordination.

They integrate with AWS WAF for additional IP-based limiting. The rate limiting state is stored in their internal distributed systems, not exposed to users.

Stripe:

Uses a custom sliding window implementation. They're notable for excellent rate limit headers and detailed error messages. They have per-resource limits (different limits for creating customers vs creating charges).

Common patterns:

  1. Edge limiting for DDoS (approximate, fast)
  2. Application limiting for quotas (accurate, slower)
  3. Hierarchical limits (IP < User < Account < Global), sketched after this answer
  4. Graceful degradation during partial failures

The theme is: different types of limits at different layers, accepting that distributed rate limiting is inherently approximate."
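A sketch of that hierarchical pattern, assuming a checker with the is_allowed(key, limit, window) interface from the appendix; the limits, key prefixes, and request fields (ip, user_id, account_id) are illustrative:

# Ordered from narrowest to broadest scope; every level must pass.
LIMIT_HIERARCHY = [
    ("ip",      lambda req: f"ip:{req.ip}",            1000, 60),
    ("user",    lambda req: f"user:{req.user_id}",       100, 60),
    ("account", lambda req: f"acct:{req.account_id}",    500, 60),
    ("global",  lambda req: "global",                  10000, 60),
]

def check_hierarchy(limiter, req):
    """Return (allowed, level_that_rejected)."""
    for level, key_fn, limit, window in LIMIT_HIERARCHY:
        result = limiter.is_allowed(key_fn(req), limit=limit, window=window)
        if not result.allowed:
            return False, level
    return True, None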


Chapter 11: Interview Preparation Checklist

Before your interview, make sure you can:

Algorithms

  • Explain fixed window, sliding window log, sliding window counter, token bucket
  • Describe the boundary problem and how sliding window solves it
  • Know when to use token bucket vs sliding window

Distributed Systems

  • Design rate limiting with Redis
  • Explain the coordination problem across multiple servers
  • Implement fallback strategies for Redis failure

Failure Modes

  • Articulate over-allow vs under-allow trade-off
  • Design graceful degradation
  • Know what to monitor and alert on

Real-World Knowledge

  • Know how CDNs/API Gateways do rate limiting
  • Understand multi-tenant rate limiting
  • Cite real examples (Stripe, Cloudflare)

Exercises

Exercise 1: Algorithm Implementation

Implement a sliding window counter rate limiter that:

  • Supports configurable window size and limit
  • Returns remaining quota and reset time
  • Handles concurrent access safely

Exercise 2: Distributed Design

Design a rate limiting system for a video streaming service:

  • 1M concurrent users
  • Limits on: video starts (5/min), API calls (100/min), bandwidth (10GB/day)
  • Global and per-user limits
  • Handle CDN edge servers and origin servers

Exercise 3: Failure Analysis

Your rate limiter uses a 3-node Redis cluster. Analyze what happens in each scenario:

  1. One node becomes slow (100ms latency)
  2. One node dies and fails over (30 second outage)
  3. Network partition isolates your app servers from Redis
  4. Redis runs out of memory

For each, describe the user experience and your recommended handling.


Further Reading

  • "Rate Limiting" chapter in Designing Data-Intensive Applications
  • Stripe's API Rate Limiting: Excellent documentation and implementation
  • Cloudflare Blog: Posts on edge rate limiting and DDoS protection
  • Redis Rate Limiting Patterns: Official Redis documentation
  • Kong Rate Limiting Plugin: Open-source API gateway approach

Appendix: Code Reference

A.1 Complete Redis Rate Limiter

import redis
import time
from dataclasses import dataclass
from typing import Optional, Tuple
from enum import Enum

class RateLimitAlgorithm(Enum):
    SLIDING_WINDOW = "sliding_window"
    TOKEN_BUCKET = "token_bucket"

@dataclass
class RateLimitResult:
    allowed: bool
    limit: int
    remaining: int
    reset: int
    retry_after: Optional[int] = None

class ProductionRateLimiter:
    """
    Production-ready rate limiter with:
    - Multiple algorithm support
    - Graceful degradation
    - Comprehensive metrics
    - Multi-key support
    """
    
    SLIDING_WINDOW_SCRIPT = """
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])
    
    local window_start = now - window
    redis.call('ZREMRANGEBYSCORE', key, 0, window_start)
    local count = redis.call('ZCARD', key)
    
    if count < limit then
        redis.call('ZADD', key, now, now .. ':' .. math.random())
        redis.call('EXPIRE', key, window)
        return {1, limit - count - 1, math.ceil(now + window)}
    else
        local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
        local reset = oldest[2] and (oldest[2] + window) or (now + window)
        return {0, 0, math.ceil(reset)}
    end
    """
    
    TOKEN_BUCKET_SCRIPT = """
    local key = KEYS[1]
    local rate = tonumber(ARGV[1])
    local capacity = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    local requested = tonumber(ARGV[4])
    
    local data = redis.call('HMGET', key, 'tokens', 'last_update')
    local tokens = tonumber(data[1]) or capacity
    local last_update = tonumber(data[2]) or now
    
    local elapsed = now - last_update
    tokens = math.min(capacity, tokens + (elapsed * rate))
    
    local allowed = 0
    if tokens >= requested then
        tokens = tokens - requested
        allowed = 1
    end
    
    redis.call('HMSET', key, 'tokens', tokens, 'last_update', now)
    redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
    
    local reset = now + ((capacity - tokens) / rate)
    return {allowed, math.floor(tokens), math.ceil(reset)}
    """
    
    def __init__(
        self,
        redis_client: redis.Redis,
        default_limit: int = 100,
        default_window: int = 60,
        algorithm: RateLimitAlgorithm = RateLimitAlgorithm.SLIDING_WINDOW,
        failure_mode: str = 'open',
        local_fallback_limit: int = 10
    ):
        self.redis = redis_client
        self.default_limit = default_limit
        self.default_window = default_window
        self.algorithm = algorithm
        self.failure_mode = failure_mode
        self.local_fallback_limit = local_fallback_limit
        
        # Register Lua scripts
        self.sliding_window_script = self.redis.register_script(self.SLIDING_WINDOW_SCRIPT)
        self.token_bucket_script = self.redis.register_script(self.TOKEN_BUCKET_SCRIPT)
        
        # Local fallback
        self.local_counts: dict[str, list[float]] = {}
        self.redis_healthy = True
    
    def is_allowed(
        self,
        key: str,
        limit: Optional[int] = None,
        window: Optional[int] = None,
        cost: int = 1
    ) -> RateLimitResult:
        """
        Check if request is allowed under rate limit.
        """
        limit = limit or self.default_limit
        window = window or self.default_window
        
        try:
            if self.algorithm == RateLimitAlgorithm.SLIDING_WINDOW:
                result = self._sliding_window_check(key, limit, window)
            else:
                result = self._token_bucket_check(key, limit, window, cost)
            
            self.redis_healthy = True
            return result
            
        except redis.RedisError as e:
            self.redis_healthy = False
            return self._fallback_check(key, limit, window)
    
    def _sliding_window_check(
        self,
        key: str,
        limit: int,
        window: int
    ) -> RateLimitResult:
        now = time.time()
        result = self.sliding_window_script(
            keys=[f"ratelimit:sw:{key}"],
            args=[now, window, limit]
        )
        
        return RateLimitResult(
            allowed=bool(result[0]),
            limit=limit,
            remaining=result[1],
            reset=result[2],
            retry_after=None if result[0] else max(0, result[2] - int(now))
        )
    
    def _token_bucket_check(
        self,
        key: str,
        limit: int,
        window: int,
        cost: int
    ) -> RateLimitResult:
        now = time.time()
        rate = limit / window
        capacity = limit
        
        result = self.token_bucket_script(
            keys=[f"ratelimit:tb:{key}"],
            args=[rate, capacity, now, cost]
        )
        
        return RateLimitResult(
            allowed=bool(result[0]),
            limit=limit,
            remaining=result[1],
            reset=result[2],
            retry_after=None if result[0] else max(0, result[2] - int(now))
        )
    
    def _fallback_check(self, key: str, limit: int, window: int) -> RateLimitResult:
        """
        Local fallback when Redis is unavailable.
        """
        if self.failure_mode == 'closed':
            return RateLimitResult(
                allowed=False,
                limit=limit,
                remaining=0,
                reset=int(time.time()) + window,
                retry_after=window
            )
        
        # Fail open with local limiting
        now = time.time()
        window_start = now - window
        
        if key not in self.local_counts:
            self.local_counts[key] = []
        
        # Remove old entries
        self.local_counts[key] = [t for t in self.local_counts[key] if t > window_start]
        
        count = len(self.local_counts[key])
        local_limit = min(self.local_fallback_limit, limit)
        
        if count < local_limit:
            self.local_counts[key].append(now)
            return RateLimitResult(
                allowed=True,
                limit=local_limit,
                remaining=local_limit - count - 1,
                reset=int(now + window)
            )
        else:
            return RateLimitResult(
                allowed=False,
                limit=local_limit,
                remaining=0,
                reset=int(now + window),
                retry_after=window
            )


# FastAPI integration
from fastapi import Request
from fastapi.responses import JSONResponse

def rate_limit_middleware(limiter: ProductionRateLimiter):
    async def middleware(request: Request, call_next):
        # Extract key from request
        key = request.headers.get('X-API-Key') or request.client.host
        
        result = limiter.is_allowed(key)
        
        if not result.allowed:
            return JSONResponse(
                status_code=429,
                content={
                    'error': 'rate_limit_exceeded',
                    'retry_after': result.retry_after
                },
                headers={
                    'X-RateLimit-Limit': str(result.limit),
                    'X-RateLimit-Remaining': str(result.remaining),
                    'X-RateLimit-Reset': str(result.reset),
                    'Retry-After': str(result.retry_after)
                }
            )
        
        response = await call_next(request)
        response.headers['X-RateLimit-Limit'] = str(result.limit)
        response.headers['X-RateLimit-Remaining'] = str(result.remaining)
        response.headers['X-RateLimit-Reset'] = str(result.reset)
        
        return response
    
    return middleware
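One way to wire this middleware into an application (assumes a Redis instance on localhost; the host, port, and limits are up to your deployment):

from fastapi import FastAPI

app = FastAPI()
limiter = ProductionRateLimiter(
    redis.Redis(host="localhost", port=6379),
    default_limit=100,
    default_window=60,
)

# Register the rate limiting middleware for all HTTP requests
app.middleware("http")(rate_limit_middleware(limiter))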

A.2 Rate Limiter Testing Suite

import time
from unittest.mock import MagicMock

import fakeredis
import pytest
import redis

# Assumes the limiter from A.1 lives in a module named rate_limiter (adjust to your layout)
from rate_limiter import ProductionRateLimiter, RateLimitAlgorithm

class TestSlidingWindowLimiter:
    @pytest.fixture
    def limiter(self):
        redis_client = fakeredis.FakeRedis()
        return ProductionRateLimiter(
            redis_client=redis_client,
            default_limit=10,
            default_window=60,
            algorithm=RateLimitAlgorithm.SLIDING_WINDOW
        )
    
    def test_allows_under_limit(self, limiter):
        for i in range(10):
            result = limiter.is_allowed('user1')
            assert result.allowed, f"Request {i+1} should be allowed"
            assert result.remaining == 10 - i - 1
    
    def test_blocks_over_limit(self, limiter):
        for _ in range(10):
            limiter.is_allowed('user1')
        
        result = limiter.is_allowed('user1')
        assert not result.allowed
        assert result.remaining == 0
        assert result.retry_after > 0
    
    def test_separate_keys(self, limiter):
        for _ in range(10):
            limiter.is_allowed('user1')
        
        # Different key should still be allowed
        result = limiter.is_allowed('user2')
        assert result.allowed
    
    def test_window_reset(self, limiter):
        limiter.default_window = 1  # 1 second window
        
        for _ in range(10):
            limiter.is_allowed('user1')
        
        result = limiter.is_allowed('user1')
        assert not result.allowed
        
        time.sleep(1.1)
        
        result = limiter.is_allowed('user1')
        assert result.allowed


class TestFallbackBehavior:
    def test_fail_open(self):
        redis_client = MagicMock()
        redis_client.register_script.return_value = MagicMock(
            side_effect=redis.RedisError("Connection refused")
        )
        
        limiter = ProductionRateLimiter(
            redis_client=redis_client,
            failure_mode='open',
            local_fallback_limit=5
        )
        
        # Should use local fallback
        for i in range(5):
            result = limiter.is_allowed('user1')
            assert result.allowed
        
        # Local limit exceeded
        result = limiter.is_allowed('user1')
        assert not result.allowed
    
    def test_fail_closed(self):
        redis_client = MagicMock()
        redis_client.register_script.return_value = MagicMock(
            side_effect=redis.RedisError("Connection refused")
        )
        
        limiter = ProductionRateLimiter(
            redis_client=redis_client,
            failure_mode='closed'
        )
        
        result = limiter.is_allowed('user1')
        assert not result.allowed

End of Day 3: Rate Limiting at Scale

Tomorrow: Day 4 — Hot Keys and Skew. We'll redesign the URL shortener assuming 0.01% of URLs get 90% of traffic, and explore how companies like Instagram handle celebrity posts that get millions of interactions.