Week 4 — Day 2: Invalidation Strategies
System Design Mastery Series
Preface
Yesterday, you learned the four caching patterns. You can now put data into a cache efficiently.
Today, we tackle the harder problem: getting stale data out.
THE INCIDENT
Monday, 2:47 PM — Customer Support tickets spike
Ticket #4521: "Product shows $99, but I was charged $79"
Ticket #4522: "Price on page doesn't match cart"
Ticket #4523: "Flash sale price not showing"
Ticket #4524: "Wrong price displayed"
... 147 more tickets
Investigation:
10:00 AM — Marketing starts flash sale
10:00 AM — Database updated: price $99 → $79
10:00 AM — Cache still has: price $99
Cache TTL: 1 hour
10:00 - 11:00 AM — Users see $99, charged $79
11:00 AM — Cache expires, new price $79 appears
Result:
- 147 confused customers
- Support team overwhelmed
- Trust damaged
- "Why didn't the cache update?"
The cache did exactly what it was told.
The problem: nobody told it the price changed.
This is the cache invalidation problem — and it's notoriously hard.
Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. Today, you'll understand why invalidation is hard — and learn the strategies that actually work.
Part I: Foundations
Chapter 1: Why Invalidation Is Hard
1.1 The Core Problem
A cache is a copy of data. When the original changes, the copy becomes stale.
THE STALENESS PROBLEM
Source of Truth (Database):
T0: product.price = $100
T1: product.price = $79 (updated!)
T2: product.price = $79
T3: product.price = $79
Cache:
T0: product.price = $100 (correct)
T1: product.price = $100 (STALE!)
T2: product.price = $100 (STALE!)
T3: product.price = $100 (STALE!)
The cache doesn't know the database changed.
How do we tell it?
1.2 The Invalidation Trilemma
You can optimize for two of three properties:
THE INVALIDATION TRILEMMA
Freshness
/\
/ \
/ \
/ \
/ \
/__________\
Simplicity Performance
FRESHNESS: Data in cache matches database
SIMPLICITY: Easy to implement and maintain
PERFORMANCE: Low latency, high throughput
Trade-offs:
TTL-based:
✓ Simple
✓ Good performance
✗ Not always fresh (stale until TTL expires)
Event-driven:
✓ Fresh (near real-time)
✗ Complex (need event infrastructure)
✓ Good performance
Write-through:
✓ Always fresh
✗ Complex (coupling)
✗ Slower writes (must update cache)
1.3 When Staleness Matters
Not all data needs the same freshness:
| Data Type | Staleness Tolerance | Why |
|---|---|---|
| Stock prices | < 1 second | Trading decisions |
| Inventory count | < 30 seconds | Overselling risk |
| Product price | < 5 minutes | Customer trust |
| Product description | Hours | Rarely changes |
| User profile | Minutes to hours | Low impact |
| Static content | Days | Never changes |
DESIGN PRINCIPLE
Match invalidation strategy to staleness tolerance.
Don't use the same strategy for everything.
Product descriptions don't need real-time invalidation.
Inventory counts probably do.
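As a tiny illustration of this principle (the budgets below simply restate the table above in code; Chapter 3 turns the idea into full cache policies):
# Staleness budgets in seconds, per data type (illustrative values only)
STALENESS_BUDGET_SECONDS = {
    "stock_price": 1,
    "inventory_count": 30,
    "product_price": 5 * 60,
    "product_description": 4 * 3600,
}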
1.4 Key Terminology
| Term | Definition |
|---|---|
| Stale data | Cached data that differs from source of truth |
| TTL (Time-To-Live) | Duration before cache entry expires |
| Invalidation | Removing or updating stale cache entries |
| Cache miss | Data not in cache, must fetch from source |
| Write-invalidate | Delete cache entry when data changes |
| Write-update | Update cache entry when data changes |
| Tombstone | Marker indicating deleted data |
Chapter 2: Invalidation Strategies
2.1 Strategy 1: Time-To-Live (TTL)
The simplest strategy: cache entries automatically expire after a set duration.
TTL-BASED INVALIDATION
Timeline:
T0: Cache MISS, load from DB, cache with TTL=300s
T0-T300: Cache HIT (even if DB changes!)
T300: Cache entry expires
T301: Cache MISS, reload from DB
How it works:
cache.setex("product:123", 300, product_data)
↑
TTL in seconds
After 300 seconds, key automatically deleted.
Next request triggers cache miss and reload.
Implementation:
# TTL-Based Invalidation
import json
class TTLCache:
    """
    Simple TTL-based caching.
    Pros: Simple, self-healing
    Cons: Stale for up to TTL duration
    """
    def __init__(self, redis_client, db, default_ttl: int = 300):
        self.redis = redis_client
        self.db = db  # Database client used on cache misses
        self.default_ttl = default_ttl
async def get_product(self, product_id: str) -> dict:
cache_key = f"product:{product_id}"
# Check cache
cached = await self.redis.get(cache_key)
if cached:
return json.loads(cached)
# Miss: load from database
product = await self.db.fetch_one(
"SELECT * FROM products WHERE id = $1",
product_id
)
# Cache with TTL
await self.redis.setex(
cache_key,
self.default_ttl, # Expires in 300 seconds
json.dumps(dict(product))
)
return dict(product)
Choosing TTL Values:
TTL SELECTION GUIDE
Question: "How stale is acceptable?"
Very fresh needed (seconds):
- Inventory counts during sales
- Real-time dashboards
- Session data (sliding expiration)
TTL: 10-60 seconds
Moderate freshness (minutes):
- Product prices
- User preferences
- Search results
TTL: 1-15 minutes
Low freshness OK (hours):
- Product descriptions
- Category listings
- User profiles (non-critical)
TTL: 1-24 hours
Static content (days):
- Images, CSS, JS
- Configuration that rarely changes
- Historical data
TTL: 1-30 days
FORMULA FOR TTL:
TTL = (acceptable_staleness) × (safety_factor)
Example:
"Users can tolerate 5 minutes of stale prices"
TTL = 5 minutes × 0.8 = 4 minutes
Safety factor accounts for clock skew, network delays
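As a tiny worked example of this formula (choose_ttl is just an illustrative helper, not part of any library):
def choose_ttl(acceptable_staleness_seconds: int, safety_factor: float = 0.8) -> int:
    """TTL = acceptable staleness x safety factor, rounded down to whole seconds."""
    return int(acceptable_staleness_seconds * safety_factor)

# "Users can tolerate 5 minutes of stale prices"
ttl = choose_ttl(300)   # -> 240 seconds (4 minutes)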
Pros:
- Extremely simple to implement
- Self-healing (stale data eventually expires)
- No external dependencies
- Predictable behavior
Cons:
- Data is stale for up to TTL duration
- No control over when refresh happens
- Short TTL = more cache misses = more DB load
- Long TTL = longer staleness
2.2 Strategy 2: Event-Driven Invalidation
When data changes, publish an event. Consumers invalidate relevant cache entries.
EVENT-DRIVEN INVALIDATION
Write Path:
1. Update database
2. Publish "product.updated" event
3. Cache invalidator consumes event
4. Delete/update cache entry
Timeline:
T0: DB update + publish event
T0.1: Event consumed
T0.2: Cache invalidated
T0.3: Next read gets fresh data
Latency: ~100-500ms (near real-time)
┌─────────────────┐
│ Product │
Update ──────────▶│ Service │
│ │
└────────┬────────┘
│
┌──────────────┴──────────────┐
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Database │ │ Kafka │
│ (write) │ │ (event) │
└─────────────┘ └──────┬──────┘
│
▼
┌─────────────┐
│ Cache │
│ Invalidator │
└──────┬──────┘
│
▼
┌─────────────┐
│ Redis │
│ (delete) │
└─────────────┘
Implementation:
# Event-Driven Invalidation
import json
import logging
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
logger = logging.getLogger(__name__)
@dataclass
class ProductUpdatedEvent:
"""Event published when a product changes."""
product_id: str
changed_fields: List[str]
timestamp: str
source: str # Which service made the change
class ProductService:
"""
Product service that publishes events on changes.
"""
def __init__(self, db, kafka_producer, cache):
self.db = db
self.kafka = kafka_producer
self.cache = cache
async def update_product(self, product_id: str, updates: dict) -> dict:
"""Update product and publish event for cache invalidation."""
# Update database
product = await self.db.fetch_one(
"""
UPDATE products
SET name = COALESCE($2, name),
price = COALESCE($3, price),
updated_at = NOW()
WHERE id = $1
RETURNING *
""",
product_id, updates.get('name'), updates.get('price')
)
# Publish event for cache invalidation
event = ProductUpdatedEvent(
product_id=product_id,
changed_fields=list(updates.keys()),
timestamp=datetime.utcnow().isoformat(),
source="product-service"
)
await self.kafka.send(
"product-events",
key=product_id.encode(),
value=json.dumps(event.__dict__).encode()
)
return dict(product)
class CacheInvalidationConsumer:
"""
Consumes product events and invalidates cache.
Runs as a separate service/process.
"""
def __init__(self, kafka_consumer, redis_client):
self.kafka = kafka_consumer
self.redis = redis_client
async def run(self):
"""Main consumer loop."""
async for message in self.kafka:
try:
event = json.loads(message.value.decode())
await self.handle_event(event)
await self.kafka.commit()
except Exception as e:
logger.error(f"Failed to process event: {e}")
# Send to DLQ (Week 3, Day 4)
async def handle_event(self, event: dict):
"""Handle a product event."""
event_type = event.get('type', 'product.updated')
product_id = event['product_id']
if event_type in ('product.updated', 'product.deleted'):
# Invalidate product cache
await self.redis.delete(f"product:{product_id}")
# Also invalidate related caches
await self.redis.delete(f"product_page:{product_id}")
await self.redis.delete(f"category_products:{event.get('category_id')}")
logger.info(f"Invalidated cache for product {product_id}")
        elif event_type == 'product.price_changed':
            # Price change might affect more things
            await self.redis.delete(f"product:{product_id}")
            await self.redis.delete("deals_page")  # Might appear in deals
            # Redis DEL does not accept wildcards: clearing search caches needs a
            # SCAN-based pattern delete (see Chapter 4) or explicit key tracking
            await delete_by_pattern("search_results:*")  # Invalidate search caches
Handling Related Caches:
# Products appear in multiple caches - invalidate all of them
CACHE_DEPENDENCIES = {
"product": [
"product:{id}", # Direct product cache
"product_page:{id}", # Rendered product page
"category:{category_id}", # Category listing
"search:*", # Search results (pattern)
],
"inventory": [
"product:{id}", # Product includes stock info
"product_page:{id}",
"availability:{id}",
],
"price": [
"product:{id}",
"product_page:{id}",
"deals_page", # If product is on sale
"cart:{user_id}", # User's cart might show price
]
}
async def invalidate_product(product_id: str, change_type: str, metadata: dict):
"""Invalidate all caches affected by a product change."""
patterns = CACHE_DEPENDENCIES.get(change_type, ["product:{id}"])
for pattern in patterns:
# Replace placeholders
key = pattern.format(
id=product_id,
category_id=metadata.get('category_id', '*'),
user_id='*' # Wildcard for user-specific caches
)
if '*' in key:
# Pattern delete (use with caution - expensive!)
await delete_by_pattern(key)
else:
await redis.delete(key)
Pros:
- Near real-time freshness
- Precise control over what gets invalidated
- Efficient (only invalidate what changed)
- Scales well with proper event infrastructure
Cons:
- Requires event infrastructure (Kafka, etc.)
- More complex to implement
- Event delivery isn't guaranteed (need DLQ)
- Coupling between services
2.3 Strategy 3: Versioned Keys
Include a version in the cache key. When data changes, increment the version.
VERSIONED KEYS
Instead of:
cache.set("product:123", data)
Use:
cache.set("product:123:v5", data)
When product changes:
- Increment version: v5 → v6
- New key: "product:123:v6"
- Old key "product:123:v5" still exists but never accessed
- Old key expires via TTL
Timeline:
T0: Version = 5, cache key = "product:123:v5"
T1: Product updated, version = 6
T2: Read uses key "product:123:v6" - MISS (new key)
T3: Load from DB, cache as "product:123:v6"
T4: Old "product:123:v5" expires naturally
Implementation:
# Versioned Key Invalidation
class VersionedCache:
"""
Cache with versioned keys.
Version is stored separately and incremented on updates.
Old versions expire naturally via TTL.
"""
def __init__(self, redis_client, default_ttl: int = 3600):
self.redis = redis_client
self.default_ttl = default_ttl
async def get_version(self, entity: str, entity_id: str) -> int:
"""Get current version for an entity."""
version_key = f"version:{entity}:{entity_id}"
version = await self.redis.get(version_key)
return int(version) if version else 1
async def increment_version(self, entity: str, entity_id: str) -> int:
"""Increment version (effectively invalidates cache)."""
version_key = f"version:{entity}:{entity_id}"
new_version = await self.redis.incr(version_key)
return new_version
def _build_key(self, entity: str, entity_id: str, version: int) -> str:
"""Build versioned cache key."""
return f"{entity}:{entity_id}:v{version}"
async def get(self, entity: str, entity_id: str) -> Optional[dict]:
"""Get cached value using current version."""
version = await self.get_version(entity, entity_id)
cache_key = self._build_key(entity, entity_id, version)
cached = await self.redis.get(cache_key)
if cached:
return json.loads(cached)
return None
async def set(self, entity: str, entity_id: str, value: dict) -> None:
"""Set cached value with current version."""
version = await self.get_version(entity, entity_id)
cache_key = self._build_key(entity, entity_id, version)
await self.redis.setex(
cache_key,
self.default_ttl,
json.dumps(value)
)
async def invalidate(self, entity: str, entity_id: str) -> int:
"""Invalidate by incrementing version."""
return await self.increment_version(entity, entity_id)
# Usage
class ProductRepository:
def __init__(self, db, cache: VersionedCache):
self.db = db
self.cache = cache
async def get_product(self, product_id: str) -> dict:
# Try cache with current version
cached = await self.cache.get("product", product_id)
if cached:
return cached
# Load from database
product = await self.db.fetch_one(
"SELECT * FROM products WHERE id = $1",
product_id
)
# Cache with current version
await self.cache.set("product", product_id, dict(product))
return dict(product)
async def update_product(self, product_id: str, data: dict) -> dict:
# Update database
product = await self.db.fetch_one(
"UPDATE products SET ... WHERE id = $1 RETURNING *",
product_id
)
# Invalidate by incrementing version
# Next read will miss (new version key doesn't exist)
await self.cache.invalidate("product", product_id)
return dict(product)
Global Version for Bulk Invalidation:
# Invalidate ALL products at once (e.g., after bulk import)
class GlobalVersionCache(VersionedCache):
    """Cache with global version for bulk invalidation (extends VersionedCache)."""
async def get_global_version(self, entity: str) -> int:
"""Global version affects all entities of this type."""
version = await self.redis.get(f"global_version:{entity}")
return int(version) if version else 1
async def increment_global_version(self, entity: str) -> int:
"""Invalidate ALL cached entities of this type."""
return await self.redis.incr(f"global_version:{entity}")
def _build_key(self, entity: str, entity_id: str,
entity_version: int, global_version: int) -> str:
"""Key includes both entity and global version."""
return f"{entity}:{entity_id}:v{entity_version}:g{global_version}"
# Usage: After bulk product import
await cache.increment_global_version("product")
# All product cache keys now miss (global version changed)
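To complete the picture, the read path combines both versions when building the key. A sketch of a method you could add to GlobalVersionCache (it assumes the per-entity get_version helper is available, for example by extending VersionedCache as above):
    async def get(self, entity: str, entity_id: str) -> Optional[dict]:
        """Read using both the per-entity and the global version."""
        entity_version = await self.get_version(entity, entity_id)
        global_version = await self.get_global_version(entity)
        cache_key = self._build_key(entity, entity_id, entity_version, global_version)
        cached = await self.redis.get(cache_key)
        return json.loads(cached) if cached else None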
Pros:
- No explicit deletion needed
- Atomic invalidation (version increment is atomic)
- Supports bulk invalidation via global version
- Old data expires naturally
Cons:
- More complex key management
- Storage overhead (old versions remain until TTL)
- Need to track version numbers
- Extra Redis call to get version
2.4 Strategy 4: Write-Invalidate vs Write-Update
When data changes, should you delete or update the cache?
WRITE-INVALIDATE (Delete)
On write:
1. Update database
2. DELETE cache entry
Next read:
3. Cache miss
4. Load from database
5. Populate cache
WRITE-UPDATE (Update)
On write:
1. Update database
2. SET cache entry with new value
Next read:
3. Cache hit (already updated)
Which is better?
The Race Condition Problem with Write-Update:
RACE CONDITION WITH WRITE-UPDATE
Two concurrent requests update the same product:
Request A (price = $90):
T1: Read from DB (price = $100)
T2: Update DB to $90
T4: Update cache to $90 ← Happens LATER
Request B (price = $80):
T1.5: Read from DB (price = $100)
T3: Update DB to $80
T3.5: Update cache to $80 ← Happens FIRST
Timeline:
T1: A reads DB ($100)
T1.5: B reads DB ($100)
T2: A writes DB ($90)
T3: B writes DB ($80) ← DB now has $80 (correct)
T3.5: B updates cache ($80)
T4: A updates cache ($90) ← Cache now has $90 (WRONG!)
Result:
Database: $80 (correct - last write wins)
Cache: $90 (STALE - from earlier update)
WITH WRITE-INVALIDATE:
T3: B writes DB ($80)
T3.5: B deletes cache
T4: A deletes cache ← Idempotent, no problem
Result:
Database: $80
Cache: empty (next read will load fresh)
No race condition!
Implementation Comparison:
# Write-Invalidate (Recommended)
async def update_product_invalidate(product_id: str, data: dict):
"""Update database, then DELETE cache."""
# 1. Update database
product = await db.fetch_one(
"UPDATE products SET price = $2 WHERE id = $1 RETURNING *",
product_id, data['price']
)
# 2. Delete cache (invalidate)
await cache.delete(f"product:{product_id}")
# Next read will miss and load fresh data
return dict(product)
# Write-Update (Use with caution)
async def update_product_update(product_id: str, data: dict):
"""Update database, then UPDATE cache."""
# 1. Update database
product = await db.fetch_one(
"UPDATE products SET price = $2 WHERE id = $1 RETURNING *",
product_id, data['price']
)
# 2. Update cache with new value
# RISK: Race condition if concurrent updates!
await cache.set(f"product:{product_id}", dict(product), ttl=300)
return dict(product)
When Write-Update Is Safe:
# Write-Update is safe when:
# 1. Single writer (no concurrent updates)
# 2. Using atomic operations
# 3. Data doesn't have complex dependencies
# Example: Incrementing a counter (atomic operation)
async def increment_view_count(product_id: str):
"""Atomic increment - safe for concurrent updates."""
# Increment in both places atomically
await db.execute(
"UPDATE products SET view_count = view_count + 1 WHERE id = $1",
product_id
)
# INCR is atomic in Redis
await cache.incr(f"product:{product_id}:views")
Recommendation:
DEFAULT: Write-Invalidate (delete cache on update)
Use Write-Update only when:
✓ Single writer per key
✓ Atomic operations (INCR, HINCRBY)
✓ Performance critical AND race risk is low
✓ You've carefully analyzed the race conditions
Chapter 3: Hybrid Strategies
3.1 TTL + Event-Driven (The Safety Net Pattern)
Combine TTL with event-driven for best of both worlds.
SAFETY NET PATTERN
Primary: Event-driven invalidation (fast, precise)
Fallback: Short TTL (catches missed events)
Event published → Cache invalidated in ~100ms
Event lost → Cache expires in 5 minutes anyway
You get:
- Near real-time freshness (99% of the time)
- Guaranteed maximum staleness (TTL)
- Self-healing (even if events fail)
Implementation:
# Safety Net Pattern: Event-driven + TTL
import json
from datetime import datetime
from typing import Any
class SafetyNetCache:
    """
    Event-driven invalidation with TTL safety net.
    - Events provide fast invalidation
    - TTL ensures stale data eventually expires
    - Best of both worlds
    """
    def __init__(self, redis_client, kafka_producer, db, safety_ttl: int = 300):
        self.redis = redis_client
        self.kafka = kafka_producer
        self.db = db  # Used by update_product below
        self.safety_ttl = safety_ttl  # Maximum staleness
    async def set(self, key: str, value: Any) -> None:
"""Cache with safety TTL."""
await self.redis.setex(
key,
self.safety_ttl, # Even without invalidation, expires in 5 min
json.dumps(value)
)
async def invalidate(self, key: str) -> None:
"""Explicit invalidation via event."""
await self.redis.delete(key)
async def update_product(self, product_id: str, data: dict):
"""Update with event-driven invalidation."""
# Update database
product = await self.db.fetch_one(
"UPDATE products SET ... WHERE id = $1 RETURNING *",
product_id
)
# Publish invalidation event
await self.kafka.send("cache-invalidation", {
"type": "invalidate",
"keys": [f"product:{product_id}"],
"timestamp": datetime.utcnow().isoformat()
})
# Even if event is lost, TTL ensures max 5 min staleness
return dict(product)
# Invalidation consumer
async def handle_invalidation_event(event: dict):
"""Process invalidation events."""
for key in event.get("keys", []):
await redis.delete(key)
logger.info(f"Invalidated {key}")
3.2 Different Strategies for Different Data
MULTI-STRATEGY APPROACH
Product Catalog:
├── product.name, product.description
│ └── TTL: 1 hour (rarely changes)
│
├── product.price
│ └── Event-driven + TTL 5 min (changes occasionally, freshness matters)
│
├── product.inventory
│ └── Event-driven + TTL 30 sec (changes frequently, accuracy critical)
│
└── product.images
└── TTL: 24 hours (almost never changes)
Implementation:
# Multi-Strategy Cache Configuration
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional, List
class InvalidationStrategy(Enum):
TTL_ONLY = "ttl_only"
EVENT_DRIVEN = "event_driven"
VERSIONED = "versioned"
WRITE_THROUGH = "write_through"
@dataclass
class CachePolicy:
"""Defines caching behavior for a data type."""
strategy: InvalidationStrategy
ttl: int
event_topics: List[str] = None
version_key: str = None
# Define policies for different data types
CACHE_POLICIES = {
"product_details": CachePolicy(
strategy=InvalidationStrategy.TTL_ONLY,
ttl=3600 # 1 hour - rarely changes
),
"product_price": CachePolicy(
strategy=InvalidationStrategy.EVENT_DRIVEN,
ttl=300, # 5 min safety net
event_topics=["price-changes", "flash-sales"]
),
"product_inventory": CachePolicy(
strategy=InvalidationStrategy.EVENT_DRIVEN,
ttl=30, # 30 sec safety net - changes frequently
event_topics=["inventory-updates"]
),
"user_session": CachePolicy(
strategy=InvalidationStrategy.TTL_ONLY,
ttl=1800 # 30 min sliding expiration
),
"homepage_content": CachePolicy(
strategy=InvalidationStrategy.VERSIONED,
ttl=86400, # 24 hours
version_key="homepage_version"
),
}
class SmartCache:
"""
Cache that applies different strategies based on data type.
"""
def __init__(self, redis_client, policies: dict = CACHE_POLICIES):
self.redis = redis_client
self.policies = policies
def get_policy(self, data_type: str) -> CachePolicy:
"""Get caching policy for data type."""
return self.policies.get(data_type, CachePolicy(
strategy=InvalidationStrategy.TTL_ONLY,
ttl=300 # Default 5 min
))
async def get(self, data_type: str, key: str) -> Optional[dict]:
"""Get from cache using appropriate strategy."""
policy = self.get_policy(data_type)
if policy.strategy == InvalidationStrategy.VERSIONED:
return await self._get_versioned(data_type, key, policy)
else:
return await self._get_simple(data_type, key)
async def set(self, data_type: str, key: str, value: dict) -> None:
"""Set in cache using appropriate strategy."""
policy = self.get_policy(data_type)
if policy.strategy == InvalidationStrategy.VERSIONED:
await self._set_versioned(data_type, key, value, policy)
else:
await self._set_simple(data_type, key, value, policy.ttl)
async def _get_simple(self, data_type: str, key: str) -> Optional[dict]:
cache_key = f"{data_type}:{key}"
cached = await self.redis.get(cache_key)
return json.loads(cached) if cached else None
async def _set_simple(self, data_type: str, key: str, value: dict, ttl: int):
cache_key = f"{data_type}:{key}"
await self.redis.setex(cache_key, ttl, json.dumps(value))
async def _get_versioned(self, data_type: str, key: str, policy: CachePolicy):
version = await self.redis.get(policy.version_key) or "1"
cache_key = f"{data_type}:{key}:v{version}"
cached = await self.redis.get(cache_key)
return json.loads(cached) if cached else None
async def _set_versioned(self, data_type: str, key: str, value: dict, policy: CachePolicy):
version = await self.redis.get(policy.version_key) or "1"
cache_key = f"{data_type}:{key}:v{version}"
await self.redis.setex(cache_key, policy.ttl, json.dumps(value))
3.3 Strategy Comparison
| Strategy | Freshness | Complexity | Use Case |
|---|---|---|---|
| TTL only | Minutes | Low | Static content, tolerant reads |
| Event-driven | Seconds | High | Prices, inventory, critical data |
| Versioned | Seconds | Medium | Bulk updates, deployments |
| TTL + Events | Seconds | Medium-High | Best balance for most systems |
| Write-through | Immediate | Medium | Critical single-writer data |
Part II: Implementation
Chapter 4: Building an Invalidation System
4.1 Requirements
A production invalidation system needs:
- Multiple strategies — Different data needs different freshness
- Event handling — Process invalidation events reliably
- Bulk operations — Invalidate many keys efficiently
- Monitoring — Track invalidation success/failure
- Failure handling — Don't lose invalidations
4.2 Production Implementation
# Production Cache Invalidation System
import asyncio
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Set
from enum import Enum
logger = logging.getLogger(__name__)
# =============================================================================
# Configuration
# =============================================================================
@dataclass
class InvalidationConfig:
"""Configuration for invalidation system."""
# Event processing
event_topic: str = "cache-invalidation"
consumer_group: str = "cache-invalidator"
batch_size: int = 100
# Safety
max_keys_per_invalidation: int = 1000
enable_pattern_delete: bool = False # Dangerous at scale
# Monitoring
enable_metrics: bool = True
slow_invalidation_threshold_ms: float = 100
# =============================================================================
# Invalidation Events
# =============================================================================
class InvalidationType(Enum):
DELETE = "delete" # Delete specific keys
PATTERN = "pattern" # Delete keys matching pattern
VERSION = "version" # Increment version
REFRESH = "refresh" # Delete and pre-warm
@dataclass
class InvalidationEvent:
"""Event requesting cache invalidation."""
type: InvalidationType
keys: List[str] = field(default_factory=list)
pattern: Optional[str] = None
version_key: Optional[str] = None
source: str = "unknown"
timestamp: datetime = field(default_factory=datetime.utcnow)
correlation_id: Optional[str] = None
def to_dict(self) -> dict:
return {
"type": self.type.value,
"keys": self.keys,
"pattern": self.pattern,
"version_key": self.version_key,
"source": self.source,
"timestamp": self.timestamp.isoformat(),
"correlation_id": self.correlation_id
}
@classmethod
def from_dict(cls, data: dict) -> 'InvalidationEvent':
return cls(
type=InvalidationType(data["type"]),
keys=data.get("keys", []),
pattern=data.get("pattern"),
version_key=data.get("version_key"),
source=data.get("source", "unknown"),
timestamp=datetime.fromisoformat(data["timestamp"]),
correlation_id=data.get("correlation_id")
)
# =============================================================================
# Invalidation Publisher (Use in application services)
# =============================================================================
class InvalidationPublisher:
"""
Publishes cache invalidation events.
Use this in your application services when data changes.
"""
def __init__(self, kafka_producer, config: InvalidationConfig):
self.kafka = kafka_producer
self.config = config
async def invalidate_keys(
self,
keys: List[str],
source: str = "application",
correlation_id: str = None
) -> None:
"""Invalidate specific cache keys."""
if len(keys) > self.config.max_keys_per_invalidation:
# Split into batches
for i in range(0, len(keys), self.config.max_keys_per_invalidation):
batch = keys[i:i + self.config.max_keys_per_invalidation]
await self._publish(InvalidationEvent(
type=InvalidationType.DELETE,
keys=batch,
source=source,
correlation_id=correlation_id
))
else:
await self._publish(InvalidationEvent(
type=InvalidationType.DELETE,
keys=keys,
source=source,
correlation_id=correlation_id
))
async def invalidate_pattern(
self,
pattern: str,
source: str = "application"
) -> None:
"""Invalidate keys matching pattern (use sparingly!)."""
if not self.config.enable_pattern_delete:
raise ValueError("Pattern delete is disabled in config")
await self._publish(InvalidationEvent(
type=InvalidationType.PATTERN,
pattern=pattern,
source=source
))
async def increment_version(
self,
version_key: str,
source: str = "application"
) -> None:
"""Increment a version key for versioned invalidation."""
await self._publish(InvalidationEvent(
type=InvalidationType.VERSION,
version_key=version_key,
source=source
))
async def _publish(self, event: InvalidationEvent) -> None:
"""Publish event to Kafka."""
await self.kafka.send(
self.config.event_topic,
value=json.dumps(event.to_dict()).encode()
)
logger.debug(f"Published invalidation event: {event.type.value}")
# =============================================================================
# Invalidation Consumer (Background service)
# =============================================================================
class InvalidationConsumer:
"""
Consumes and processes cache invalidation events.
Run this as a separate service or background worker.
"""
def __init__(
self,
kafka_consumer,
redis_client,
config: InvalidationConfig,
metrics_client = None
):
self.kafka = kafka_consumer
self.redis = redis_client
self.config = config
self.metrics = metrics_client
self._running = False
async def start(self):
"""Start consuming invalidation events."""
self._running = True
logger.info("Invalidation consumer started")
batch = []
async for message in self.kafka:
if not self._running:
break
try:
event = InvalidationEvent.from_dict(
json.loads(message.value.decode())
)
batch.append(event)
# Process in batches for efficiency
if len(batch) >= self.config.batch_size:
await self._process_batch(batch)
batch = []
await self.kafka.commit()
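                # NOTE: a production consumer should also flush partial batches on a
                # timer; otherwise trailing events can sit unprocessed (and uncommitted)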
except Exception as e:
logger.error(f"Failed to process event: {e}")
# Individual event failure - continue processing
async def stop(self):
"""Stop the consumer."""
self._running = False
logger.info("Invalidation consumer stopped")
async def _process_batch(self, events: List[InvalidationEvent]) -> None:
"""Process a batch of invalidation events."""
start_time = datetime.utcnow()
# Group by type for efficient processing
delete_keys: Set[str] = set()
patterns: List[str] = []
version_keys: List[str] = []
for event in events:
if event.type == InvalidationType.DELETE:
delete_keys.update(event.keys)
elif event.type == InvalidationType.PATTERN:
patterns.append(event.pattern)
elif event.type == InvalidationType.VERSION:
version_keys.append(event.version_key)
# Process deletions in bulk
if delete_keys:
await self._delete_keys(list(delete_keys))
# Process patterns (one at a time - expensive!)
for pattern in patterns:
await self._delete_pattern(pattern)
# Process version increments
for version_key in version_keys:
await self._increment_version(version_key)
# Record metrics
duration_ms = (datetime.utcnow() - start_time).total_seconds() * 1000
if self.metrics:
self.metrics.timing("invalidation.batch_duration_ms", duration_ms)
self.metrics.increment("invalidation.keys_deleted", len(delete_keys))
self.metrics.increment("invalidation.events_processed", len(events))
if duration_ms > self.config.slow_invalidation_threshold_ms:
logger.warning(f"Slow invalidation batch: {duration_ms:.2f}ms")
async def _delete_keys(self, keys: List[str]) -> int:
"""Delete multiple keys efficiently."""
if not keys:
return 0
# Use pipeline for bulk delete
pipe = self.redis.pipeline()
for key in keys:
pipe.delete(key)
results = await pipe.execute()
deleted = sum(1 for r in results if r)
logger.debug(f"Deleted {deleted}/{len(keys)} keys")
return deleted
async def _delete_pattern(self, pattern: str) -> int:
"""Delete keys matching pattern (use sparingly!)."""
# SCAN to find matching keys (non-blocking)
deleted = 0
cursor = 0
while True:
cursor, keys = await self.redis.scan(
cursor=cursor,
match=pattern,
count=100
)
if keys:
await self.redis.delete(*keys)
deleted += len(keys)
if cursor == 0:
break
logger.info(f"Pattern delete '{pattern}' removed {deleted} keys")
return deleted
async def _increment_version(self, version_key: str) -> int:
"""Increment a version key."""
new_version = await self.redis.incr(version_key)
logger.debug(f"Incremented {version_key} to {new_version}")
return new_version
# =============================================================================
# Helper: Invalidation Decorator
# =============================================================================
def invalidate_cache(*keys: str, publisher: InvalidationPublisher = None):
"""
Decorator that invalidates cache after function execution.
Usage:
@invalidate_cache("product:{product_id}", "category:{category_id}")
async def update_product(product_id: str, category_id: str, data: dict):
...
"""
def decorator(func):
async def wrapper(*args, **kwargs):
result = await func(*args, **kwargs)
# Format keys with function arguments
formatted_keys = []
for key_template in keys:
try:
formatted_key = key_template.format(**kwargs)
formatted_keys.append(formatted_key)
except KeyError:
logger.warning(f"Could not format key template: {key_template}")
if formatted_keys and publisher:
await publisher.invalidate_keys(
formatted_keys,
source=func.__name__
)
return result
return wrapper
return decorator
# =============================================================================
# Usage Example
# =============================================================================
# In your product service:
class ProductService:
def __init__(self, db, cache, invalidation: InvalidationPublisher):
self.db = db
self.cache = cache
self.invalidation = invalidation
async def update_product(self, product_id: str, data: dict) -> dict:
# Get old product for category info
old_product = await self.db.fetch_one(
"SELECT category_id FROM products WHERE id = $1",
product_id
)
# Update database
product = await self.db.fetch_one(
"UPDATE products SET ... WHERE id = $1 RETURNING *",
product_id
)
# Invalidate all related caches
await self.invalidation.invalidate_keys([
f"product:{product_id}",
f"product_page:{product_id}",
f"category:{old_product['category_id']}",
], source="ProductService.update_product")
return dict(product)
async def bulk_import(self, products: List[dict]) -> int:
"""Import many products - use version increment."""
# Import to database
count = await self._bulk_insert(products)
# Invalidate all product caches via version
await self.invalidation.increment_version("product_cache_version")
return count
Chapter 5: Edge Cases and Error Handling
5.1 Edge Case 1: Invalidation Event Lost
SCENARIO: Kafka loses the invalidation event
Timeline:
T0: Product price updated in DB ($100 → $79)
T1: Invalidation event published to Kafka
T2: Kafka broker fails, event lost!
T3: Cache still has old price ($100)
T4: Users see wrong price until TTL expires
SOLUTION: Safety net TTL
Even with event-driven invalidation, always set a TTL:
- Events handle 99% of invalidations (fast)
- TTL handles the 1% where events fail (safety)
Typical staleness ≈ event latency; worst-case staleness = TTL (when the event is lost)
5.2 Edge Case 2: Thundering Herd After Invalidation
SCENARIO: Popular item cache invalidated during peak traffic
Timeline:
T0: Flash sale starts, price updated
T1: Cache invalidated for product:123
T2: 10,000 concurrent requests for product:123
T3: All 10,000 hit database simultaneously!
T4: Database overwhelmed
SOLUTION: Request coalescing (covered in Day 3)
For now, awareness is key:
- Invalidation can trigger thundering herd
- Popular items need extra protection
- Consider "soft" invalidation (mark stale, refresh async)
5.3 Edge Case 3: Cache and Database Inconsistency
SCENARIO: Invalidation happens before DB transaction commits
Dangerous order:
T0: Begin transaction
T1: Update product in DB (not committed)
T2: Invalidate cache
T3: Another request reads DB (sees OLD value)
T4: Populates cache with OLD value
T5: Transaction commits (DB now has NEW value)
T6: Cache has OLD value, DB has NEW value!
SOLUTION: Invalidate AFTER commit
async def update_product(product_id: str, data: dict):
async with db.transaction() as tx:
product = await tx.fetch_one(
"UPDATE products SET ... WHERE id = $1 RETURNING *",
product_id
)
# Transaction committed!
# NOW invalidate cache
await cache.delete(f"product:{product_id}")
return dict(product)
5.4 Edge Case 4: Partial Invalidation Failure
SCENARIO: Need to invalidate 5 keys, 2 fail
Keys to invalidate:
- product:123 ✓ Success
- product_page:123 ✓ Success
- category:456 ✗ Redis timeout
- search:results ✗ Redis timeout
- deals:page ✓ Success
SOLUTION: Retry with idempotency
Invalidation is idempotent - safe to retry!
DEL on a non-existent key = success
async def invalidate_with_retry(keys: List[str], max_retries: int = 3):
remaining = set(keys)
for attempt in range(max_retries):
failed = []
for key in remaining:
try:
await redis.delete(key)
except Exception as e:
failed.append(key)
logger.warning(f"Invalidation failed for {key}: {e}")
remaining = set(failed)
if not remaining:
return # All succeeded
# Exponential backoff
await asyncio.sleep(0.1 * (2 ** attempt))
# Some still failed - alert!
logger.error(f"Failed to invalidate keys after {max_retries} attempts: {remaining}")
# Consider: DLQ, manual intervention alert
5.5 Error Handling Matrix
| Error | Impact | Handling | Prevention |
|---|---|---|---|
| Event lost | Stale data until TTL | TTL safety net | Kafka replication |
| Redis timeout | Invalidation delayed | Retry with backoff | Redis cluster |
| Pattern delete slow | Blocks other invalidations | Avoid patterns, batch instead | Don't use patterns |
| Version key lost | Wrong version used | Rebuild from DB | Persist versions |
| Consumer crash | Events pile up | Auto-restart, resume from offset | Health checks |
Part III: Real-World Application
Chapter 7: How Big Tech Does It
7.1 Case Study: Facebook — Invalidation at Scale
FACEBOOK'S CACHE INVALIDATION
Scale:
- Billions of cache operations per second
- Distributed across multiple datacenters
- Consistency across regions matters
System: McRouter + TAO
Architecture:
Write Request
│
▼
┌─────────────────┐
│ TAO (Graph │
│ Database) │
│ │
│ Write to MySQL │
│ + Invalidate │
└────────┬────────┘
│
│ Invalidation
▼
┌─────────────────┐
│ McRouter │
│ (Memcached │
│ Router) │
│ │
│ Broadcasts to │
│ all regions │
└─────────────────┘
Key Innovation: LEASE-BASED INVALIDATION
Problem: Race between invalidation and cache fill
Solution: Leases
1. On cache miss, get a "lease" (token)
2. Load from database
3. Set cache with lease token
4. If lease was invalidated, set fails!
This prevents stale data from being cached after invalidation.
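The lease flow can be approximated with plain Redis primitives. This is only a sketch of the idea, not Facebook's actual memcache protocol; it assumes a redis client with decode_responses=True, and a real implementation would make the final check-and-fill atomic (e.g. with a Lua script):
import json
import uuid

async def get_with_lease(key: str, loader, ttl: int = 300):
    """On a miss, take a lease token; only the current lease holder may fill the cache."""
    cached = await redis.get(key)
    if cached:
        return json.loads(cached)
    # Miss: acquire a short-lived lease for this key
    token = str(uuid.uuid4())
    got_lease = await redis.set(f"lease:{key}", token, nx=True, ex=10)
    value = await loader()
    if got_lease and await redis.get(f"lease:{key}") == token:
        # Lease still valid (no invalidation happened while we loaded) - safe to fill
        await redis.setex(key, ttl, json.dumps(value))
    return value

async def invalidate_with_lease(key: str):
    """Invalidation removes both the cached value and any outstanding lease."""
    await redis.delete(key, f"lease:{key}")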
7.2 Case Study: Netflix — Event-Driven Invalidation
NETFLIX CACHE INVALIDATION
System: EVCache + Event-Driven
Architecture:
Content Update
│
▼
┌─────────────────┐
│ Content │
│ Service │
│ │
│ Update DB │
│ Publish Event │
└────────┬────────┘
│
│ Kafka Event
▼
┌─────────────────┐
│ Cache │
│ Invalidator │
│ │
│ Subscribe to │
│ content events │
│ │
│ Invalidate │
│ EVCache │
└─────────────────┘
Key Features:
1. ZONE-AWARE INVALIDATION
Each availability zone has its own cache
Invalidation sent to ALL zones
2. TTL AS SAFETY NET
All entries have 24-hour TTL
Events handle 99.9% of invalidations
TTL catches the rest
3. VERSIONED CONTENT
Content has version numbers
Cache key includes version
New version = automatic miss
7.3 Case Study: Amazon — Granular Invalidation
AMAZON PRODUCT CACHE INVALIDATION
Challenge:
- Different parts of product change at different rates
- Inventory: every purchase
- Price: hourly/daily
- Description: weekly/monthly
Solution: DECOMPOSED CACHING
Instead of one cache entry per product:
product:123:details TTL: 24 hours
product:123:price TTL: 5 min + events
product:123:inventory TTL: 30 sec + events
product:123:reviews TTL: 1 hour
Benefits:
- Inventory change doesn't invalidate description
- Price change doesn't invalidate reviews
- Each piece has appropriate freshness
Implementation:
async def get_product_page(product_id: str) -> dict:
# Parallel fetch of decomposed cache
details, price, inventory, reviews = await asyncio.gather(
cache.get(f"product:{product_id}:details"),
cache.get(f"product:{product_id}:price"),
cache.get(f"product:{product_id}:inventory"),
cache.get(f"product:{product_id}:reviews"),
)
# Load missing pieces from DB
# ...
return {
"details": details,
"price": price,
"inventory": inventory,
"reviews": reviews
}
7.4 Summary: Industry Patterns
| Company | Strategy | Key Innovation |
|---|---|---|
| Facebook | Lease-based | Prevents stale fills |
| Netflix | Event-driven + TTL | Zone-aware |
| Amazon | Decomposed caching | Granular invalidation |
| Twitter | Write-through | Consistency for timeline |
| Uber | Short TTL | Real-time location |
Chapter 8: Common Mistakes to Avoid
8.1 Mistake 1: Forgetting Related Caches
❌ WRONG: Only invalidate direct cache
async def update_product(product_id: str, data: dict):
await db.update(product_id, data)
await cache.delete(f"product:{product_id}")
# FORGOT: category listing, search results, homepage!
✅ CORRECT: Invalidate all related caches
async def update_product(product_id: str, data: dict):
# Get current product for related info
old_product = await db.get(product_id)
# Update database
await db.update(product_id, data)
# Invalidate ALL related caches
await invalidation.invalidate_keys([
f"product:{product_id}",
f"product_page:{product_id}",
f"category:{old_product['category_id']}",
f"brand:{old_product['brand_id']}",
# If price changed, also deals page
*(["deals:page"] if data.get('price') else []),
])
8.2 Mistake 2: Invalidating Before Commit
❌ WRONG: Invalidate inside transaction
async def update_product(product_id: str, data: dict):
async with db.transaction() as tx:
await tx.update(product_id, data)
await cache.delete(f"product:{product_id}") # BAD!
# If transaction rolls back, cache is already invalidated
# Another request might cache OLD value
✅ CORRECT: Invalidate after commit
async def update_product(product_id: str, data: dict):
async with db.transaction() as tx:
await tx.update(product_id, data)
# Transaction committed - NOW invalidate
await cache.delete(f"product:{product_id}")
8.3 Mistake 3: Using Pattern Delete at Scale
❌ WRONG: Pattern delete in hot path
async def update_category(category_id: str, data: dict):
await db.update(category_id, data)
# This scans ALL keys in Redis!
await redis.delete_pattern(f"product:*:category:{category_id}")
# At scale: millions of keys scanned, Redis blocked
✅ CORRECT: Track keys explicitly
# Maintain a set of keys per category
async def cache_product(product_id: str, category_id: str, data: dict):
await cache.set(f"product:{product_id}", data)
# Track this product in category's key set
await redis.sadd(f"category_products:{category_id}", f"product:{product_id}")
async def invalidate_category(category_id: str):
# Get tracked keys
product_keys = await redis.smembers(f"category_products:{category_id}")
# Delete specific keys (fast!)
if product_keys:
await redis.delete(*product_keys)
await redis.delete(f"category_products:{category_id}")
8.4 Mistake 4: No TTL Safety Net
❌ WRONG: Event-driven only, no TTL
async def cache_product(product_id: str, data: dict):
# No TTL! Relies entirely on events
await redis.set(f"product:{product_id}", json.dumps(data))
# If event is lost, data is stale FOREVER
✅ CORRECT: Always have TTL safety net
async def cache_product(product_id: str, data: dict):
# TTL as safety net - even if event lost, max staleness is 5 min
await redis.setex(
f"product:{product_id}",
300, # 5 minute safety TTL
json.dumps(data)
)
8.5 Mistake Checklist
Before deploying invalidation strategy:
- Related caches identified — All caches affected by each data change
- Invalidation after commit — Never inside transaction
- TTL safety net — Even with events, have TTL fallback
- No pattern deletes — Use explicit key tracking instead
- Retry logic — Invalidation should retry on failure
- Monitoring — Track invalidation latency and failures
- Documented dependencies — Map data → cache relationships
Part IV: Interview Preparation
Chapter 9: Interview Tips and Phrases
9.1 When to Discuss Invalidation
Bring up invalidation when:
- Designing any caching layer
- Data freshness is mentioned as a requirement
- Interviewer asks "what if data changes?"
- System involves prices, inventory, or real-time data
9.2 Key Phrases to Use
INTRODUCING INVALIDATION:
"For caching, the key question is invalidation. We need to
ensure users see fresh data while still benefiting from the
cache. Let me walk through our options."
EXPLAINING TTL:
"The simplest approach is TTL-based expiration. With a 5-minute
TTL, we accept up to 5 minutes of staleness in exchange for
simplicity. This works well for data like product descriptions
that rarely change."
EXPLAINING EVENT-DRIVEN:
"For price changes during a flash sale, TTL isn't fast enough.
I'd use event-driven invalidation: when the price updates, we
publish an event, and a consumer invalidates the cache within
milliseconds. Combined with a short TTL as a safety net, we
get near real-time freshness."
DISCUSSING TRADE-OFFS:
"The trade-off is freshness versus complexity. TTL is simple
but stale. Event-driven is fresh but requires infrastructure.
For this system, I'd use event-driven for prices and inventory,
but TTL-only for product descriptions."
HANDLING FAILURE QUESTIONS:
"If the invalidation event is lost—maybe Kafka has an issue—
the TTL safety net kicks in. Data might be stale for up to 5
minutes instead of milliseconds, but it won't be stale forever.
That's why I always combine events with TTL."
9.3 Questions to Ask Interviewer
- "How fresh does this data need to be? Seconds, minutes, hours?"
- "What's the acceptable user experience if they see slightly stale data?"
- "How frequently does this data change?"
- "Is there existing event infrastructure we can leverage?"
9.4 Common Follow-up Questions
| Question | Good Answer |
|---|---|
| "What if you need to invalidate millions of keys?" | "I'd use versioned keys with a global version. Incrementing one version key effectively invalidates everything. Old entries expire via TTL." |
| "How do you handle cache and DB getting out of sync?" | "Always invalidate after transaction commits. Use TTL as safety net. If we detect inconsistency, we can force refresh by bumping the version." |
| "What about thundering herd after invalidation?" | "That's a risk. For popular items, I'd use request coalescing—multiple requests for the same key share a single database fetch. Also consider 'soft' invalidation with async refresh." |
| "How do you test invalidation?" | "Integration tests that verify cache state after updates. Load tests that simulate concurrent updates. Chaos testing that kills the invalidation consumer." |
Chapter 10: Practice Problems
Problem 1: E-commerce Price Invalidation
Setup: Design cache invalidation for an e-commerce site with flash sales. When a flash sale starts, 10,000 products have price changes that must be visible to users within 1 second.
Requirements:
- 10,000 products in a flash sale
- Price must update within 1 second
- 100,000 users may be viewing products during sale start
- Can't overwhelm the database
Questions:
- How would you invalidate 10,000 product caches quickly?
- How do you prevent thundering herd when everyone reloads?
- What's your fallback if invalidation fails?
- Can you update cache instead of invalidate?
Hints:
- Consider bulk operations
- Think about cache warming before the sale
Approach: Pre-compute + Write-Update + Fallback
async def start_flash_sale(sale_id: str, product_prices: Dict[str, Decimal]):
"""
Start flash sale with immediate price visibility.
"""
# BEFORE SALE: Pre-warm cache with new prices
# Scheduled 1 second before sale start
await pre_warm_flash_sale_prices(sale_id, product_prices)
# AT SALE START: Atomic version flip
await redis.set("active_flash_sale", sale_id)
# Products check for active sale and use sale price
async def pre_warm_flash_sale_prices(sale_id: str, prices: Dict):
"""Pre-warm cache with flash sale prices."""
pipe = redis.pipeline()
for product_id, price in prices.items():
# Store sale price separately
pipe.setex(
f"flash_price:{sale_id}:{product_id}",
3600, # 1 hour (sale duration)
str(price)
)
await pipe.execute() # Bulk write - fast!
async def get_product_price(product_id: str) -> Decimal:
"""Get price, checking for active flash sale."""
# Check for active flash sale
sale_id = await redis.get("active_flash_sale")
if sale_id:
sale_price = await redis.get(f"flash_price:{sale_id}:{product_id}")
if sale_price:
return Decimal(sale_price)
# No sale or product not in sale - use regular price
return await get_cached_price(product_id)
Thundering Herd Prevention:
- Pre-warm cache BEFORE sale starts
- No invalidation needed at sale start
- Just flip a single flag
Fallback:
- If pre-warm fails, fall back to database
- Short TTL (30 sec) ensures fresh prices eventually
- Can retry pre-warm in background
Problem 2: Social Media Feed Invalidation
Setup: Design invalidation for a social media feed cache. When a user posts, all their followers should see the post in their feeds. User has between 1 and 10 million followers.
Requirements:
- Post visible in followers' feeds within 30 seconds
- Can't invalidate 10 million cache entries synchronously
- Feed should always load fast (<100ms)
Questions:
- How do you handle invalidation for users with millions of followers?
- What's different about invalidating a "celebrity" vs regular user?
- How do you balance freshness and performance?
Hints:
- Consider hybrid push/pull model
- Not all followers are online
- Think about which feeds actually need immediate updates
Approach: Hybrid Push/Pull with Activity-Based Invalidation
# Follower tiers
CELEBRITY_THRESHOLD = 10_000 # Followers
async def on_user_post(user_id: str, post: dict):
"""Handle new post - different strategies by follower count."""
follower_count = await get_follower_count(user_id)
if follower_count < CELEBRITY_THRESHOLD:
# Regular user: Push to all followers
await push_to_follower_feeds(user_id, post)
else:
# Celebrity: Don't push, pull on read
await mark_celebrity_post(user_id, post)
async def push_to_follower_feeds(user_id: str, post: dict):
"""Push post to all follower feed caches."""
followers = await get_followers(user_id)
# Only push to ACTIVE followers (online recently)
active_followers = [f for f in followers if await is_active(f)]
# Invalidate their feed caches
keys = [f"feed:{f}" for f in active_followers]
await invalidation.invalidate_keys(keys)
# Inactive followers will get fresh feed on next login
async def get_feed(user_id: str) -> List[dict]:
"""Get feed, merging cached + celebrity posts."""
# Get cached feed (pushed posts from regular users)
cached_feed = await cache.get(f"feed:{user_id}")
# Get celebrity posts (pulled on read)
celebrity_posts = await get_followed_celebrity_posts(user_id)
# Merge and sort
all_posts = (cached_feed or []) + celebrity_posts
all_posts.sort(key=lambda p: p['timestamp'], reverse=True)
return all_posts[:50] # Top 50 posts
Key Insights:
- Don't invalidate inactive users' caches
- Celebrity posts pulled on read (no push amplification)
- Regular users' posts pushed to active followers
- Hybrid approach scales to any follower count
Problem 3: Multi-Region Cache Invalidation
Setup: Your application runs in 3 regions (US, EU, Asia). Each region has its own Redis cluster. When data changes, all regions must invalidate within 5 seconds.
Requirements:
- Write happens in one region
- All 3 regions must invalidate
- Network between regions has ~100ms latency
- If one region's Redis is down, others should still invalidate
Questions:
- How do you propagate invalidation across regions?
- How do you handle a region being temporarily unreachable?
- How do you ensure consistency across regions?
Hints:
- Consider async event propagation
- Each region needs independent processing
- Think about conflict resolution
Approach: Event Bus with Regional Consumers
# Global event bus (Kafka with multi-region replication)
async def invalidate_global(keys: List[str], source_region: str):
"""Publish invalidation to all regions."""
event = {
"type": "invalidate",
"keys": keys,
"source_region": source_region,
"timestamp": datetime.utcnow().isoformat(),
"id": str(uuid.uuid4())
}
# Publish to global topic (replicated to all regions)
await kafka.send("global-cache-invalidation", event)
# Each region runs its own consumer
class RegionalInvalidationConsumer:
def __init__(self, region: str, redis_client):
self.region = region
self.redis = redis_client
        # In-memory dedup map with expiry (e.g. cachetools.TTLCache),
        # not the Redis-backed TTLCache class from Chapter 2
        self.processed_ids = TTLCache(maxsize=100_000, ttl=300)
async def process(self, event: dict):
# Deduplication (event might arrive multiple times)
if event["id"] in self.processed_ids:
return
self.processed_ids[event["id"]] = True
# Invalidate in this region
await self.redis.delete(*event["keys"])
logger.info(f"Region {self.region} invalidated {len(event['keys'])} keys")
# Regional Redis failure handling
async def invalidate_with_fallback(redis_client, keys: List[str]):
try:
await redis_client.delete(*keys)
except RedisError as e:
logger.error(f"Regional Redis down: {e}")
# Store invalidation for retry when Redis recovers
await store_pending_invalidation(keys)
# Alert ops team
await alert("Regional cache invalidation failed")
Architecture:
              Write in US-East
                     │
                     ▼
    Global Kafka Topic (replicated to all regions)
                     │
         ┌───────────┼────────────┐
         │           │            │
         ▼           ▼            ▼
    US Consumer  EU Consumer  Asia Consumer
         │           │            │
         ▼           ▼            ▼
     US Redis     EU Redis    Asia Redis
Key Insights:
- Global event bus handles cross-region propagation
- Each region processes independently
- Deduplication prevents double-processing
- Pending queue handles temporary Redis outage
- 5 second target achieved with Kafka replication
Chapter 11: Mock Interview Dialogue
Scenario: Design Invalidation for a News Site
Interviewer: "We're building a news site. Articles are written by journalists, edited, then published. Once published, they might be updated with corrections. How would you handle cache invalidation?"
You: "Good question. Let me understand the data flow first.
So an article goes through: draft → editing → published → possibly updated. Users only see published articles. How frequently are published articles updated?"
Interviewer: "Breaking news might get updates every few minutes. Regular articles maybe once a day if there's a correction."
You: "Got it. And what's the traffic pattern? I assume there's a power law—some articles get millions of views, most get very few?"
Interviewer: "Exactly. Top 100 articles might get 90% of traffic."
You: "Perfect. Here's my approach:
For invalidation strategy, I'd use event-driven invalidation with TTL safety net.
When an article is updated:
- Editor saves changes to database
- CMS publishes an 'article.updated' event to Kafka
- Cache invalidation consumer receives event
- Deletes the article cache entry
For the TTL, I'd use 5 minutes. Most articles don't change often, so a 5-minute stale window is acceptable. But the event-driven invalidation means most updates are visible within seconds."
Interviewer: "What about the homepage that shows the latest articles?"
You: "The homepage is different. It's not one article—it's a computed view of 'top 10 articles right now.' I'd handle it differently:
For the homepage, I'd use short TTL with background refresh. TTL of 60 seconds, and a background job that refreshes the cache every 30 seconds. This means:
- Homepage is always served from cache (fast)
- Never more than 30 seconds stale
- No thundering herd because background job refreshes it
For the 'breaking news' banner specifically, I might use an even shorter TTL—10 seconds—or even skip caching entirely since it's a single small query."
Interviewer: "What if an article is published with a typo and the editor fixes it immediately? The cache has the typo for 5 minutes?"
You: "That's where event-driven invalidation helps. The flow is:
- Article published with typo
- Cache populated with typo
- Editor notices, fixes typo
- 'article.updated' event published
- Cache invalidated within milliseconds
- Next reader gets corrected version
The 5-minute TTL is just a safety net in case the event is lost. In normal operation, the event invalidates the cache almost immediately.
To make it even faster, I could update the cache directly instead of invalidating:
async def on_article_updated(event):
article_id = event['article_id']
# Option A: Invalidate (requires DB hit on next read)
await cache.delete(f"article:{article_id}")
# Option B: Update (no DB hit, faster for readers)
article = event['data'] # Include full article in event
await cache.set(f"article:{article_id}", article, ttl=300)
I'd use Option B for breaking news articles where every second counts."
Interviewer: "How do you handle a situation where the CMS is down but you need to invalidate cache?"
You: "That's an important failure scenario. A few options:
- TTL as fallback — Even without CMS, cache expires in 5 minutes. Not ideal but bounded staleness.
- Manual invalidation tool — Build an admin endpoint that ops can call directly:
  POST /admin/invalidate {"keys": ["article:12345"]}
- Version-based emergency invalidation — If we need to invalidate ALL articles (major bug, security issue):
  # Increment global version
  await redis.incr("article_cache_version")
  # All article cache keys now miss
The version approach is like a nuclear option—use it rarely, but it's good to have."
Summary
DAY 2 KEY TAKEAWAYS
CORE PROBLEM:
• Cache invalidation is hard because cache doesn't know when data changes
• Balance freshness, simplicity, and performance
• Different data needs different strategies
INVALIDATION STRATEGIES:
TTL-Based:
• Simplest approach
• Data stale for up to TTL duration
• Use for: Static content, low-change data
Event-Driven:
• Near real-time freshness
• Requires event infrastructure
• Use for: Prices, inventory, critical data
Versioned Keys:
• Increment version to invalidate
• Good for bulk invalidation
• Use for: Deployments, bulk imports
Hybrid (TTL + Events):
• Best of both worlds
• Events for speed, TTL for safety
• Use for: Most production systems
KEY DECISIONS:
• Write-invalidate (delete) safer than write-update
• Always invalidate AFTER transaction commits
• Always have TTL safety net with event-driven
COMMON MISTAKES:
• Forgetting related caches
• Invalidating inside transaction
• Pattern delete at scale
• No TTL safety net
DEFAULT CHOICE:
• Event-driven + 5 minute TTL safety net
• Write-invalidate (delete on change)
• Decomposed caching for different freshness needs
📚 Further Reading
Official Documentation
- Redis Keyspace Notifications: https://redis.io/docs/manual/keyspace-notifications/
- Kafka Documentation: https://kafka.apache.org/documentation/
Engineering Blogs
- Facebook TAO: "TAO: Facebook's Distributed Data Store for the Social Graph"
- Netflix EVCache: "Caching for a Global Netflix"
- Uber Schemaless: "Designing Schemaless, Uber Engineering's Scalable Datastore"
Papers
- "Scaling Memcache at Facebook": https://research.facebook.com/publications/scaling-memcache-at-facebook/
Books
- "Designing Data-Intensive Applications" by Martin Kleppmann — Chapter 5
End of Day 2: Invalidation Strategies
Tomorrow: Day 3 — Thundering Herd. We've learned how to invalidate caches. Tomorrow, we tackle what happens when thousands of requests simultaneously hit a cache miss. You'll learn locking, request coalescing, and probabilistic early expiration—the tools to protect your database from cache stampedes.