Himanshu Kukreja

Week 4 — Day 2: Invalidation Strategies

System Design Mastery Series


Preface

Yesterday, you learned the four caching patterns. You can now put data into a cache efficiently.

Today, we tackle the harder problem: getting stale data out.

THE INCIDENT

Monday, 2:47 PM — Customer Support tickets spike

Ticket #4521: "Product shows $99, but I was charged $79"
Ticket #4522: "Price on page doesn't match cart"
Ticket #4523: "Flash sale price not showing"
Ticket #4524: "Wrong price displayed"
... 147 more tickets

Investigation:

  10:00 AM — Marketing starts flash sale
  10:00 AM — Database updated: price $99 → $79
  10:00 AM — Cache still has: price $99
  
  Cache TTL: 1 hour
  
  10:00 - 11:00 AM — Users see $99, charged $79
  11:00 AM — Cache expires, new price $79 appears
  
  Result:
    - 147 confused customers
    - Support team overwhelmed  
    - Trust damaged
    - "Why didn't the cache update?"

The cache did exactly what it was told.
The problem: nobody told it the price changed.

This is the cache invalidation problem — and it's notoriously hard.

Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. Today, you'll understand why invalidation is hard — and learn the strategies that actually work.


Part I: Foundations

Chapter 1: Why Invalidation Is Hard

1.1 The Core Problem

A cache is a copy of data. When the original changes, the copy becomes stale.

THE STALENESS PROBLEM

Source of Truth (Database):
  T0: product.price = $100
  T1: product.price = $79 (updated!)
  T2: product.price = $79
  T3: product.price = $79

Cache:
  T0: product.price = $100 (correct)
  T1: product.price = $100 (STALE!)
  T2: product.price = $100 (STALE!)
  T3: product.price = $100 (STALE!)

The cache doesn't know the database changed.
How do we tell it?

1.2 The Invalidation Trilemma

You can optimize for two of three properties:

THE INVALIDATION TRILEMMA

                    Freshness
                       /\
                      /  \
                     /    \
                    /      \
                   /        \
                  /__________\
           Simplicity    Performance


FRESHNESS: Data in cache matches database
SIMPLICITY: Easy to implement and maintain
PERFORMANCE: Low latency, high throughput

Trade-offs:

TTL-based:
  ✓ Simple
  ✓ Good performance
  ✗ Not always fresh (stale until TTL expires)

Event-driven:
  ✓ Fresh (near real-time)
  ✗ Complex (need event infrastructure)
  ✓ Good performance

Write-through:
  ✓ Always fresh
  ✗ Complex (coupling)
  ✗ Slower writes (must update cache)

1.3 When Staleness Matters

Not all data needs the same freshness:

Data Type             Staleness Tolerance   Why
Stock prices          < 1 second            Trading decisions
Inventory count       < 30 seconds          Overselling risk
Product price         < 5 minutes           Customer trust
Product description   Hours                 Rarely changes
User profile          Minutes to hours      Low impact
Static content        Days                  Never changes

DESIGN PRINCIPLE

Match invalidation strategy to staleness tolerance.

Don't use the same strategy for everything.
Product descriptions don't need real-time invalidation.
Inventory counts probably do.
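
To make the principle concrete, here is a minimal sketch of such a mapping; the names and values are illustrative, taken from the table above, and would be tuned per system.

# Illustrative staleness tolerances (seconds), following the table above
STALENESS_TOLERANCE_SECONDS = {
    "stock_price": 1,
    "inventory_count": 30,
    "product_price": 5 * 60,
    "product_description": 4 * 60 * 60,
    "user_profile": 60 * 60,
    "static_content": 7 * 24 * 60 * 60,
}


def ttl_for(data_type: str, default: int = 300) -> int:
    """Pick a TTL from the staleness tolerance of a data type."""
    return STALENESS_TOLERANCE_SECONDS.get(data_type, default)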

1.4 Key Terminology

Term                 Definition
Stale data           Cached data that differs from the source of truth
TTL (Time-To-Live)   Duration before a cache entry expires
Invalidation         Removing or updating stale cache entries
Cache miss           Data not in cache, must fetch from source
Write-invalidate     Delete the cache entry when data changes
Write-update         Update the cache entry when data changes
Tombstone            Marker indicating deleted data

Chapter 2: Invalidation Strategies

2.1 Strategy 1: Time-To-Live (TTL)

The simplest strategy: cache entries automatically expire after a set duration.

TTL-BASED INVALIDATION

Timeline:
  T0: Cache MISS, load from DB, cache with TTL=300s
  T0-T300: Cache HIT (even if DB changes!)
  T300: Cache entry expires
  T301: Cache MISS, reload from DB

How it works:
  cache.setex("product:123", 300, product_data)
                             ↑
                         TTL in seconds

After 300 seconds, key automatically deleted.
Next request triggers cache miss and reload.

Implementation:

# TTL-Based Invalidation

import json


class TTLCache:
    """
    Simple TTL-based caching.
    
    Pros: Simple, self-healing
    Cons: Stale for up to TTL duration
    """
    
    def __init__(self, db, redis_client, default_ttl: int = 300):
        self.db = db
        self.redis = redis_client
        self.default_ttl = default_ttl
    
    async def get_product(self, product_id: str) -> dict:
        cache_key = f"product:{product_id}"
        
        # Check cache
        cached = await self.redis.get(cache_key)
        if cached:
            return json.loads(cached)
        
        # Miss: load from database
        product = await self.db.fetch_one(
            "SELECT * FROM products WHERE id = $1",
            product_id
        )
        
        # Cache with TTL
        await self.redis.setex(
            cache_key,
            self.default_ttl,  # Expires in 300 seconds
            json.dumps(dict(product))
        )
        
        return dict(product)

Choosing TTL Values:

TTL SELECTION GUIDE

Question: "How stale is acceptable?"

Very fresh needed (seconds):
  - Inventory counts during sales
  - Real-time dashboards
  - Session data (sliding expiration)
  TTL: 10-60 seconds

Moderate freshness (minutes):
  - Product prices
  - User preferences
  - Search results
  TTL: 1-15 minutes

Low freshness OK (hours):
  - Product descriptions
  - Category listings
  - User profiles (non-critical)
  TTL: 1-24 hours

Static content (days):
  - Images, CSS, JS
  - Configuration that rarely changes
  - Historical data
  TTL: 1-30 days


FORMULA FOR TTL:

TTL = (acceptable_staleness) × (safety_factor)

Example:
  "Users can tolerate 5 minutes of stale prices"
  TTL = 5 minutes × 0.8 = 4 minutes
  
  Safety factor accounts for clock skew, network delays
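
The same rule as a tiny helper (a sketch; the 0.8 safety factor is the example value above):

def choose_ttl(acceptable_staleness_seconds: float, safety_factor: float = 0.8) -> int:
    """TTL = acceptable staleness x safety factor, in whole seconds."""
    return int(acceptable_staleness_seconds * safety_factor)


# "Users can tolerate 5 minutes of stale prices" -> 240 seconds (4 minutes)
assert choose_ttl(5 * 60) == 240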

Pros:

  • Extremely simple to implement
  • Self-healing (stale data eventually expires)
  • No external dependencies
  • Predictable behavior

Cons:

  • Data is stale for up to TTL duration
  • No control over when refresh happens
  • Short TTL = more cache misses = more DB load
  • Long TTL = longer staleness

2.2 Strategy 2: Event-Driven Invalidation

When data changes, publish an event. Consumers invalidate relevant cache entries.

EVENT-DRIVEN INVALIDATION

Write Path:
  1. Update database
  2. Publish "product.updated" event
  3. Cache invalidator consumes event
  4. Delete/update cache entry

Timeline:
  T0: DB update + publish event
  T0.1: Event consumed
  T0.2: Cache invalidated
  T0.3: Next read gets fresh data

Latency: ~100-500ms (near real-time)


                    ┌─────────────────┐
                    │  Product        │
  Update ──────────▶│  Service        │
                    │                 │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
       ┌─────────────┐              ┌─────────────┐
       │  Database   │              │   Kafka     │
       │  (write)    │              │  (event)    │
       └─────────────┘              └──────┬──────┘
                                           │
                                           ▼
                                   ┌─────────────┐
                                   │   Cache     │
                                   │ Invalidator │
                                   └──────┬──────┘
                                          │
                                          ▼
                                   ┌─────────────┐
                                   │   Redis     │
                                   │  (delete)   │
                                   └─────────────┘

Implementation:

# Event-Driven Invalidation

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
import json
import logging

logger = logging.getLogger(__name__)


@dataclass
class ProductUpdatedEvent:
    """Event published when a product changes."""
    product_id: str
    changed_fields: List[str]
    timestamp: str
    source: str  # Which service made the change


class ProductService:
    """
    Product service that publishes events on changes.
    """
    
    def __init__(self, db, kafka_producer, cache):
        self.db = db
        self.kafka = kafka_producer
        self.cache = cache
    
    async def update_product(self, product_id: str, updates: dict) -> dict:
        """Update product and publish event for cache invalidation."""
        
        # Update database
        product = await self.db.fetch_one(
            """
            UPDATE products 
            SET name = COALESCE($2, name),
                price = COALESCE($3, price),
                updated_at = NOW()
            WHERE id = $1
            RETURNING *
            """,
            product_id, updates.get('name'), updates.get('price')
        )
        
        # Publish event for cache invalidation
        event = ProductUpdatedEvent(
            product_id=product_id,
            changed_fields=list(updates.keys()),
            timestamp=datetime.utcnow().isoformat(),
            source="product-service"
        )
        
        await self.kafka.send(
            "product-events",
            key=product_id.encode(),
            value=json.dumps(event.__dict__).encode()
        )
        
        return dict(product)


class CacheInvalidationConsumer:
    """
    Consumes product events and invalidates cache.
    
    Runs as a separate service/process.
    """
    
    def __init__(self, kafka_consumer, redis_client):
        self.kafka = kafka_consumer
        self.redis = redis_client
    
    async def run(self):
        """Main consumer loop."""
        async for message in self.kafka:
            try:
                event = json.loads(message.value.decode())
                await self.handle_event(event)
                await self.kafka.commit()
            except Exception as e:
                logger.error(f"Failed to process event: {e}")
                # Send to DLQ (Week 3, Day 4)
    
    async def handle_event(self, event: dict):
        """Handle a product event."""
        event_type = event.get('type', 'product.updated')
        product_id = event['product_id']
        
        if event_type in ('product.updated', 'product.deleted'):
            # Invalidate product cache
            await self.redis.delete(f"product:{product_id}")
            
            # Also invalidate related caches
            await self.redis.delete(f"product_page:{product_id}")
            await self.redis.delete(f"category_products:{event.get('category_id')}")
            
            logger.info(f"Invalidated cache for product {product_id}")
        
        elif event_type == 'product.price_changed':
            # Price change might affect more things
            await self.redis.delete(f"product:{product_id}")
            await self.redis.delete("deals_page")  # Might appear in deals
            
            # Note: DEL does not expand wildcards, so "search_results:*" can't be
            # removed this way - use a SCAN-based pattern delete (see
            # delete_by_pattern below) or give search caches their own short TTL

Handling Related Caches:

# Products appear in multiple caches - invalidate all of them

CACHE_DEPENDENCIES = {
    "product": [
        "product:{id}",           # Direct product cache
        "product_page:{id}",      # Rendered product page
        "category:{category_id}", # Category listing
        "search:*",               # Search results (pattern)
    ],
    "inventory": [
        "product:{id}",           # Product includes stock info
        "product_page:{id}",
        "availability:{id}",
    ],
    "price": [
        "product:{id}",
        "product_page:{id}",
        "deals_page",             # If product is on sale
        "cart:{user_id}",         # User's cart might show price
    ]
}


async def invalidate_product(product_id: str, change_type: str, metadata: dict):
    """Invalidate all caches affected by a product change."""
    
    patterns = CACHE_DEPENDENCIES.get(change_type, ["product:{id}"])
    
    for pattern in patterns:
        # Replace placeholders
        key = pattern.format(
            id=product_id,
            category_id=metadata.get('category_id', '*'),
            user_id='*'  # Wildcard for user-specific caches
        )
        
        if '*' in key:
            # Pattern delete (use with caution - expensive!)
            await delete_by_pattern(key)
        else:
            await redis.delete(key)
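
The delete_by_pattern helper used above is not a Redis built-in. Here is a sketch of one way to implement it with non-blocking SCAN, assuming the same module-level async redis client as the snippet above:

async def delete_by_pattern(pattern: str, batch_size: int = 100) -> int:
    """Delete all keys matching a glob pattern via SCAN (expensive - keep out of hot paths)."""
    deleted = 0
    cursor = 0
    while True:
        # SCAN walks the keyspace incrementally instead of blocking Redis like KEYS would
        cursor, keys = await redis.scan(cursor=cursor, match=pattern, count=batch_size)
        if keys:
            await redis.delete(*keys)
            deleted += len(keys)
        if cursor == 0:
            break
    return deleted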

Pros:

  • Near real-time freshness
  • Precise control over what gets invalidated
  • Efficient (only invalidate what changed)
  • Scales well with proper event infrastructure

Cons:

  • Requires event infrastructure (Kafka, etc.)
  • More complex to implement
  • Event delivery isn't guaranteed (need DLQ)
  • Coupling between services

2.3 Strategy 3: Versioned Keys

Include a version in the cache key. When data changes, increment the version.

VERSIONED KEYS

Instead of:
  cache.set("product:123", data)

Use:
  cache.set("product:123:v5", data)

When product changes:
  - Increment version: v5 → v6
  - New key: "product:123:v6"
  - Old key "product:123:v5" still exists but never accessed
  - Old key expires via TTL


Timeline:
  T0: Version = 5, cache key = "product:123:v5"
  T1: Product updated, version = 6
  T2: Read uses key "product:123:v6" - MISS (new key)
  T3: Load from DB, cache as "product:123:v6"
  T4: Old "product:123:v5" expires naturally

Implementation:

# Versioned Key Invalidation

import json
from typing import Optional


class VersionedCache:
    """
    Cache with versioned keys.
    
    Version is stored separately and incremented on updates.
    Old versions expire naturally via TTL.
    """
    
    def __init__(self, redis_client, default_ttl: int = 3600):
        self.redis = redis_client
        self.default_ttl = default_ttl
    
    async def get_version(self, entity: str, entity_id: str) -> int:
        """Get current version for an entity."""
        version_key = f"version:{entity}:{entity_id}"
        version = await self.redis.get(version_key)
        return int(version) if version else 1
    
    async def increment_version(self, entity: str, entity_id: str) -> int:
        """Increment version (effectively invalidates cache)."""
        version_key = f"version:{entity}:{entity_id}"
        new_version = await self.redis.incr(version_key)
        return new_version
    
    def _build_key(self, entity: str, entity_id: str, version: int) -> str:
        """Build versioned cache key."""
        return f"{entity}:{entity_id}:v{version}"
    
    async def get(self, entity: str, entity_id: str) -> Optional[dict]:
        """Get cached value using current version."""
        version = await self.get_version(entity, entity_id)
        cache_key = self._build_key(entity, entity_id, version)
        
        cached = await self.redis.get(cache_key)
        if cached:
            return json.loads(cached)
        return None
    
    async def set(self, entity: str, entity_id: str, value: dict) -> None:
        """Set cached value with current version."""
        version = await self.get_version(entity, entity_id)
        cache_key = self._build_key(entity, entity_id, version)
        
        await self.redis.setex(
            cache_key,
            self.default_ttl,
            json.dumps(value)
        )
    
    async def invalidate(self, entity: str, entity_id: str) -> int:
        """Invalidate by incrementing version."""
        return await self.increment_version(entity, entity_id)


# Usage
class ProductRepository:
    def __init__(self, db, cache: VersionedCache):
        self.db = db
        self.cache = cache
    
    async def get_product(self, product_id: str) -> dict:
        # Try cache with current version
        cached = await self.cache.get("product", product_id)
        if cached:
            return cached
        
        # Load from database
        product = await self.db.fetch_one(
            "SELECT * FROM products WHERE id = $1",
            product_id
        )
        
        # Cache with current version
        await self.cache.set("product", product_id, dict(product))
        return dict(product)
    
    async def update_product(self, product_id: str, data: dict) -> dict:
        # Update database
        product = await self.db.fetch_one(
            "UPDATE products SET ... WHERE id = $1 RETURNING *",
            product_id
        )
        
        # Invalidate by incrementing version
        # Next read will miss (new version key doesn't exist)
        await self.cache.invalidate("product", product_id)
        
        return dict(product)

Global Version for Bulk Invalidation:

# Invalidate ALL products at once (e.g., after bulk import)

class GlobalVersionCache:
    """Cache with global version for bulk invalidation."""
    
    def __init__(self, redis_client):
        self.redis = redis_client
    
    async def get_global_version(self, entity: str) -> int:
        """Global version affects all entities of this type."""
        version = await self.redis.get(f"global_version:{entity}")
        return int(version) if version else 1
    
    async def increment_global_version(self, entity: str) -> int:
        """Invalidate ALL cached entities of this type."""
        return await self.redis.incr(f"global_version:{entity}")
    
    def _build_key(self, entity: str, entity_id: str, 
                   entity_version: int, global_version: int) -> str:
        """Key includes both entity and global version."""
        return f"{entity}:{entity_id}:v{entity_version}:g{global_version}"


# Usage: After bulk product import
await cache.increment_global_version("product")
# All product cache keys now miss (global version changed)

Pros:

  • No explicit deletion needed
  • Atomic invalidation (version increment is atomic)
  • Supports bulk invalidation via global version
  • Old data expires naturally

Cons:

  • More complex key management
  • Storage overhead (old versions remain until TTL)
  • Need to track version numbers
  • Extra Redis call to get version

2.4 Strategy 4: Write-Invalidate vs Write-Update

When data changes, should you delete or update the cache?

WRITE-INVALIDATE (Delete)

On write:
  1. Update database
  2. DELETE cache entry
  
Next read:
  3. Cache miss
  4. Load from database
  5. Populate cache


WRITE-UPDATE (Update)

On write:
  1. Update database
  2. SET cache entry with new value
  
Next read:
  3. Cache hit (already updated)


Which is better?

The Race Condition Problem with Write-Update:

RACE CONDITION WITH WRITE-UPDATE

Two concurrent requests update the same product:

Request A (price = $90):
  T1: Read from DB (price = $100)
  T2: Update DB to $90
  T4: Update cache to $90  ← Happens LATER

Request B (price = $80):
  T1.5: Read from DB (price = $100)
  T3: Update DB to $80
  T3.5: Update cache to $80  ← Happens FIRST

Timeline:
  T1:   A reads DB ($100)
  T1.5: B reads DB ($100)
  T2:   A writes DB ($90)
  T3:   B writes DB ($80)    ← DB now has $80 (correct)
  T3.5: B updates cache ($80)
  T4:   A updates cache ($90) ← Cache now has $90 (WRONG!)

Result:
  Database: $80 (correct - last write wins)
  Cache: $90 (STALE - from earlier update)


WITH WRITE-INVALIDATE:

  T3:   B writes DB ($80)
  T3.5: B deletes cache
  T4:   A deletes cache  ← Idempotent, no problem

Result:
  Database: $80
  Cache: empty (next read will load fresh)
  
No race condition!

Implementation Comparison:

# Write-Invalidate (Recommended)

async def update_product_invalidate(product_id: str, data: dict):
    """Update database, then DELETE cache."""
    
    # 1. Update database
    product = await db.fetch_one(
        "UPDATE products SET price = $2 WHERE id = $1 RETURNING *",
        product_id, data['price']
    )
    
    # 2. Delete cache (invalidate)
    await cache.delete(f"product:{product_id}")
    
    # Next read will miss and load fresh data
    return dict(product)


# Write-Update (Use with caution)

async def update_product_update(product_id: str, data: dict):
    """Update database, then UPDATE cache."""
    
    # 1. Update database
    product = await db.fetch_one(
        "UPDATE products SET price = $2 WHERE id = $1 RETURNING *",
        product_id, data['price']
    )
    
    # 2. Update cache with new value
    # RISK: Race condition if concurrent updates!
    await cache.set(f"product:{product_id}", dict(product), ttl=300)
    
    return dict(product)

When Write-Update Is Safe:

# Write-Update is safe when:
# 1. Single writer (no concurrent updates)
# 2. Using atomic operations
# 3. Data doesn't have complex dependencies

# Example: Incrementing a counter (atomic operation)
async def increment_view_count(product_id: str):
    """Atomic increment - safe for concurrent updates."""
    
    # Increment in both places atomically
    await db.execute(
        "UPDATE products SET view_count = view_count + 1 WHERE id = $1",
        product_id
    )
    
    # INCR is atomic in Redis
    await cache.incr(f"product:{product_id}:views")

Recommendation:

DEFAULT: Write-Invalidate (delete cache on update)

Use Write-Update only when:
  ✓ Single writer per key
  ✓ Atomic operations (INCR, HINCRBY)
  ✓ Performance critical AND race risk is low
  ✓ You've carefully analyzed the race conditions

Chapter 3: Hybrid Strategies

3.1 TTL + Event-Driven (The Safety Net Pattern)

Combine TTL with event-driven for best of both worlds.

SAFETY NET PATTERN

Primary: Event-driven invalidation (fast, precise)
Fallback: Short TTL (catches missed events)

Event published → Cache invalidated in ~100ms
Event lost → Cache expires in 5 minutes anyway

You get:
  - Near real-time freshness (99% of the time)
  - Guaranteed maximum staleness (TTL)
  - Self-healing (even if events fail)

Implementation:

# Safety Net Pattern: Event-driven + TTL

class SafetyNetCache:
    """
    Event-driven invalidation with TTL safety net.
    
    - Events provide fast invalidation
    - TTL ensures stale data eventually expires
    - Best of both worlds
    """
    
    def __init__(self, db, redis_client, kafka_producer, safety_ttl: int = 300):
        self.db = db
        self.redis = redis_client
        self.kafka = kafka_producer
        self.safety_ttl = safety_ttl  # Maximum staleness
    
    async def set(self, key: str, value: any) -> None:
        """Cache with safety TTL."""
        await self.redis.setex(
            key,
            self.safety_ttl,  # Even without invalidation, expires in 5 min
            json.dumps(value)
        )
    
    async def invalidate(self, key: str) -> None:
        """Explicit invalidation via event."""
        await self.redis.delete(key)
    
    async def update_product(self, product_id: str, data: dict):
        """Update with event-driven invalidation."""
        
        # Update database
        product = await self.db.fetch_one(
            "UPDATE products SET ... WHERE id = $1 RETURNING *",
            product_id
        )
        
        # Publish invalidation event
        await self.kafka.send("cache-invalidation", {
            "type": "invalidate",
            "keys": [f"product:{product_id}"],
            "timestamp": datetime.utcnow().isoformat()
        })
        
        # Even if event is lost, TTL ensures max 5 min staleness
        return dict(product)


# Invalidation consumer
async def handle_invalidation_event(event: dict):
    """Process invalidation events."""
    for key in event.get("keys", []):
        await redis.delete(key)
        logger.info(f"Invalidated {key}")

3.2 Different Strategies for Different Data

MULTI-STRATEGY APPROACH

Product Catalog:
├── product.name, product.description
│   └── TTL: 1 hour (rarely changes)
│
├── product.price
│   └── Event-driven + TTL 5 min (changes occasionally, freshness matters)
│
├── product.inventory
│   └── Event-driven + TTL 30 sec (changes frequently, accuracy critical)
│
└── product.images
    └── TTL: 24 hours (almost never changes)

Implementation:

# Multi-Strategy Cache Configuration

from dataclasses import dataclass
from enum import Enum
from typing import Optional, List


class InvalidationStrategy(Enum):
    TTL_ONLY = "ttl_only"
    EVENT_DRIVEN = "event_driven"
    VERSIONED = "versioned"
    WRITE_THROUGH = "write_through"


@dataclass
class CachePolicy:
    """Defines caching behavior for a data type."""
    strategy: InvalidationStrategy
    ttl: int
    event_topics: Optional[List[str]] = None
    version_key: Optional[str] = None


# Define policies for different data types
CACHE_POLICIES = {
    "product_details": CachePolicy(
        strategy=InvalidationStrategy.TTL_ONLY,
        ttl=3600  # 1 hour - rarely changes
    ),
    
    "product_price": CachePolicy(
        strategy=InvalidationStrategy.EVENT_DRIVEN,
        ttl=300,  # 5 min safety net
        event_topics=["price-changes", "flash-sales"]
    ),
    
    "product_inventory": CachePolicy(
        strategy=InvalidationStrategy.EVENT_DRIVEN,
        ttl=30,  # 30 sec safety net - changes frequently
        event_topics=["inventory-updates"]
    ),
    
    "user_session": CachePolicy(
        strategy=InvalidationStrategy.TTL_ONLY,
        ttl=1800  # 30 min sliding expiration
    ),
    
    "homepage_content": CachePolicy(
        strategy=InvalidationStrategy.VERSIONED,
        ttl=86400,  # 24 hours
        version_key="homepage_version"
    ),
}


class SmartCache:
    """
    Cache that applies different strategies based on data type.
    """
    
    def __init__(self, redis_client, policies: dict = CACHE_POLICIES):
        self.redis = redis_client
        self.policies = policies
    
    def get_policy(self, data_type: str) -> CachePolicy:
        """Get caching policy for data type."""
        return self.policies.get(data_type, CachePolicy(
            strategy=InvalidationStrategy.TTL_ONLY,
            ttl=300  # Default 5 min
        ))
    
    async def get(self, data_type: str, key: str) -> Optional[dict]:
        """Get from cache using appropriate strategy."""
        policy = self.get_policy(data_type)
        
        if policy.strategy == InvalidationStrategy.VERSIONED:
            return await self._get_versioned(data_type, key, policy)
        else:
            return await self._get_simple(data_type, key)
    
    async def set(self, data_type: str, key: str, value: dict) -> None:
        """Set in cache using appropriate strategy."""
        policy = self.get_policy(data_type)
        
        if policy.strategy == InvalidationStrategy.VERSIONED:
            await self._set_versioned(data_type, key, value, policy)
        else:
            await self._set_simple(data_type, key, value, policy.ttl)
    
    async def _get_simple(self, data_type: str, key: str) -> Optional[dict]:
        cache_key = f"{data_type}:{key}"
        cached = await self.redis.get(cache_key)
        return json.loads(cached) if cached else None
    
    async def _set_simple(self, data_type: str, key: str, value: dict, ttl: int):
        cache_key = f"{data_type}:{key}"
        await self.redis.setex(cache_key, ttl, json.dumps(value))
    
    async def _get_versioned(self, data_type: str, key: str, policy: CachePolicy):
        version = await self.redis.get(policy.version_key) or "1"
        cache_key = f"{data_type}:{key}:v{version}"
        cached = await self.redis.get(cache_key)
        return json.loads(cached) if cached else None
    
    async def _set_versioned(self, data_type: str, key: str, value: dict, policy: CachePolicy):
        version = await self.redis.get(policy.version_key) or "1"
        cache_key = f"{data_type}:{key}:v{version}"
        await self.redis.setex(cache_key, policy.ttl, json.dumps(value))

3.3 Strategy Comparison

Strategy        Freshness   Complexity    Use Case
TTL only        Minutes     Low           Static content, tolerant reads
Event-driven    Seconds     High          Prices, inventory, critical data
Versioned       Seconds     Medium        Bulk updates, deployments
TTL + Events    Seconds     Medium-High   Best balance for most systems
Write-through   Immediate   Medium        Critical single-writer data

Part II: Implementation

Chapter 4: Building an Invalidation System

4.1 Requirements

A production invalidation system needs:

  1. Multiple strategies — Different data needs different freshness
  2. Event handling — Process invalidation events reliably
  3. Bulk operations — Invalidate many keys efficiently
  4. Monitoring — Track invalidation success/failure
  5. Failure handling — Don't lose invalidations

4.2 Production Implementation

# Production Cache Invalidation System

import asyncio
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Set
from enum import Enum

logger = logging.getLogger(__name__)


# =============================================================================
# Configuration
# =============================================================================

@dataclass
class InvalidationConfig:
    """Configuration for invalidation system."""
    
    # Event processing
    event_topic: str = "cache-invalidation"
    consumer_group: str = "cache-invalidator"
    batch_size: int = 100
    
    # Safety
    max_keys_per_invalidation: int = 1000
    enable_pattern_delete: bool = False  # Dangerous at scale
    
    # Monitoring
    enable_metrics: bool = True
    slow_invalidation_threshold_ms: float = 100


# =============================================================================
# Invalidation Events
# =============================================================================

class InvalidationType(Enum):
    DELETE = "delete"           # Delete specific keys
    PATTERN = "pattern"         # Delete keys matching pattern
    VERSION = "version"         # Increment version
    REFRESH = "refresh"         # Delete and pre-warm


@dataclass
class InvalidationEvent:
    """Event requesting cache invalidation."""
    type: InvalidationType
    keys: List[str] = field(default_factory=list)
    pattern: Optional[str] = None
    version_key: Optional[str] = None
    source: str = "unknown"
    timestamp: datetime = field(default_factory=datetime.utcnow)
    correlation_id: Optional[str] = None
    
    def to_dict(self) -> dict:
        return {
            "type": self.type.value,
            "keys": self.keys,
            "pattern": self.pattern,
            "version_key": self.version_key,
            "source": self.source,
            "timestamp": self.timestamp.isoformat(),
            "correlation_id": self.correlation_id
        }
    
    @classmethod
    def from_dict(cls, data: dict) -> 'InvalidationEvent':
        return cls(
            type=InvalidationType(data["type"]),
            keys=data.get("keys", []),
            pattern=data.get("pattern"),
            version_key=data.get("version_key"),
            source=data.get("source", "unknown"),
            timestamp=datetime.fromisoformat(data["timestamp"]),
            correlation_id=data.get("correlation_id")
        )


# =============================================================================
# Invalidation Publisher (Use in application services)
# =============================================================================

class InvalidationPublisher:
    """
    Publishes cache invalidation events.
    
    Use this in your application services when data changes.
    """
    
    def __init__(self, kafka_producer, config: InvalidationConfig):
        self.kafka = kafka_producer
        self.config = config
    
    async def invalidate_keys(
        self,
        keys: List[str],
        source: str = "application",
        correlation_id: str = None
    ) -> None:
        """Invalidate specific cache keys."""
        if len(keys) > self.config.max_keys_per_invalidation:
            # Split into batches
            for i in range(0, len(keys), self.config.max_keys_per_invalidation):
                batch = keys[i:i + self.config.max_keys_per_invalidation]
                await self._publish(InvalidationEvent(
                    type=InvalidationType.DELETE,
                    keys=batch,
                    source=source,
                    correlation_id=correlation_id
                ))
        else:
            await self._publish(InvalidationEvent(
                type=InvalidationType.DELETE,
                keys=keys,
                source=source,
                correlation_id=correlation_id
            ))
    
    async def invalidate_pattern(
        self,
        pattern: str,
        source: str = "application"
    ) -> None:
        """Invalidate keys matching pattern (use sparingly!)."""
        if not self.config.enable_pattern_delete:
            raise ValueError("Pattern delete is disabled in config")
        
        await self._publish(InvalidationEvent(
            type=InvalidationType.PATTERN,
            pattern=pattern,
            source=source
        ))
    
    async def increment_version(
        self,
        version_key: str,
        source: str = "application"
    ) -> None:
        """Increment a version key for versioned invalidation."""
        await self._publish(InvalidationEvent(
            type=InvalidationType.VERSION,
            version_key=version_key,
            source=source
        ))
    
    async def _publish(self, event: InvalidationEvent) -> None:
        """Publish event to Kafka."""
        await self.kafka.send(
            self.config.event_topic,
            value=json.dumps(event.to_dict()).encode()
        )
        logger.debug(f"Published invalidation event: {event.type.value}")


# =============================================================================
# Invalidation Consumer (Background service)
# =============================================================================

class InvalidationConsumer:
    """
    Consumes and processes cache invalidation events.
    
    Run this as a separate service or background worker.
    """
    
    def __init__(
        self,
        kafka_consumer,
        redis_client,
        config: InvalidationConfig,
        metrics_client = None
    ):
        self.kafka = kafka_consumer
        self.redis = redis_client
        self.config = config
        self.metrics = metrics_client
        
        self._running = False
    
    async def start(self):
        """Start consuming invalidation events."""
        self._running = True
        logger.info("Invalidation consumer started")
        
        batch = []
        
        async for message in self.kafka:
            if not self._running:
                break
            
            try:
                event = InvalidationEvent.from_dict(
                    json.loads(message.value.decode())
                )
                batch.append(event)
                
                # Process in batches for efficiency
                # (a production consumer would also flush on a timer so a
                #  partial batch doesn't sit unprocessed and uncommitted)
                if len(batch) >= self.config.batch_size:
                    await self._process_batch(batch)
                    batch = []
                    await self.kafka.commit()
                    
            except Exception as e:
                logger.error(f"Failed to process event: {e}")
                # Individual event failure - continue processing
    
    async def stop(self):
        """Stop the consumer."""
        self._running = False
        logger.info("Invalidation consumer stopped")
    
    async def _process_batch(self, events: List[InvalidationEvent]) -> None:
        """Process a batch of invalidation events."""
        start_time = datetime.utcnow()
        
        # Group by type for efficient processing
        delete_keys: Set[str] = set()
        patterns: List[str] = []
        version_keys: List[str] = []
        
        for event in events:
            if event.type == InvalidationType.DELETE:
                delete_keys.update(event.keys)
            elif event.type == InvalidationType.PATTERN:
                patterns.append(event.pattern)
            elif event.type == InvalidationType.VERSION:
                version_keys.append(event.version_key)
        
        # Process deletions in bulk
        if delete_keys:
            await self._delete_keys(list(delete_keys))
        
        # Process patterns (one at a time - expensive!)
        for pattern in patterns:
            await self._delete_pattern(pattern)
        
        # Process version increments
        for version_key in version_keys:
            await self._increment_version(version_key)
        
        # Record metrics
        duration_ms = (datetime.utcnow() - start_time).total_seconds() * 1000
        
        if self.metrics:
            self.metrics.timing("invalidation.batch_duration_ms", duration_ms)
            self.metrics.increment("invalidation.keys_deleted", len(delete_keys))
            self.metrics.increment("invalidation.events_processed", len(events))
        
        if duration_ms > self.config.slow_invalidation_threshold_ms:
            logger.warning(f"Slow invalidation batch: {duration_ms:.2f}ms")
    
    async def _delete_keys(self, keys: List[str]) -> int:
        """Delete multiple keys efficiently."""
        if not keys:
            return 0
        
        # Use pipeline for bulk delete
        pipe = self.redis.pipeline()
        for key in keys:
            pipe.delete(key)
        results = await pipe.execute()
        
        deleted = sum(1 for r in results if r)
        logger.debug(f"Deleted {deleted}/{len(keys)} keys")
        return deleted
    
    async def _delete_pattern(self, pattern: str) -> int:
        """Delete keys matching pattern (use sparingly!)."""
        # SCAN to find matching keys (non-blocking)
        deleted = 0
        cursor = 0
        
        while True:
            cursor, keys = await self.redis.scan(
                cursor=cursor,
                match=pattern,
                count=100
            )
            
            if keys:
                await self.redis.delete(*keys)
                deleted += len(keys)
            
            if cursor == 0:
                break
        
        logger.info(f"Pattern delete '{pattern}' removed {deleted} keys")
        return deleted
    
    async def _increment_version(self, version_key: str) -> int:
        """Increment a version key."""
        new_version = await self.redis.incr(version_key)
        logger.debug(f"Incremented {version_key} to {new_version}")
        return new_version


# =============================================================================
# Helper: Invalidation Decorator
# =============================================================================

def invalidate_cache(*keys: str, publisher: InvalidationPublisher = None):
    """
    Decorator that invalidates cache after function execution.
    
    Usage:
        @invalidate_cache("product:{product_id}", "category:{category_id}")
        async def update_product(product_id: str, category_id: str, data: dict):
            ...
    """
    def decorator(func):
        async def wrapper(*args, **kwargs):
            result = await func(*args, **kwargs)
            
            # Format keys with function arguments
            formatted_keys = []
            for key_template in keys:
                try:
                    formatted_key = key_template.format(**kwargs)
                    formatted_keys.append(formatted_key)
                except KeyError:
                    logger.warning(f"Could not format key template: {key_template}")
            
            if formatted_keys and publisher:
                await publisher.invalidate_keys(
                    formatted_keys,
                    source=func.__name__
                )
            
            return result
        return wrapper
    return decorator


# =============================================================================
# Usage Example
# =============================================================================

# In your product service:
class ProductService:
    def __init__(self, db, cache, invalidation: InvalidationPublisher):
        self.db = db
        self.cache = cache
        self.invalidation = invalidation
    
    async def update_product(self, product_id: str, data: dict) -> dict:
        # Get old product for category info
        old_product = await self.db.fetch_one(
            "SELECT category_id FROM products WHERE id = $1",
            product_id
        )
        
        # Update database
        product = await self.db.fetch_one(
            "UPDATE products SET ... WHERE id = $1 RETURNING *",
            product_id
        )
        
        # Invalidate all related caches
        await self.invalidation.invalidate_keys([
            f"product:{product_id}",
            f"product_page:{product_id}",
            f"category:{old_product['category_id']}",
        ], source="ProductService.update_product")
        
        return dict(product)
    
    async def bulk_import(self, products: List[dict]) -> int:
        """Import many products - use version increment."""
        # Import to database
        count = await self._bulk_insert(products)
        
        # Invalidate all product caches via version
        await self.invalidation.increment_version("product_cache_version")
        
        return count

Chapter 5: Edge Cases and Error Handling

5.1 Edge Case 1: Invalidation Event Lost

SCENARIO: Kafka loses the invalidation event

Timeline:
  T0: Product price updated in DB ($100 → $79)
  T1: Invalidation event published to Kafka
  T2: Kafka broker fails, event lost!
  T3: Cache still has old price ($100)
  T4: Users see wrong price until TTL expires

SOLUTION: Safety net TTL

Even with event-driven invalidation, always set a TTL:
  - Events handle 99% of invalidations (fast)
  - TTL handles the 1% where events fail (safety)

Typical staleness: the event latency (~milliseconds)
Maximum staleness: the TTL (when an event is lost)

5.2 Edge Case 2: Thundering Herd After Invalidation

SCENARIO: Popular item cache invalidated during peak traffic

Timeline:
  T0: Flash sale starts, price updated
  T1: Cache invalidated for product:123
  T2: 10,000 concurrent requests for product:123
  T3: All 10,000 hit database simultaneously!
  T4: Database overwhelmed

SOLUTION: Request coalescing (covered in Day 3)

For now, awareness is key:
  - Invalidation can trigger thundering herd
  - Popular items need extra protection
  - Consider "soft" invalidation (mark stale, refresh async)

5.3 Edge Case 3: Cache and Database Inconsistency

SCENARIO: Invalidation happens before DB transaction commits

Dangerous order:
  T0: Begin transaction
  T1: Update product in DB (not committed)
  T2: Invalidate cache
  T3: Another request reads DB (sees OLD value)
  T4: Populates cache with OLD value
  T5: Transaction commits (DB now has NEW value)
  T6: Cache has OLD value, DB has NEW value!

SOLUTION: Invalidate AFTER commit

async def update_product(product_id: str, data: dict):
    async with db.transaction() as tx:
        product = await tx.fetch_one(
            "UPDATE products SET ... WHERE id = $1 RETURNING *",
            product_id
        )
    # Transaction committed!
    
    # NOW invalidate cache
    await cache.delete(f"product:{product_id}")
    
    return dict(product)

5.4 Edge Case 4: Partial Invalidation Failure

SCENARIO: Need to invalidate 5 keys, 2 fail

Keys to invalidate:
  - product:123      ✓ Success
  - product_page:123 ✓ Success
  - category:456     ✗ Redis timeout
  - search:results   ✗ Redis timeout
  - deals:page       ✓ Success

SOLUTION: Retry with idempotency

Invalidation is idempotent - safe to retry!
DEL on a non-existent key = success

import asyncio
from typing import List


async def invalidate_with_retry(keys: List[str], max_retries: int = 3):
    remaining = set(keys)
    
    for attempt in range(max_retries):
        failed = []
        
        for key in remaining:
            try:
                await redis.delete(key)
            except Exception as e:
                failed.append(key)
                logger.warning(f"Invalidation failed for {key}: {e}")
        
        remaining = set(failed)
        
        if not remaining:
            return  # All succeeded
        
        # Exponential backoff
        await asyncio.sleep(0.1 * (2 ** attempt))
    
    # Some still failed - alert!
    logger.error(f"Failed to invalidate keys after {max_retries} attempts: {remaining}")
    # Consider: DLQ, manual intervention alert

5.5 Error Handling Matrix

Error                 Impact                       Handling                           Prevention
Event lost            Stale data until TTL         TTL safety net                     Kafka replication
Redis timeout         Invalidation delayed         Retry with backoff                 Redis cluster
Pattern delete slow   Blocks other invalidations   Avoid patterns, batch instead      Don't use patterns
Version key lost      Wrong version used           Rebuild from DB                    Persist versions
Consumer crash        Events pile up               Auto-restart, resume from offset   Health checks

Part III: Real-World Application

Chapter 7: How Big Tech Does It

7.1 Case Study: Facebook — Invalidation at Scale

FACEBOOK'S CACHE INVALIDATION

Scale:
  - Billions of cache operations per second
  - Distributed across multiple datacenters
  - Consistency across regions matters

System: McRouter + TAO

Architecture:

  Write Request
       │
       ▼
  ┌─────────────────┐
  │  TAO (Graph     │
  │   Database)     │
  │                 │
  │  Write to MySQL │
  │  + Invalidate   │
  └────────┬────────┘
           │
           │ Invalidation
           ▼
  ┌─────────────────┐
  │   McRouter      │
  │  (Memcached     │
  │   Router)       │
  │                 │
  │ Broadcasts to   │
  │ all regions     │
  └─────────────────┘


Key Innovation: LEASE-BASED INVALIDATION

Problem: Race between invalidation and cache fill

Solution: Leases
  1. On cache miss, get a "lease" (token)
  2. Load from database
  3. Set cache with lease token
  4. If lease was invalidated, set fails!

This prevents stale data from being cached after invalidation.
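
A rough sketch of the lease idea using Redis primitives (an approximation of the concept, not Facebook's actual memcached lease protocol; the Lua script and uuid tokens are assumptions):

import json
import uuid

# Only set the value if the lease token we were issued is still current
LEASE_FILL_SCRIPT = """
if redis.call('GET', KEYS[2]) == ARGV[1] then
    redis.call('SET', KEYS[1], ARGV[2], 'EX', ARGV[3])
    return 1
end
return 0
"""


async def get_with_lease(product_id: str, ttl: int = 300) -> dict:
    key = f"product:{product_id}"
    cached = await redis.get(key)
    if cached:
        return json.loads(cached)

    # Miss: take a lease before loading from the database
    token = str(uuid.uuid4())
    await redis.set(f"lease:{key}", token, ex=10)

    row = await db.fetch_one("SELECT * FROM products WHERE id = $1", product_id)

    # The fill only succeeds if our lease is still valid
    await redis.eval(LEASE_FILL_SCRIPT, 2, key, f"lease:{key}", token, json.dumps(dict(row)), ttl)
    return dict(row)


async def invalidate_with_lease(product_id: str) -> None:
    key = f"product:{product_id}"
    # Deleting the lease turns any in-flight fill into a no-op
    await redis.delete(key, f"lease:{key}")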

7.2 Case Study: Netflix — Event-Driven Invalidation

NETFLIX CACHE INVALIDATION

System: EVCache + Event-Driven

Architecture:

  Content Update
       │
       ▼
  ┌─────────────────┐
  │  Content        │
  │  Service        │
  │                 │
  │  Update DB      │
  │  Publish Event  │
  └────────┬────────┘
           │
           │ Kafka Event
           ▼
  ┌─────────────────┐
  │  Cache          │
  │  Invalidator    │
  │                 │
  │  Subscribe to   │
  │  content events │
  │                 │
  │  Invalidate     │
  │  EVCache        │
  └─────────────────┘


Key Features:

1. ZONE-AWARE INVALIDATION
   Each availability zone has its own cache
   Invalidation sent to ALL zones
   
2. TTL AS SAFETY NET
   All entries have 24-hour TTL
   Events handle 99.9% of invalidations
   TTL catches the rest

3. VERSIONED CONTENT
   Content has version numbers
   Cache key includes version
   New version = automatic miss
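
A simplified sketch of fanning an invalidation out to every zone (the per-zone client map is an assumption; EVCache's real client handles this replication internally):

import asyncio
import logging

logger = logging.getLogger(__name__)

# zone name -> async cache client (Redis/EVCache-style), populated at startup
ZONE_CACHES: dict = {}


async def invalidate_all_zones(key: str) -> None:
    """Send the delete to every zone; a zone that fails falls back to its TTL."""
    results = await asyncio.gather(
        *(client.delete(key) for client in ZONE_CACHES.values()),
        return_exceptions=True,
    )
    for zone, result in zip(ZONE_CACHES, results):
        if isinstance(result, Exception):
            logger.warning(f"Invalidation failed in {zone}: {result}")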

7.3 Case Study: Amazon — Granular Invalidation

AMAZON PRODUCT CACHE INVALIDATION

Challenge:
  - Different parts of product change at different rates
  - Inventory: every purchase
  - Price: hourly/daily
  - Description: weekly/monthly
  
Solution: DECOMPOSED CACHING

Instead of one cache entry per product:
  product:123:details      TTL: 24 hours
  product:123:price        TTL: 5 min + events
  product:123:inventory    TTL: 30 sec + events
  product:123:reviews      TTL: 1 hour

Benefits:
  - Inventory change doesn't invalidate description
  - Price change doesn't invalidate reviews
  - Each piece has appropriate freshness

Implementation:
  async def get_product_page(product_id: str) -> dict:
      # Parallel fetch of decomposed cache
      details, price, inventory, reviews = await asyncio.gather(
          cache.get(f"product:{product_id}:details"),
          cache.get(f"product:{product_id}:price"),
          cache.get(f"product:{product_id}:inventory"),
          cache.get(f"product:{product_id}:reviews"),
      )
      
      # Load missing pieces from DB
      # ... 
      
      return {
          "details": details,
          "price": price,
          "inventory": inventory,
          "reviews": reviews
      }
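
The invalidation side is where decomposition pays off: a change to one attribute only touches that attribute's key (a sketch using the keys above):

async def on_price_change(product_id: str) -> None:
    # Details, reviews and inventory caches are left untouched
    await redis.delete(f"product:{product_id}:price")


async def on_inventory_change(product_id: str) -> None:
    await redis.delete(f"product:{product_id}:inventory")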

7.4 Summary: Industry Patterns

Company    Strategy             Key Innovation
Facebook   Lease-based          Prevents stale fills
Netflix    Event-driven + TTL   Zone-aware
Amazon     Decomposed caching   Granular invalidation
Twitter    Write-through        Consistency for timeline
Uber       Short TTL            Real-time location

Chapter 8: Common Mistakes to Avoid

8.1 Mistake 1: Only Invalidating the Direct Cache

❌ WRONG: Only invalidate direct cache

async def update_product(product_id: str, data: dict):
    await db.update(product_id, data)
    await cache.delete(f"product:{product_id}")
    # FORGOT: category listing, search results, homepage!


✅ CORRECT: Invalidate all related caches

async def update_product(product_id: str, data: dict):
    # Get current product for related info
    old_product = await db.get(product_id)
    
    # Update database
    await db.update(product_id, data)
    
    # Invalidate ALL related caches
    await invalidation.invalidate_keys([
        f"product:{product_id}",
        f"product_page:{product_id}",
        f"category:{old_product['category_id']}",
        f"brand:{old_product['brand_id']}",
        # If price changed, also deals page
        *(["deals:page"] if data.get('price') else []),
    ])

8.2 Mistake 2: Invalidating Before Commit

❌ WRONG: Invalidate inside transaction

async def update_product(product_id: str, data: dict):
    async with db.transaction() as tx:
        await tx.update(product_id, data)
        await cache.delete(f"product:{product_id}")  # BAD!
        # If transaction rolls back, cache is already invalidated
        # Another request might cache OLD value


✅ CORRECT: Invalidate after commit

async def update_product(product_id: str, data: dict):
    async with db.transaction() as tx:
        await tx.update(product_id, data)
    
    # Transaction committed - NOW invalidate
    await cache.delete(f"product:{product_id}")

8.3 Mistake 3: Using Pattern Delete at Scale

❌ WRONG: Pattern delete in hot path

async def update_category(category_id: str, data: dict):
    await db.update(category_id, data)
    
    # This scans ALL keys in Redis!
    await redis.delete_pattern(f"product:*:category:{category_id}")
    # At scale: millions of keys scanned, Redis blocked


✅ CORRECT: Track keys explicitly

# Maintain a set of keys per category
async def cache_product(product_id: str, category_id: str, data: dict):
    await cache.set(f"product:{product_id}", data)
    # Track this product in category's key set
    await redis.sadd(f"category_products:{category_id}", f"product:{product_id}")

async def invalidate_category(category_id: str):
    # Get tracked keys
    product_keys = await redis.smembers(f"category_products:{category_id}")
    # Delete specific keys (fast!)
    if product_keys:
        await redis.delete(*product_keys)
    await redis.delete(f"category_products:{category_id}")

8.4 Mistake 4: No TTL Safety Net

❌ WRONG: Event-driven only, no TTL

async def cache_product(product_id: str, data: dict):
    # No TTL! Relies entirely on events
    await redis.set(f"product:{product_id}", json.dumps(data))

# If event is lost, data is stale FOREVER


✅ CORRECT: Always have TTL safety net

async def cache_product(product_id: str, data: dict):
    # TTL as safety net - even if event lost, max staleness is 5 min
    await redis.setex(
        f"product:{product_id}",
        300,  # 5 minute safety TTL
        json.dumps(data)
    )

8.5 Mistake Checklist

Before deploying invalidation strategy:

  • Related caches identified — All caches affected by each data change
  • Invalidation after commit — Never inside transaction
  • TTL safety net — Even with events, have TTL fallback
  • No pattern deletes — Use explicit key tracking instead
  • Retry logic — Invalidation should retry on failure
  • Monitoring — Track invalidation latency and failures
  • Documented dependencies — Map data → cache relationships

Part IV: Interview Preparation

Chapter 9: Interview Tips and Phrases

9.1 When to Discuss Invalidation

Bring up invalidation when:

  • Designing any caching layer
  • Data freshness is mentioned as a requirement
  • Interviewer asks "what if data changes?"
  • System involves prices, inventory, or real-time data

9.2 Key Phrases to Use

INTRODUCING INVALIDATION:

"For caching, the key question is invalidation. We need to 
ensure users see fresh data while still benefiting from the 
cache. Let me walk through our options."


EXPLAINING TTL:

"The simplest approach is TTL-based expiration. With a 5-minute 
TTL, we accept up to 5 minutes of staleness in exchange for 
simplicity. This works well for data like product descriptions 
that rarely change."


EXPLAINING EVENT-DRIVEN:

"For price changes during a flash sale, TTL isn't fast enough. 
I'd use event-driven invalidation: when the price updates, we 
publish an event, and a consumer invalidates the cache within 
milliseconds. Combined with a short TTL as a safety net, we 
get near real-time freshness."


DISCUSSING TRADE-OFFS:

"The trade-off is freshness versus complexity. TTL is simple 
but stale. Event-driven is fresh but requires infrastructure. 
For this system, I'd use event-driven for prices and inventory, 
but TTL-only for product descriptions."


HANDLING FAILURE QUESTIONS:

"If the invalidation event is lost—maybe Kafka has an issue—
the TTL safety net kicks in. Data might be stale for up to 5 
minutes instead of milliseconds, but it won't be stale forever. 
That's why I always combine events with TTL."

9.3 Questions to Ask Interviewer

  • "How fresh does this data need to be? Seconds, minutes, hours?"
  • "What's the acceptable user experience if they see slightly stale data?"
  • "How frequently does this data change?"
  • "Is there existing event infrastructure we can leverage?"

9.4 Common Follow-up Questions

Question: "What if you need to invalidate millions of keys?"
Answer:   "I'd use versioned keys with a global version. Incrementing one version key effectively invalidates everything. Old entries expire via TTL."

Question: "How do you handle cache and DB getting out of sync?"
Answer:   "Always invalidate after transaction commits. Use TTL as safety net. If we detect inconsistency, we can force refresh by bumping the version."

Question: "What about thundering herd after invalidation?"
Answer:   "That's a risk. For popular items, I'd use request coalescing—multiple requests for the same key share a single database fetch. Also consider 'soft' invalidation with async refresh."

Question: "How do you test invalidation?"
Answer:   "Integration tests that verify cache state after updates. Load tests that simulate concurrent updates. Chaos testing that kills the invalidation consumer."

Chapter 10: Practice Problems

Problem 1: E-commerce Price Invalidation

Setup: Design cache invalidation for an e-commerce site with flash sales. When a flash sale starts, 10,000 products have price changes that must be visible to users within 1 second.

Requirements:

  • 10,000 products in a flash sale
  • Price must update within 1 second
  • 100,000 users may be viewing products during sale start
  • Can't overwhelm the database

Questions:

  1. How would you invalidate 10,000 product caches quickly?
  2. How do you prevent thundering herd when everyone reloads?
  3. What's your fallback if invalidation fails?

Hints:

  • Can you update cache instead of invalidate?
  • Consider bulk operations
  • Think about cache warming before the sale

Approach: Pre-compute + Write-Update + Fallback

from decimal import Decimal
from typing import Dict


async def start_flash_sale(sale_id: str, product_prices: Dict[str, Decimal]):
    """
    Start flash sale with immediate price visibility.
    """
    # BEFORE SALE: Pre-warm cache with new prices
    # Scheduled 1 second before sale start
    await pre_warm_flash_sale_prices(sale_id, product_prices)
    
    # AT SALE START: Atomic version flip
    await redis.set("active_flash_sale", sale_id)
    
    # Products check for active sale and use sale price


async def pre_warm_flash_sale_prices(sale_id: str, prices: Dict):
    """Pre-warm cache with flash sale prices."""
    pipe = redis.pipeline()
    
    for product_id, price in prices.items():
        # Store sale price separately
        pipe.setex(
            f"flash_price:{sale_id}:{product_id}",
            3600,  # 1 hour (sale duration)
            str(price)
        )
    
    await pipe.execute()  # Bulk write - fast!


async def get_product_price(product_id: str) -> Decimal:
    """Get price, checking for active flash sale."""
    # Check for active flash sale
    sale_id = await redis.get("active_flash_sale")
    
    if sale_id:
        sale_price = await redis.get(f"flash_price:{sale_id}:{product_id}")
        if sale_price:
            return Decimal(sale_price)
    
    # No sale or product not in sale - use regular price
    return await get_cached_price(product_id)

Thundering Herd Prevention:

  • Pre-warm cache BEFORE sale starts
  • No invalidation needed at sale start
  • Just flip a single flag

Fallback:

  • If pre-warm fails, fall back to database
  • Short TTL (30 sec) ensures fresh prices eventually
  • Can retry pre-warm in background
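
The read path above leans on a get_cached_price helper that isn't shown. A minimal sketch of it, using the 30-second safety TTL from the fallback list (db.fetch_price is a hypothetical query helper):

async def get_cached_price(product_id: str) -> Decimal:
    """Regular (non-sale) price, cached with a short TTL as the freshness safety net."""
    key = f"price:{product_id}"

    cached = await redis.get(key)
    if cached is not None:
        return Decimal(cached)

    # Cache miss (or pre-warm failed): fall back to the database
    price = await db.fetch_price(product_id)   # hypothetical query helper
    await redis.setex(key, 30, str(price))     # 30-second TTL keeps prices fresh
    return price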

Problem 2: Social Media Feed Invalidation

Setup: Design invalidation for a social media feed cache. When a user posts, all of their followers should see the post in their feeds. A user may have anywhere from one follower to 10 million.

Requirements:

  • Post visible in followers' feeds within 30 seconds
  • Can't invalidate 10 million cache entries synchronously
  • Feed should always load fast (<100ms)

Questions:

  1. How do you handle invalidation for users with millions of followers?
  2. What's different about invalidating a "celebrity" vs regular user?
  3. How do you balance freshness and performance?

Hints:

  • Consider a hybrid push/pull model
  • Not all followers are online at any given time
  • Think about which feeds actually need immediate updates

Approach: Hybrid Push/Pull with Activity-Based Invalidation

from typing import List

# Follower tiers
CELEBRITY_THRESHOLD = 10_000  # followers

async def on_user_post(user_id: str, post: dict):
    """Handle new post - different strategies by follower count."""
    
    follower_count = await get_follower_count(user_id)
    
    if follower_count < CELEBRITY_THRESHOLD:
        # Regular user: Push to all followers
        await push_to_follower_feeds(user_id, post)
    else:
        # Celebrity: Don't push, pull on read
        await mark_celebrity_post(user_id, post)


async def push_to_follower_feeds(user_id: str, post: dict):
    """Push post to all follower feed caches."""
    followers = await get_followers(user_id)
    
    # Only push to ACTIVE followers (online recently)
    active_followers = [f for f in followers if await is_active(f)]
    
    # Invalidate their feed caches
    keys = [f"feed:{f}" for f in active_followers]
    await invalidation.invalidate_keys(keys)
    
    # Inactive followers will get fresh feed on next login


async def get_feed(user_id: str) -> List[dict]:
    """Get feed, merging cached + celebrity posts."""
    
    # Get cached feed (pushed posts from regular users)
    cached_feed = await cache.get(f"feed:{user_id}")
    
    # Get celebrity posts (pulled on read)
    celebrity_posts = await get_followed_celebrity_posts(user_id)
    
    # Merge and sort
    all_posts = (cached_feed or []) + celebrity_posts
    all_posts.sort(key=lambda p: p['timestamp'], reverse=True)
    
    return all_posts[:50]  # Top 50 posts
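
The push path above relies on an is_active helper that isn't defined. One minimal way to implement it, assuming the app writes a last-seen marker to Redis on each request (the per-follower checks above are shown one at a time for clarity; in practice you'd batch them with MGET or asyncio.gather):

import time

ACTIVE_WINDOW_SECONDS = 15 * 60  # "online recently" = seen within the last 15 minutes

async def mark_seen(user_id: str):
    """Call on every authenticated request."""
    await redis.set(f"last_seen:{user_id}", int(time.time()), ex=ACTIVE_WINDOW_SECONDS)

async def is_active(user_id: str) -> bool:
    # Key exists only while the user has been seen within the window (TTL handles expiry)
    return await redis.exists(f"last_seen:{user_id}") == 1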

Key Insights:

  • Don't invalidate inactive users' caches
  • Celebrity posts pulled on read (no push amplification)
  • Regular users' posts pushed to active followers
  • Hybrid approach scales to any follower count

Problem 3: Multi-Region Cache Invalidation

Setup: Your application runs in 3 regions (US, EU, Asia). Each region has its own Redis cluster. When data changes, all regions must invalidate within 5 seconds.

Requirements:

  • Write happens in one region
  • All 3 regions must invalidate
  • Network between regions has ~100ms latency
  • If one region's Redis is down, others should still invalidate

Questions:

  1. How do you propagate invalidation across regions?
  2. How do you handle a region being temporarily unreachable?
  3. How do you ensure consistency across regions?

Hints:

  • Consider async event propagation
  • Each region needs to process invalidations independently
  • Think about conflict resolution

Approach: Event Bus with Regional Consumers

# Global event bus (Kafka with multi-region replication)
import uuid
from datetime import datetime, timezone
from typing import List

from cachetools import TTLCache  # bounded in-memory cache with per-entry expiry

async def invalidate_global(keys: List[str], source_region: str):
    """Publish invalidation to all regions."""
    
    event = {
        "type": "invalidate",
        "keys": keys,
        "source_region": source_region,
        "timestamp": datetime.utcnow().isoformat(),
        "id": str(uuid.uuid4())
    }
    
    # Publish to global topic (replicated to all regions)
    await kafka.send("global-cache-invalidation", event)


# Each region runs its own consumer
class RegionalInvalidationConsumer:
    def __init__(self, region: str, redis_client):
        self.region = region
        self.redis = redis_client
        self.processed_ids = TTLCache(maxsize=100_000, ttl=300)  # dedup recently seen event IDs
    
    async def process(self, event: dict):
        # Deduplication (event might arrive multiple times)
        if event["id"] in self.processed_ids:
            return
        self.processed_ids[event["id"]] = True
        
        # Invalidate in this region
        await self.redis.delete(*event["keys"])
        
        logger.info(f"Region {self.region} invalidated {len(event['keys'])} keys")


# Regional Redis failure handling
from redis.exceptions import RedisError

async def invalidate_with_fallback(redis_client, keys: List[str]):
    try:
        await redis_client.delete(*keys)
    except RedisError as e:
        logger.error(f"Regional Redis down: {e}")
        # Store invalidation for retry when Redis recovers
        await store_pending_invalidation(keys)
        # Alert ops team
        await alert("Regional cache invalidation failed")
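
The fallback above stores failed invalidations for later retry but leaves store_pending_invalidation undefined. A minimal sketch under the assumption of a single consumer process per region, keeping the pending keys in process memory (persist them to disk or another store if you need durability across restarts):

import asyncio

from typing import List
from redis.exceptions import RedisError

_pending_keys: set = set()

async def store_pending_invalidation(keys: List[str]):
    """Remember keys we could not delete while the regional Redis was down."""
    _pending_keys.update(keys)

async def retry_pending_invalidations(redis_client, interval_seconds: int = 10):
    """Background task: flush pending deletes once Redis is reachable again."""
    while True:
        await asyncio.sleep(interval_seconds)
        if not _pending_keys:
            continue
        batch = list(_pending_keys)
        try:
            await redis_client.delete(*batch)
        except RedisError:
            continue  # still down - keep the keys and try again on the next tick
        _pending_keys.difference_update(batch)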

Architecture:

Write in US-East
      │
      ▼
Global Kafka Topic (replicated to all regions)
      │
      ├──────────────────┬──────────────────┐
      │                  │                  │
      ▼                  ▼                  ▼
US Consumer         EU Consumer        Asia Consumer
      │                  │                  │
      ▼                  ▼                  ▼
US Redis            EU Redis           Asia Redis

Key Insights:

  • Global event bus handles cross-region propagation
  • Each region processes independently
  • Deduplication prevents double-processing
  • Pending queue handles temporary Redis outage
  • 5-second target is realistic: with ~100ms inter-region latency, Kafka replication plus consumption leaves plenty of headroom

Chapter 11: Mock Interview Dialogue

Scenario: Design Invalidation for a News Site

Interviewer: "We're building a news site. Articles are written by journalists, edited, then published. Once published, they might be updated with corrections. How would you handle cache invalidation?"

You: "Good question. Let me understand the data flow first.

So an article goes through: draft → editing → published → possibly updated. Users only see published articles. How frequently are published articles updated?"

Interviewer: "Breaking news might get updates every few minutes. Regular articles maybe once a day if there's a correction."

You: "Got it. And what's the traffic pattern? I assume there's a power law—some articles get millions of views, most get very few?"

Interviewer: "Exactly. Top 100 articles might get 90% of traffic."

You: "Perfect. Here's my approach:

For invalidation strategy, I'd use event-driven invalidation with TTL safety net.

When an article is updated:

  1. Editor saves changes to database
  2. CMS publishes an 'article.updated' event to Kafka
  3. Cache invalidation consumer receives event
  4. Deletes the article cache entry

For the TTL, I'd use 5 minutes. Most articles don't change often, so a 5-minute stale window is acceptable. But the event-driven invalidation means most updates are visible within seconds."

Interviewer: "What about the homepage that shows the latest articles?"

You: "The homepage is different. It's not one article—it's a computed view of 'top 10 articles right now.' I'd handle it differently:

For the homepage, I'd use short TTL with background refresh. TTL of 60 seconds, and a background job that refreshes the cache every 30 seconds. This means:

  • Homepage is always served from cache (fast)
  • Never more than 30 seconds stale
  • No thundering herd because background job refreshes it
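
A rough sketch of that background refresher, assuming the cache client from earlier and a hypothetical fetch_top_articles query helper:

import asyncio

async def refresh_homepage_cache():
    """Background job: recompute the homepage every 30 seconds."""
    while True:
        try:
            top_articles = await fetch_top_articles(limit=10)   # hypothetical query helper
            # TTL (60s) is longer than the refresh interval (30s), so readers never hit a miss
            await cache.set("homepage:top10", top_articles, ttl=60)
        except Exception:
            logger.exception("Homepage refresh failed; the stale copy keeps serving")
        await asyncio.sleep(30)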

For the 'breaking news' banner specifically, I might use an even shorter TTL of 10 seconds, or skip caching entirely since it's a single small query."

Interviewer: "What if an article is published with a typo and the editor fixes it immediately? The cache has the typo for 5 minutes?"

You: "That's where event-driven invalidation helps. The flow is:

  1. Article published with typo
  2. Cache populated with typo
  3. Editor notices, fixes typo
  4. 'article.updated' event published
  5. Cache invalidated within milliseconds
  6. Next reader gets corrected version

The 5-minute TTL is just a safety net in case an event is lost. In normal operation, the event invalidates the cache almost immediately.

To make it even faster, I could update the cache directly instead of invalidating:

async def on_article_updated(event):
    article_id = event['article_id']
    
    # Option A: Invalidate (requires DB hit on next read)
    await cache.delete(f"article:{article_id}")
    
    # Option B: Update (no DB hit, faster for readers)
    article = event['data']  # Include full article in event
    await cache.set(f"article:{article_id}", article, ttl=300)

I'd use Option B for breaking news articles where every second counts."

Interviewer: "How do you handle a situation where the CMS is down but you need to invalidate cache?"

You: "That's an important failure scenario. A few options:

  1. TTL as fallback — Even without CMS, cache expires in 5 minutes. Not ideal but bounded staleness.

  2. Manual invalidation tool — Build an admin endpoint that ops can call directly:

    POST /admin/invalidate
    {"keys": ["article:12345"]}
    
  3. Version-based emergency invalidation — If we need to invalidate ALL articles (major bug, security issue):

    # Increment global version
    await redis.incr("article_cache_version")
    # All article cache keys now miss
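    # (Assumed read path) every article key embeds the version, so bumping it
    # orphans all existing entries:
    #   version = await redis.get("article_cache_version")
    #   key = f"article:v{version}:{article_id}"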
    

The version approach is like a nuclear option—use it rarely, but it's good to have."


Summary

DAY 2 KEY TAKEAWAYS

CORE PROBLEM:
• Cache invalidation is hard because cache doesn't know when data changes
• Balance freshness, simplicity, and performance
• Different data needs different strategies

INVALIDATION STRATEGIES:

TTL-Based:
  • Simplest approach
  • Data stale for up to TTL duration
  • Use for: Static content, low-change data

Event-Driven:
  • Near real-time freshness
  • Requires event infrastructure
  • Use for: Prices, inventory, critical data

Versioned Keys:
  • Increment version to invalidate
  • Good for bulk invalidation
  • Use for: Deployments, bulk imports

Hybrid (TTL + Events):
  • Best of both worlds
  • Events for speed, TTL for safety
  • Use for: Most production systems

KEY DECISIONS:
• Write-invalidate (delete) safer than write-update
• Always invalidate AFTER transaction commits
• Always have TTL safety net with event-driven

COMMON MISTAKES:
• Forgetting related caches
• Invalidating inside transaction
• Pattern delete at scale
• No TTL safety net

DEFAULT CHOICE:
• Event-driven + 5 minute TTL safety net
• Write-invalidate (delete on change)
• Decomposed caching for different freshness needs

📚 Further Reading

Engineering Blogs

  • Facebook TAO: "TAO: Facebook's Distributed Data Store for the Social Graph"
  • Netflix EVCache: "Caching for a Global Netflix"
  • Uber Schemaless: "Designing Schemaless, Uber Engineering's Scalable Datastore"

Books

  • "Designing Data-Intensive Applications" by Martin Kleppmann — Chapter 5

End of Day 2: Invalidation Strategies

Tomorrow: Day 3 — Thundering Herd. We've learned how to invalidate caches. Tomorrow, we tackle what happens when thousands of requests simultaneously hit a cache miss. You'll learn locking, request coalescing, and probabilistic early expiration—the tools to protect your database from cache stampedes.