Himanshu Kukreja

Week 2 — Day 2: Idempotency in Practice

System Design Mastery Series


Preface

Yesterday, we built a payment service with proper timeouts. We solved the "waiting forever" problem. But we left a dangerous hole:

User clicks "Pay $99"
  → Your server calls bank API
  → Bank API times out after 3.5 seconds
  → You return: "Payment processing timed out"

Question: Did the bank charge the user or not?

Answer: You don't know.

The bank might have:

  1. Never received your request (network died before it arrived)
  2. Received it, processed it, but the response got lost
  3. Received it, still processing, will complete in 1 second

If you tell the user "try again" and they do, you might charge them twice.

This is the problem idempotency solves.

Today, we make our payment system safe to retry. No matter how many times a user clicks "Pay," they'll be charged exactly once.


Part I: Foundations

Chapter 1: What Is Idempotency?

1.1 The Simple Definition

An operation is idempotent if doing it multiple times has the same effect as doing it once.

Idempotent operations (safe to repeat):
  
  "Set the thermostat to 72°F"
    → Do it once: temperature is 72°F
    → Do it twice: temperature is still 72°F
    → Do it 100 times: temperature is still 72°F ✓
  
  "Delete file X"
    → Do it once: file is deleted
    → Do it twice: file is still deleted (no-op second time)
    → No harm in repeating ✓

Non-idempotent operations (dangerous to repeat):

  "Add $10 to account"
    → Do it once: +$10
    → Do it twice: +$20 (double the intended effect!)
    → Repeating causes harm ✗
  
  "Send email to user"
    → Do it once: 1 email
    → Do it twice: 2 emails (spam!)
    → Repeating causes harm ✗
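
The same distinction shows up directly in code. Here's a tiny Python sketch (illustrative only): assignment has "set" semantics and is idempotent; incrementing is not.

# Idempotent: "set" semantics - repeating has no extra effect
thermostat = {'target_f': 68}

def set_target(temp_f: int):
    thermostat['target_f'] = temp_f   # same end state no matter how often it runs

# Non-idempotent: "add" semantics - every repeat changes state again
account = {'balance': 0}

def add_funds(amount: int):
    account['balance'] += amount      # running it twice doubles the effect

set_target(72); set_target(72)   # target_f is 72 either way
add_funds(10); add_funds(10)     # balance is 20, not 10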

1.2 The Everyday Analogy: The Light Switch

Think about pressing a light switch:

Toggle switch (non-idempotent):
  Press once: Light turns ON
  Press again: Light turns OFF
  Press again: Light turns ON
  
  Each press changes the state. Dangerous if you're not sure
  how many times you pressed!

ON/OFF buttons (idempotent):
  Press ON: Light is ON
  Press ON again: Light is still ON
  Press ON 10 times: Light is still ON
  
  No matter how many times you press, you get the same result.

We want our payment system to behave like ON/OFF buttons, not toggle switches.

1.3 Why Distributed Systems Need Idempotency

In a perfect world:

  1. Client sends request
  2. Server processes it
  3. Server sends response
  4. Client receives response

But networks fail:

Scenario 1: Request lost
  Client → [request lost] → Server
  Server never sees request, nothing happens.
  Client times out. Safe to retry. ✓

Scenario 2: Response lost
  Client → Request → Server
  Server processes successfully
  Client ← [response lost] ← Server
  Client times out.
  
  From client's view: Same as Scenario 1!
  But server already processed it!
  
  If client retries: DOUBLE PROCESSING ✗

The client cannot tell the difference between "request lost" and "response lost."

This is fundamental. No amount of timeout tuning fixes it. The only solution is making operations safe to retry.

1.4 HTTP Methods and Idempotency

HTTP defines some methods as idempotent by design:

| Method | Idempotent? | Why |
|--------|-------------|-----|
| GET | Yes | Reading doesn't change state |
| PUT | Yes | "Set resource to this value" is repeatable |
| DELETE | Yes | "Delete resource" — deleting twice = still deleted |
| HEAD | Yes | Same as GET, no body |
| OPTIONS | Yes | Just asking about capabilities |
| POST | No | "Create new resource" — creates duplicates! |
| PATCH | Depends | Could be "add $10" (no) or "set price to $10" (yes) |

POST is where the danger lives. And most APIs use POST for important operations like payments.


Chapter 2: The Idempotency Key Pattern

2.1 The Core Idea

The client generates a unique identifier for each logical operation. The server remembers this identifier and its result.

First request:
  Client → POST /payments
           Idempotency-Key: pay_abc123
           {amount: 99.00}
           
  Server: "I've never seen pay_abc123 before"
          Process payment
          Store: pay_abc123 → {status: success, id: txn_789}
          
  Server → 200 OK {status: success, id: txn_789}

Second request (retry):
  Client → POST /payments
           Idempotency-Key: pay_abc123  (same key!)
           {amount: 99.00}
           
  Server: "I've seen pay_abc123 before!"
          Look up stored result
          
  Server → 200 OK {status: success, id: txn_789}  (same response!)

The payment only happened once, but the client got the confirmation it needed.

2.2 The Mental Model: The Coat Check

Think of a coat check at a restaurant:

You arrive at restaurant:
  You: "Here's my coat" (request)
  Attendant: Takes coat, gives you ticket #42 (idempotency key)
  
Later, you're not sure if you got a ticket:
  You: "Here's my coat again" + ticket #42
  Attendant: "I already have your coat for #42, here's the same ticket"
  
You don't end up checking your coat twice.
The ticket is proof of the operation.

2.3 Implementation: Basic Version

import hashlib
import json
import time
import redis
from typing import Optional, Tuple
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class IdempotencyRecord:
    key: str
    request_hash: str  # To detect conflicting requests
    status: str        # 'processing', 'completed', 'failed'
    response: Optional[dict]
    created_at: float

class IdempotencyStore:
    """
    Stores idempotency keys and their results.
    Uses Redis for fast lookups and automatic expiration.
    """
    
    def __init__(self, redis_client: redis.Redis, ttl_hours: int = 24):
        self.redis = redis_client
        self.ttl = timedelta(hours=ttl_hours)
    
    def _make_redis_key(self, idempotency_key: str) -> str:
        return f"idempotency:{idempotency_key}"
    
    def _hash_request(self, request_body: dict) -> str:
        """Create hash of request to detect conflicting retries."""
        serialized = json.dumps(request_body, sort_keys=True)
        return hashlib.sha256(serialized.encode()).hexdigest()[:16]
    
    def check_and_set(
        self,
        idempotency_key: str,
        request_body: dict
    ) -> Tuple[bool, Optional[dict]]:
        """
        Check if key exists. If not, claim it.
        
        Returns:
            (is_new, existing_response)
            - (True, None) if this is a new key
            - (False, response) if key exists with completed response
            - Raises ConflictError if key exists with different request
        """
        redis_key = self._make_redis_key(idempotency_key)
        request_hash = self._hash_request(request_body)
        
        # Try to get existing record
        existing = self.redis.get(redis_key)
        
        if existing:
            record = json.loads(existing)
            
            # Check if request matches
            if record['request_hash'] != request_hash:
                raise IdempotencyConflictError(
                    f"Key {idempotency_key} used with different request"
                )
            
            # If completed, return cached response
            if record['status'] == 'completed':
                return (False, record['response'])
            
            # If still processing, client should wait
            if record['status'] == 'processing':
                raise IdempotencyInProgressError(
                    f"Request {idempotency_key} is still being processed"
                )
            
            # If failed, allow retry
            if record['status'] == 'failed':
                # Update to processing and allow retry
                record['status'] = 'processing'
                self.redis.setex(
                    redis_key,
                    self.ttl,
                    json.dumps(record)
                )
                return (True, None)
        
        # New key - claim it
        record = {
            'key': idempotency_key,
            'request_hash': request_hash,
            'status': 'processing',
            'response': None,
            'created_at': time.time()
        }
        
        # Use SETNX for atomic claim (only set if not exists)
        claimed = self.redis.setnx(redis_key, json.dumps(record))
        
        if not claimed:
            # Race condition - another request claimed it first
            # Recurse to handle the existing record
            return self.check_and_set(idempotency_key, request_body)
        
        # Set TTL
        self.redis.expire(redis_key, self.ttl)
        
        return (True, None)
    
    def complete(self, idempotency_key: str, response: dict):
        """Mark request as completed with response."""
        redis_key = self._make_redis_key(idempotency_key)
        
        existing = self.redis.get(redis_key)
        if not existing:
            raise ValueError(f"No record for key {idempotency_key}")
        
        record = json.loads(existing)
        record['status'] = 'completed'
        record['response'] = response
        
        self.redis.setex(redis_key, self.ttl, json.dumps(record))
    
    def fail(self, idempotency_key: str, error: str):
        """Mark request as failed."""
        redis_key = self._make_redis_key(idempotency_key)
        
        existing = self.redis.get(redis_key)
        if not existing:
            return
        
        record = json.loads(existing)
        record['status'] = 'failed'
        record['response'] = {'error': error}
        
        self.redis.setex(redis_key, self.ttl, json.dumps(record))


class IdempotencyConflictError(Exception):
    """Raised when idempotency key is reused with different request."""
    pass

class IdempotencyInProgressError(Exception):
    """Raised when request is still being processed."""
    pass

2.4 Using the Idempotency Store

from typing import Callable

def with_idempotency(
    idempotency_store: IdempotencyStore,
    idempotency_key: str,
    request_body: dict,
    process_func: Callable[[], dict]
) -> dict:
    """
    Execute function with idempotency protection.
    
    If key was seen before with same request, return cached response.
    If key is new, execute function and cache result.
    """
    
    # Check if we've seen this request
    is_new, cached_response = idempotency_store.check_and_set(
        idempotency_key,
        request_body
    )
    
    if not is_new:
        # Return cached response (this is a retry)
        return cached_response
    
    # Process the request
    try:
        response = process_func()
        idempotency_store.complete(idempotency_key, response)
        return response
    
    except Exception as e:
        idempotency_store.fail(idempotency_key, str(e))
        raise

# Usage in payment handler
@app.post('/payments')
def create_payment(request):
    idempotency_key = request.headers.get('Idempotency-Key')
    
    if not idempotency_key:
        return Response(status=400, body="Idempotency-Key header required")
    
    def process():
        return payment_service.process_payment(
            user_id=request.json['user_id'],
            amount=request.json['amount']
        )
    
    try:
        result = with_idempotency(
            idempotency_store,
            idempotency_key,
            request.json,
            process
        )
        return Response(status=200, body=result)
    
    except IdempotencyConflictError:
        return Response(status=422, body="Idempotency key reused with different request")
    
    except IdempotencyInProgressError:
        return Response(status=409, body="Request still processing, please wait")

Chapter 3: Client-Generated vs Server-Generated Keys

3.1 Client-Generated Keys

The client creates the idempotency key and sends it with the request.

# Client code
import uuid
import requests

def pay(amount: float):
    idempotency_key = f"pay_{uuid.uuid4()}"
    headers = {'Idempotency-Key': idempotency_key}
    
    try:
        response = requests.post(
            '/payments',
            headers=headers,
            json={'amount': amount},
            timeout=5
        )
    except requests.exceptions.Timeout:
        response = None
    
    if response is None or response.status_code in (500, 502, 503, 504):
        # Safe to retry with the SAME key - the server deduplicates
        response = requests.post(
            '/payments',
            headers=headers,
            json={'amount': amount},
            timeout=5
        )
    
    return response

Pros:

  • Client controls retry behavior
  • Works across client restarts (if key is persisted)
  • No server state needed before first request

Cons:

  • Client might generate poor keys (duplicates, predictable)
  • Client might forget the key and retry with a new one (defeats purpose)
  • Requires client-side state management

3.2 Server-Generated Keys

Server provides a key that client uses for subsequent operations.

# Step 1: Client requests a payment intent
response = requests.post('/payment-intents', json={'amount': 99.00})
payment_intent_id = response.json()['id']  # "pi_abc123"

# Step 2: Client confirms payment (can retry safely)
response = requests.post(
    f'/payment-intents/{payment_intent_id}/confirm',
    json={'payment_method': 'card_xyz'}
)

# If step 2 times out, client can retry with same payment_intent_id
# Server knows whether confirmation already happened
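
For completeness, here's a minimal sketch of what the server side of this pattern could look like. It's illustrative only: the in-memory dict, function names, and the assumption that the bank charge succeeds are all stand-ins for a real implementation.

import uuid

payment_intents = {}  # stand-in for a durable intents table

def create_payment_intent(amount: float) -> dict:
    """Step 1: create the intent; its id doubles as the idempotency key."""
    intent = {
        'id': f"pi_{uuid.uuid4().hex[:8]}",
        'amount': amount,
        'status': 'requires_confirmation'
    }
    payment_intents[intent['id']] = intent
    return intent

def confirm_payment_intent(intent_id: str, payment_method: str) -> dict:
    """Step 2: confirm. Retrying is safe - an already-confirmed intent is returned as-is."""
    intent = payment_intents[intent_id]
    if intent['status'] == 'succeeded':
        return intent  # duplicate confirm: return the existing result
    # A real service would charge the bank here; assumed to succeed in this sketch
    intent['status'] = 'succeeded'
    intent['payment_method'] = payment_method
    return intent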

Pros:

  • Server controls key format and uniqueness
  • Natural fit for multi-step workflows
  • Client can't generate bad keys

Cons:

  • Requires extra round-trip to get key
  • Server must store intent before payment

3.3 Hybrid: Deterministic Keys from Request

Derive the key from request content:

import hashlib

def generate_idempotency_key(user_id: str, order_id: str, amount: float) -> str:
    """Generate deterministic key from request parameters."""
    content = f"{user_id}:{order_id}:{amount}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]

# Same order always generates same key
key = generate_idempotency_key("user_123", "order_456", 99.00)
# → "a7b3c9d2e1f0..."

Pros:

  • No state needed on client
  • Same logical operation always uses same key
  • Natural deduplication

Cons:

  • Can't distinguish intentional duplicates
  • User buying same item twice = same key = blocked!

Solution: Add timestamp bucket

from datetime import datetime

def generate_idempotency_key(user_id: str, order_id: str, amount: float) -> str:
    # Include hour to allow same purchase later
    hour = datetime.now().strftime("%Y%m%d%H")
    content = f"{user_id}:{order_id}:{amount}:{hour}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]

3.4 Which to Choose?

| Scenario | Recommendation |
|----------|----------------|
| Simple API with savvy clients | Client-generated UUID |
| Public API with unknown clients | Server-generated (payment intents) |
| Internal microservices | Deterministic from request |
| User-facing buttons (buy, submit) | Server-generated or deterministic |

Chapter 4: TTL and Deduplication Windows

4.1 How Long to Remember?

Idempotency records can't live forever:

  • Storage cost grows unbounded
  • Old keys might collide with new ones
  • You need to allow legitimate re-submission eventually

But too short a TTL defeats the purpose:

  • Request times out at 30 seconds
  • TTL is 10 seconds
  • Client retries at 35 seconds → key expired → double charge!

Timeline:

0s:    Client sends request
3.5s:  Request times out (from Day 1: our bank timeout)
5s:    Client retries → should be deduplicated

30s:   Client gives up
31s:   Original request finally completes at bank (very slow!)

If TTL = 60s:
  - 5s retry: deduplicated ✓
  - 31s completion: recorded ✓
  - User refreshes page at 45s: sees success ✓

If TTL = 10s:
  - 5s retry: deduplicated ✓
  - 31s completion: but record is gone!
  - Orphan transaction, no record ✗

4.2 TTL Guidelines

# Conservative recommendations

IDEMPOTENCY_TTL = {
    # For synchronous APIs (user waiting)
    'payment': timedelta(hours=24),
    
    # For async operations (processed later)
    'batch_job': timedelta(days=7),
    
    # For webhooks (external systems retry for days)
    'webhook': timedelta(days=3),
    
    # For internal services (known retry behavior)
    'internal_rpc': timedelta(hours=1),
}

Rule of thumb: TTL should be at least 10× your maximum retry window.
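
A quick back-of-the-envelope check of that rule, assuming a hypothetical client policy of 3 retries with exponential backoff on top of Day 1's 3.5-second request timeout:

from datetime import timedelta

request_timeout_s = 3.5
retries = 3
backoff_s = [1, 2, 4]   # sleep between attempts (assumed policy)

# Worst case: every attempt runs to its timeout, plus the backoff sleeps
retry_window_s = request_timeout_s * (retries + 1) + sum(backoff_s)   # 21 seconds

min_ttl = timedelta(seconds=retry_window_s * 10)   # the 10x rule of thumb
print(min_ttl)   # 0:03:30 - so anything above a few minutes is safe here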

4.3 What to Store

@dataclass
class IdempotencyRecord:
    # Required fields
    key: str                    # The idempotency key
    status: str                 # processing, completed, failed
    response: Optional[dict]    # Cached response to return
    created_at: datetime        # When first seen
    
    # Recommended fields
    request_hash: str           # To detect conflicting requests
    completed_at: Optional[datetime]  # When completed
    
    # Optional but useful
    user_id: Optional[str]      # Who made the request
    request_path: str           # Which endpoint
    request_body: dict          # Original request (for debugging)
    processing_time_ms: int     # How long it took

4.4 Storage Options

| Store | Pros | Cons |
|-------|------|------|
| Redis | Fast, built-in TTL | Memory cost, persistence concerns |
| PostgreSQL | Durable, familiar | Slower, manual TTL cleanup |
| DynamoDB | Managed, TTL support | Cost at high volume |
| In-memory | Fastest | Lost on restart, not distributed |

For payments: Use durable storage (PostgreSQL/DynamoDB) with Redis cache in front.
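
A minimal sketch of that layered lookup. It's illustrative only: sqlite3 stands in for the durable store to keep the example self-contained, and the table shape is assumed.

import json
import sqlite3
import redis
from typing import Optional

class LayeredIdempotencyLookup:
    """Read-through cache: check Redis first, fall back to the durable store."""

    def __init__(self, redis_client: redis.Redis, db_path: str = "idempotency.db"):
        self.redis = redis_client
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS idempotency_records "
            "(key TEXT PRIMARY KEY, record TEXT)"
        )

    def get(self, key: str) -> Optional[dict]:
        # 1. Hot path: Redis
        cached = self.redis.get(f"idempotency:{key}")
        if cached:
            return json.loads(cached)

        # 2. Cold path: durable store
        row = self.db.execute(
            "SELECT record FROM idempotency_records WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None

        # 3. Backfill the cache so the next retry hits the hot path
        self.redis.setex(f"idempotency:{key}", 3600, row[0])
        return json.loads(row[0])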


Chapter 5: The "Network Timeout" Problem

5.1 Yesterday's Unsolved Problem

From Day 1, our payment service:

async def _charge_bank(self, budget, user_id, amount):
    try:
        response = await client.post(BANK_URL, json={...}, timeout=timeout)
        return PaymentResult(status=SUCCESS, transaction_id=response['id'])
    
    except TimeoutError:
        # ← This is the problem!
        # Did the bank charge the user or not?
        return PaymentResult(
            status=ERROR,
            error_message='Payment processing timed out'
        )

The user sees "timed out" and might retry. The bank might have charged them.

5.2 The Solution: Pre-Register Before Calling Bank

New flow:

1. Receive payment request with idempotency key
2. Check if key exists → return cached result if so
3. Create idempotency record with status='processing'
4. Call bank API
5. Whether success or failure, update idempotency record
6. Return result

On timeout:
- Idempotency record exists with status='processing'
- We don't know the outcome
- Client retries:
  - We see status='processing'
  - We check with bank: "Did transaction X complete?"
  - Update record based on bank's answer
  - Return consistent result

5.3 Implementation

class PaymentServiceWithIdempotency:
    """
    Payment service from Day 1, now with idempotency.
    """
    
    def __init__(self, config: PaymentConfig, idempotency_store: IdempotencyStore):
        self.config = config
        self.idempotency = idempotency_store
        self.logger = logging.getLogger('payment_service')
    
    async def process_payment(
        self,
        user_id: str,
        amount: float,
        idempotency_key: str
    ) -> PaymentResult:
        """
        Process payment with idempotency protection.
        Safe to call multiple times with same idempotency_key.
        """
        
        request_body = {'user_id': user_id, 'amount': amount}
        
        # Step 1: Check idempotency
        try:
            is_new, cached = self.idempotency.check_and_set(
                idempotency_key,
                request_body
            )
            
            if not is_new:
                self.logger.info(f"Returning cached result for {idempotency_key}")
                return PaymentResult(**cached)
        
        except IdempotencyInProgressError:
            # Previous request still processing
            # Could be: our timeout fired, but bank is still working
            return await self._handle_in_progress(idempotency_key, request_body)
        
        except IdempotencyConflictError as e:
            return PaymentResult(status=PaymentStatus.ERROR, error_message=str(e))
        
        # Step 2: Process payment (now safe to proceed)
        try:
            result = await self._do_payment(user_id, amount, idempotency_key)
            
            # Step 3: Record success
            self.idempotency.complete(idempotency_key, result.__dict__)
            return result
        
        except Exception as e:
            # Step 3: Record failure
            self.idempotency.fail(idempotency_key, str(e))
            raise
    
    async def _do_payment(
        self,
        user_id: str,
        amount: float,
        idempotency_key: str
    ) -> PaymentResult:
        """
        Actual payment processing (from Day 1).
        """
        budget = TimeoutBudget(self.config.total_budget_ms)
        
        # Fraud check
        fraud_result = await self._check_fraud(budget, user_id, amount)
        if fraud_result.status != PaymentStatus.SUCCESS:
            return fraud_result
        
        # Bank charge (pass idempotency key to bank!)
        bank_result = await self._charge_bank(
            budget, user_id, amount, idempotency_key
        )
        
        if bank_result.status == PaymentStatus.SUCCESS:
            # Non-critical notification
            asyncio.create_task(
                self._send_notification(user_id, amount, bank_result.transaction_id)
            )
        
        return bank_result
    
    async def _charge_bank(
        self,
        budget: TimeoutBudget,
        user_id: str,
        amount: float,
        idempotency_key: str
    ) -> PaymentResult:
        """
        Charge bank with idempotency.
        Even if we timeout, we can recover.
        """
        config = self.config.bank_api
        
        try:
            timeout = budget.get_timeout(config.read_timeout * 1000)
            
            # Pass our idempotency key to bank
            # Bank will also deduplicate on their end
            response = await self.http_client.post(
                f"{config.url}/charge",
                headers={'Idempotency-Key': idempotency_key},
                json={'user_id': user_id, 'amount': amount},
                timeout=timeout
            )
            
            return PaymentResult(
                status=PaymentStatus.SUCCESS,
                transaction_id=response.json()['transaction_id']
            )
        
        except TimeoutError:
            # This is where Day 1 left us stuck!
            # Now we can handle it:
            return await self._recover_from_timeout(idempotency_key, user_id, amount)
    
    async def _recover_from_timeout(
        self,
        idempotency_key: str,
        user_id: str,
        amount: float
    ) -> PaymentResult:
        """
        Handle bank timeout by checking transaction status.
        """
        self.logger.warning(f"Bank timeout for {idempotency_key}, checking status")
        
        try:
            # Ask bank: "Did this transaction complete?"
            status_response = await self.http_client.get(
                f"{self.config.bank_api.url}/transactions",
                params={'idempotency_key': idempotency_key},
                timeout=5.0  # Short timeout for status check
            )
            
            data = status_response.json()
            
            if data.get('found'):
                # Transaction exists at bank
                if data['status'] == 'completed':
                    return PaymentResult(
                        status=PaymentStatus.SUCCESS,
                        transaction_id=data['transaction_id']
                    )
                elif data['status'] == 'failed':
                    return PaymentResult(
                        status=PaymentStatus.ERROR,
                        error_message=data.get('error', 'Bank declined')
                    )
                else:
                    # Still processing
                    return PaymentResult(
                        status=PaymentStatus.ERROR,
                        error_message='Payment is still processing. Please check back shortly.'
                    )
            else:
                # Transaction not found - request never reached bank
                # Safe to return error and allow retry
                return PaymentResult(
                    status=PaymentStatus.ERROR,
                    error_message='Payment could not be processed. Please try again.'
                )
        
        except Exception as e:
            # Can't even check status - be honest with user
            self.logger.error(f"Status check failed: {e}")
            return PaymentResult(
                status=PaymentStatus.ERROR,
                error_message='Unable to confirm payment status. Please check your statement before retrying.'
            )
    
    async def _handle_in_progress(
        self,
        idempotency_key: str,
        request_body: dict
    ) -> PaymentResult:
        """
        Handle case where previous request is still processing.
        This happens when our timeout fired but bank is still working.
        """
        self.logger.info(f"Request {idempotency_key} in progress, waiting")
        
        # Poll for completion
        for _ in range(10):  # Try for up to 10 seconds
            await asyncio.sleep(1)
            
            try:
                is_new, cached = self.idempotency.check_and_set(
                    idempotency_key,
                    request_body
                )
            except IdempotencyInProgressError:
                continue  # Still processing - keep polling
            
            if not is_new and cached:
                return PaymentResult(**cached)
        
        # Still processing after 10 seconds
        return PaymentResult(
            status=PaymentStatus.ERROR,
            error_message='Payment is taking longer than expected. Please check back shortly.'
        )

Part II: The Design Challenge

Chapter 6: The Double-Click Problem

6.1 The Scenario

User experience:

1. User fills out payment form
2. User clicks "Pay Now"
3. Spinner appears
4. After 2 seconds, nothing visible happens
5. User thinks: "Did it work?" → clicks "Pay Now" again
6. System receives two payment requests within milliseconds

Without idempotency:
  Request 1: Starts processing
  Request 2: Also starts processing
  Result: User charged twice!

With idempotency:
  Request 1: Creates idempotency record, starts processing
  Request 2: Sees idempotency record → returns "in progress" or waits
  Result: User charged once ✓

6.2 The Full Timeline

Let's trace through every scenario:

Timeline with idempotency:

T=0.000s:  User clicks "Pay Now"
           Client generates: idempotency_key = "order_123_pay_1"
           Request 1 sent

T=0.001s:  Server receives Request 1
           Check idempotency store: not found
           Create record: {key: "order_123_pay_1", status: "processing"}
           Start fraud check

T=2.000s:  User clicks "Pay Now" again (impatient)
           Client sends Request 2 with SAME idempotency_key
           
T=2.001s:  Server receives Request 2
           Check idempotency store: found, status = "processing"
           Return: "Request in progress, please wait"
           Client shows: "Still processing..."

T=3.000s:  Request 1 completes fraud check, calls bank

T=3.500s:  Bank responds: success!
           Update record: {status: "completed", response: {...}}
           Request 1 returns: success

T=3.600s:  Client (from Request 2) retries
           Check idempotency store: found, status = "completed"
           Return: cached response (success)

T=3.700s:  User sees: "Payment successful!"
           Only charged once ✓

6.3 The Response-Lost Scenario

Timeline when response is lost:

T=0.000s:  Client sends Request 1
T=0.001s:  Server creates idempotency record
T=3.000s:  Server completes payment, sends response
T=3.001s:  Response packet is lost in network
T=5.000s:  Client timeout fires → shows error to user
T=5.100s:  User clicks "Try Again"

T=5.200s:  Client sends Request 2 (same idempotency key)
T=5.201s:  Server checks idempotency store: found, status = "completed"
           Return: cached response (success!)

T=5.300s:  User sees: "Payment successful!"
           Only charged once, despite the "error" ✓

6.4 The Server-Crash Scenario

Timeline when server crashes:

T=0.000s:  Client sends Request 1
T=0.001s:  Server creates idempotency record: status = "processing"
T=1.000s:  Server crashes mid-processing!

T=5.000s:  Client timeout → user retries
T=5.100s:  Request 2 hits DIFFERENT server
T=5.101s:  Check idempotency store: found, status = "processing"
           But it's been 5 seconds... something is wrong

Two options:

Option A: Strict (reject)
  Return: "Previous request may still be processing"
  User must check statement manually

Option B: Recovery (check and retry)
  Check bank: "Did this transaction complete?"
  Bank: "No record of it"
  Reset idempotency record, process new request
  
Option B is better UX but requires bank lookup capability

Chapter 7: Edge Cases That Will Bite You

7.1 Edge Case 1: Key Reuse with Different Request

# Dangerous: Same key, different amount

# Request 1
POST /payments
Idempotency-Key: key_abc
{amount: 100}

# Request 2 (bug or malicious)
POST /payments
Idempotency-Key: key_abc
{amount: 1000000}  # Much larger!

# What happens?

Solution: Hash the request body and compare

def check_and_set(self, key: str, request_body: dict):
    request_hash = hash_request(request_body)
    
    existing = self.get(key)
    if existing:
        if existing.request_hash != request_hash:
            raise IdempotencyConflictError(
                "Key reused with different request parameters"
            )
        # ... rest of logic

7.2 Edge Case 2: Key Expiration Race

T=0:      Request starts, idempotency record created
T=23h59m: Request finally completes (extreme case)
T=24h:    TTL expires, record deleted
T=24h30s: Response sent to client
T=24h35s: Client receives timeout, retries
T=24h36s: No idempotency record found → processes again!

User charged twice, 24 hours apart.

Solution: Extend TTL on completion

def complete(self, key: str, response: dict):
    record = self.get(key)
    record.status = 'completed'
    record.response = response
    
    # Extend TTL to ensure long-lived record for completed transactions
    extended_ttl = max(self.ttl, timedelta(hours=24))
    self.store(key, record, ttl=extended_ttl)

7.3 Edge Case 3: Concurrent Requests with Same Key

T=0.000s: Request A creates record, starts processing
T=0.001s: Request B checks record, sees "processing"
T=0.002s: Request A fails, marks record as "failed"
T=0.003s: Request B... what should it do?

Option A: Return the failure
  But Request B didn't fail - it never ran!
  
Option B: Allow Request B to retry
  Better - Request B gets a chance

Solution: Only completed requests block retries

def check_and_set(self, key: str, request_body: dict):
    existing = self.get(key)
    
    if existing:
        if existing.status == 'completed':
            # Definitely don't retry - return cached success
            return (False, existing.response)
        
        elif existing.status == 'processing':
            # Still running - make caller wait
            raise IdempotencyInProgressError()
        
        elif existing.status == 'failed':
            # Previous attempt failed - allow retry
            existing.status = 'processing'
            self.store(key, existing)
            return (True, None)
    
    # New key
    self.create(key, request_body)
    return (True, None)

7.4 Edge Case 4: Partial Completion

Payment flow:
1. Charge customer ✓
2. Credit merchant ← fails!
3. Send receipt

What's the idempotency status?
- "completed" would be wrong (merchant not credited)
- "failed" would allow retry (double charge customer!)

Solution: Atomic operations or saga pattern

async def process_payment(self, ...):
    # Begin transaction
    async with self.db.transaction():
        # Charge customer
        charge_result = await self.charge_customer(...)
        
        # Credit merchant
        credit_result = await self.credit_merchant(...)
        
        # Both succeed or both fail (atomic)

Or with saga pattern (Day 4 webhook topic):

async def process_payment(self, ...):
    charge_result = await self.charge_customer(...)
    
    try:
        credit_result = await self.credit_merchant(...)
    except:
        # Compensating action
        await self.refund_customer(charge_result.id)
        raise

Chapter 8: Idempotency Key Design

8.1 Key Format Recommendations

# Good key formats:

# UUID (random, unique)
"550e8400-e29b-41d4-a716-446655440000"

# Prefixed UUID (self-documenting)
"pay_550e8400-e29b-41d4-a716-446655440000"

# Entity-based (deterministic)
"user_123_order_456_v1"

# Timestamp-bucketed (allows intentional retries after window)
"user_123_order_456_2024010112"  # Includes hour

# Bad key formats:

# Sequential (predictable, vulnerable)
"1", "2", "3"

# User-controlled (can be malicious)
"../../../etc/passwd"

# Too short (collision risk)
"abc"

# Timestamp only (same user, same millisecond = collision)
"1704067200000"

8.2 Key Generation Best Practices

import uuid
import hashlib
from datetime import datetime

class IdempotencyKeyGenerator:
    """Generate idempotency keys with various strategies."""
    
    @staticmethod
    def random() -> str:
        """Random UUID - use when client controls retry."""
        return f"key_{uuid.uuid4()}"
    
    @staticmethod
    def deterministic(user_id: str, action: str, entity_id: str) -> str:
        """Deterministic key - same input = same key."""
        content = f"{user_id}:{action}:{entity_id}"
        return hashlib.sha256(content.encode()).hexdigest()[:32]
    
    @staticmethod
    def time_bucketed(
        user_id: str,
        action: str,
        entity_id: str,
        bucket_minutes: int = 60
    ) -> str:
        """Allow same action after time window."""
        now = datetime.utcnow()
        # Bucket by minutes since midnight so windows longer than an hour
        # (e.g. 1440 for daily) work as intended
        minutes_since_midnight = now.hour * 60 + now.minute
        bucket_start = (minutes_since_midnight // bucket_minutes) * bucket_minutes
        bucket = now.replace(
            hour=bucket_start // 60,
            minute=bucket_start % 60,
            second=0,
            microsecond=0
        )
        content = f"{user_id}:{action}:{entity_id}:{bucket.isoformat()}"
        return hashlib.sha256(content.encode()).hexdigest()[:32]
    
    @staticmethod
    def composite(prefix: str, *parts: str) -> str:
        """Readable composite key."""
        safe_parts = [p.replace('_', '-') for p in parts]
        return f"{prefix}_{'_'.join(safe_parts)}"

# Usage examples:

# For user-initiated payment (use random)
key = IdempotencyKeyGenerator.random()
# → "key_550e8400-e29b-41d4-a716-446655440000"

# For order payment (use deterministic)
key = IdempotencyKeyGenerator.deterministic("user_123", "pay", "order_456")
# → "a1b2c3d4e5f6..."

# For daily charge (use time bucketed - daily)
key = IdempotencyKeyGenerator.time_bucketed("user_123", "subscription", "sub_789", 1440)
# → "x1y2z3..." (changes each day)

# For webhook (use composite for readability)
key = IdempotencyKeyGenerator.composite("webhook", "order_created", "order_456")
# → "webhook_order-created_order-456"

8.3 Key Validation

import re

class IdempotencyKeyValidator:
    """Validate idempotency keys."""
    
    MIN_LENGTH = 16
    MAX_LENGTH = 256
    PATTERN = re.compile(r'^[a-zA-Z0-9_\-]+$')
    
    @classmethod
    def validate(cls, key: str) -> tuple[bool, str]:
        """
        Validate key format.
        Returns (is_valid, error_message).
        """
        if not key:
            return False, "Idempotency key is required"
        
        if len(key) < cls.MIN_LENGTH:
            return False, f"Key too short (min {cls.MIN_LENGTH} chars)"
        
        if len(key) > cls.MAX_LENGTH:
            return False, f"Key too long (max {cls.MAX_LENGTH} chars)"
        
        if not cls.PATTERN.match(key):
            return False, "Key contains invalid characters"
        
        return True, ""

# Usage in middleware
@app.middleware
def validate_idempotency_key(request, call_next):
    key = request.headers.get('Idempotency-Key')
    
    if request.method == 'POST':  # Only required for non-idempotent methods
        if not key:
            return Response(status=400, body="Idempotency-Key header required")
        
        is_valid, error = IdempotencyKeyValidator.validate(key)
        if not is_valid:
            return Response(status=400, body=error)
    
    return call_next(request)

Part III: The Document Challenge

Chapter 9: Your Idempotency Strategy Document

Here's a template for documenting your idempotency strategy:

# Payment Idempotency Strategy

## Overview

All payment operations MUST be idempotent. Users can safely retry any
payment operation without risk of duplicate charges.

## Idempotency Key Format

### Client-Generated Keys (API Consumers)
- Format: `pay_<uuid4>`
- Example: `pay_550e8400-e29b-41d4-a716-446655440000`
- Required header: `Idempotency-Key: <key>`
- Validation: 16-256 alphanumeric characters plus `-` and `_`

### Server-Generated Keys (Internal)
- Format: `<action>_<entity>_<hash>`
- Example: `charge_order_abc123_v1`
- Generated from: user_id + order_id + amount

## Storage

- Primary: PostgreSQL `idempotency_keys` table
- Cache: Redis with 1-hour TTL
- Record TTL: 7 days for completed, 24 hours for failed

## Deduplication Window

| Scenario | Window |
|----------|--------|
| User retry (network timeout) | Immediate |
| Accidental double-click | Immediate |
| Page refresh | 1 hour |
| Intentional re-purchase | After 24 hours |

## Edge Cases

### 1. Same Key, Different Request
- Detection: Hash request body, compare on lookup
- Response: HTTP 422 "Idempotency key reused with different parameters"

### 2. Request In Progress
- Detection: Record exists with status='processing'
- Response: HTTP 409 "Request is being processed"
- Client action: Retry with exponential backoff

### 3. Previous Request Failed
- Detection: Record exists with status='failed'
- Response: Allow new attempt (reset status to 'processing')

### 4. Key Expired
- Detection: Record not found (TTL expired)
- Response: Process as new request
- Risk: Potential duplicate if original eventually completes
- Mitigation: Check bank API for existing transaction

## Reconciliation

Daily job to detect inconsistencies:
1. Find payments where our record differs from bank
2. Find orphaned transactions (bank charged, we have no record)
3. Alert on any discrepancies for manual review

## Monitoring

### Alerts
- Duplicate attempt rate > 5% (user experience issue)
- Conflict rate > 0.1% (potential bug or attack)
- Orphaned transaction detected (immediate page)

### Metrics
- `idempotency_cache_hit_rate` - Should be > 95%
- `idempotency_duplicate_prevented_total` - Growing is good
- `idempotency_conflict_total` - Should be near zero
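
As a rough illustration, the reconciliation job from the template could look like this. The `db.fetch_payments`, `bank_client.list_transactions`, and `alert` helpers are hypothetical; only the comparison logic matters here.

from datetime import date

def reconcile(day: date, db, bank_client, alert) -> None:
    """Compare our payment records for one day against the bank's."""
    ours = {p['bank_transaction_id']: p for p in db.fetch_payments(day)}
    theirs = {t['id']: t for t in bank_client.list_transactions(day)}

    # 1. Payments where our status differs from the bank's
    for txn_id, payment in ours.items():
        bank_txn = theirs.get(txn_id)
        if bank_txn and bank_txn['status'] != payment['status']:
            alert(f"Status mismatch for {txn_id}: "
                  f"ours={payment['status']} bank={bank_txn['status']}")

    # 2. Orphaned transactions: bank charged, we have no record
    for txn_id in theirs.keys() - ours.keys():
        alert(f"Orphaned bank transaction {txn_id} - no local payment record")

    # 3. Local records the bank never saw
    for txn_id in ours.keys() - theirs.keys():
        alert(f"Local payment {txn_id} not found at bank")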

Part IV: Discussion and Trade-offs

Chapter 10: The Hard Questions

10.1 "User clicks pay twice. Network timeout on first request—did it go through?"

Strong Answer (connecting to Day 1):

"This is exactly why we need idempotency, and why timeouts alone don't solve the problem.

When the first request times out, we don't know if the bank processed it. From Day 1, we learned that a timeout could mean:

  1. Request never reached bank
  2. Bank processed it, response was lost
  3. Bank is still processing

With idempotency:

Before calling bank:

  • Client sends idempotency key: pay_abc123
  • Server records: {key: pay_abc123, status: processing}

Timeout occurs:

  • Server doesn't know outcome
  • Returns: 'Payment processing timed out. Please check your statement.'

User clicks again (same key):

  • Server sees: status = 'processing'
  • Server asks bank: 'What happened to pay_abc123?'
  • Bank says: 'Completed' or 'Not found'
  • Server updates record and returns appropriate response

The key insight is: we pass our idempotency key TO the bank. The bank also deduplicates. So even if we call twice, they only charge once. And we can query by that key to find out what happened."

10.2 "How long do you keep idempotency records?"

Strong Answer:

"It depends on the retry pattern and compliance requirements.

Minimum: At least 10× your maximum timeout. If your timeout is 30 seconds and clients retry 3 times, that's 90 seconds. Keep records for at least 15 minutes.

Practical default: 24 hours covers most user retry scenarios:

  • Immediate double-click: covered
  • Refresh page after 5 minutes: covered
  • Come back next day and retry: NOT covered (intentional)

For financial transactions: Often 7+ days for compliance. You need to prove no duplicates occurred.

Storage strategy:

Redis (hot): 1 hour TTL - fast lookups for recent requests
PostgreSQL (cold): 30 day retention - audit trail and reconciliation
Archive (frozen): 7 years - regulatory compliance

The TTL tradeoff:

  • Too short: Legitimate retries create duplicates
  • Too long: Storage costs, stale data, potential key collisions

I'd rather err on the side of too long—storage is cheap, duplicate charges are expensive."

10.3 "Client-generated vs server-generated idempotency keys—which is better?"

Strong Answer:

"Neither is universally better. The right choice depends on the client and use case.

Client-generated (UUID):

  • Best for: API clients you trust (internal services, partner integrations)
  • Pros: Client controls retry logic, no extra round-trip
  • Cons: Client might generate weak keys or lose the key

Server-generated (payment intents):

  • Best for: Public APIs, web/mobile frontends
  • Pros: Server ensures key quality, natural multi-step flow
  • Cons: Extra round-trip, more server state

Deterministic (from request):

  • Best for: Internal microservices, known request shapes
  • Pros: No state to manage, same request = same key automatically
  • Cons: Can't distinguish intentional duplicates

For a public payment API like Stripe, I'd use server-generated (payment intents pattern). Create the intent, get an ID, then confirm with that ID. The ID serves as both the idempotency key and the resource reference.

For internal service-to-service calls, I'd use deterministic keys derived from the request. The calling service doesn't need to manage state, and duplicates are naturally deduplicated."


Chapter 11: Session Summary

What You Should Know Now

After this session, you should be able to:

  1. Explain why idempotency matters — Networks lose responses, users double-click
  2. Implement idempotency keys — Store, check, and return cached responses
  3. Handle the timeout problem — Status checks, recovery flows
  4. Design key strategies — Client vs server generated, TTL choices
  5. Document your strategy — Edge cases, monitoring, reconciliation

Connection to Day 1

Yesterday: "How do I stop waiting forever?" → Timeouts Today: "What if my timeout fires but the operation succeeded?" → Idempotency

Together, they solve the reliability problem:

  • Timeouts prevent resource exhaustion
  • Idempotency prevents duplicate effects
  • Your system is both responsive AND correct

Key Trade-offs

| Decision | Trade-off |
|----------|-----------|
| Shorter TTL | Less storage vs risk of duplicates after expiry |
| Client keys | Simpler API vs trusting client implementation |
| Server keys | More control vs extra round-trip |
| Strict matching | Detects conflicts vs may reject legitimate retries |

Questions to Ask in Every Design

  1. What operations are non-idempotent? (Usually writes)
  2. How will clients retry? (Same key or new key?)
  3. What's the deduplication window? (Minutes? Hours? Days?)
  4. How do we recover from ambiguous failures? (Status check APIs?)
  5. How do we reconcile mismatches? (Audit logs, daily jobs?)

Part V: Interview Questions and Answers

Chapter 12: Real-World Interview Scenarios

12.1 Conceptual Questions

Question 1: "What is idempotency and why is it important?"

Interviewer's Intent: Testing foundational understanding.

Strong Answer:

"Idempotency means an operation can be performed multiple times with the same effect as performing it once. It's crucial in distributed systems because networks are unreliable.

Consider a payment: the user clicks 'Pay', the bank charges them, but the response is lost. From the user's perspective, it failed. They click again. Without idempotency, they're charged twice.

The fundamental problem is that a client cannot distinguish between 'request never arrived' and 'response was lost.' Both look like timeout. Idempotency makes retry safe by remembering what we've processed.

HTTP methods like GET, PUT, DELETE are idempotent by definition. POST typically is not, which is why we add idempotency keys to POST endpoints for critical operations like payments."


Question 2: "Explain the difference between at-most-once, at-least-once, and exactly-once delivery."

Interviewer's Intent: Testing distributed systems knowledge.

Strong Answer:

"These describe message delivery guarantees:

At-most-once: Fire and forget. Send message, don't check if received. Message may be lost. Example: UDP, fire-and-forget events.

At-least-once: Retry until acknowledged. Message delivered one or more times. Duplicates possible. Example: most message queues with retry.

Exactly-once: Message delivered exactly one time. No loss, no duplicates.

Here's the key insight: exactly-once is impossible in a distributed system without cooperation from the receiver. What we actually implement is:

'At-least-once delivery with idempotent processing'

The sender retries until acknowledged (at-least-once). The receiver deduplicates (idempotent). The combination behaves like exactly-once from the application's perspective.

So when someone says their system supports exactly-once, they really mean: we handle duplicates for you."
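
A minimal sketch of that combination. The processed-ID set is held in memory to keep the example short; a real consumer would persist it (Redis or a database), exactly like the idempotency store from Chapter 2.

processed_ids = set()   # in-memory for the sketch; persist this in production

def apply_side_effects(message: dict) -> None:
    print(f"processing {message['id']} once")   # e.g. credit an account

def handle_message(message: dict) -> None:
    message_id = message['id']          # producer assigns a stable, unique id
    if message_id in processed_ids:
        return                          # duplicate delivery: skip, then acknowledge
    apply_side_effects(message)
    processed_ids.add(message_id)       # marking after the effect keeps at-least-once semantics

# The broker may deliver the same message twice; the effect happens once
handle_message({'id': 'msg_1'})
handle_message({'id': 'msg_1'})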


Question 3: "How would you handle partial failures in an idempotent operation?"

Interviewer's Intent: Testing understanding of complex scenarios.

Strong Answer:

"Partial failures are tricky because the idempotency record might not reflect the true state.

Example: Payment flow

  1. Charge customer's card ✓
  2. Credit merchant's account ✗ (fails)
  3. Send receipt

The card was charged but merchant wasn't credited. What's our idempotency status?

Three approaches:

Approach 1: Atomic transactions
Wrap all operations in a database transaction. Either all succeed or all fail. The idempotency record is only written on full success.

with db.transaction():
    charge_customer()
    credit_merchant()
    idempotency.complete(key)  # Only reached if both succeed

Approach 2: Saga with compensation
If step 2 fails, undo step 1 before marking the request failed:

charge_result = charge_customer()
try:
    credit_merchant()
except:
    refund_customer(charge_result)  # Compensating action
    idempotency.fail(key)
    raise
idempotency.complete(key)

Approach 3: State machine
Track which steps completed. On retry, resume from the last successful step:

state = idempotency.get_state(key)

if state.step < 1:
    charge_customer()
    idempotency.set_step(key, 1)

if state.step < 2:
    credit_merchant()
    idempotency.set_step(key, 2)

idempotency.complete(key)

The best approach depends on whether operations can be undone and whether they're in the same database."


12.2 Design Questions

Question 4: "Design the idempotency layer for a payment API."

Interviewer's Intent: Testing end-to-end design.

Strong Answer:

"I'll design this layer by layer.

API Contract:

POST /v1/payments
Headers:
  Idempotency-Key: <required, string, 16-256 chars>
Body:
  {amount, currency, customer_id, payment_method}
Response:
  {id, status, amount, created_at}

Key validation:

  • Required for all POST requests
  • 16-256 alphanumeric plus - and _
  • Reject reserved prefixes (system_, test_)

Storage design:

CREATE TABLE idempotency_records (
    key VARCHAR(256) PRIMARY KEY,
    request_hash CHAR(64),
    status VARCHAR(20),  -- processing, completed, failed
    response JSONB,
    created_at TIMESTAMP,
    completed_at TIMESTAMP,
    expires_at TIMESTAMP
);

CREATE INDEX idx_expires ON idempotency_records(expires_at);

Plus Redis cache for hot lookups.

Processing flow:

def process_payment(request):
    key = request.headers['Idempotency-Key']
    
    # 1. Check cache
    cached = redis.get(f"idem:{key}")
    if cached and cached.status == 'completed':
        return cached.response
    
    # 2. Check/claim in database
    with db.transaction():
        existing = db.select_for_update(key)
        
        if existing:
            if existing.status == 'completed':
                return existing.response
            if existing.status == 'processing':
                raise InProgressError()
            # status == 'failed': allow retry
        
        db.upsert(key, status='processing', request_hash=hash(request))
    
    # 3. Process payment
    try:
        result = bank_api.charge(...)
        
        # 4. Record success
        db.update(key, status='completed', response=result)
        redis.setex(f"idem:{key}", 3600, result)
        
        return result
    except Exception as e:
        db.update(key, status='failed')
        raise

Edge cases handled:

  • Double click: Second request sees 'processing', waits or returns 409
  • Retry after error: 'failed' status allows retry
  • Different request same key: Compare request_hash, return 422
  • TTL expiry: Cleanup job, but also check bank for existing transaction

Monitoring:

  • Cache hit rate: Should be high
  • Duplicate rate: Healthy system sees some (expected retries)
  • Conflict rate: Should be near zero (indicates client bugs)"

Question 5: "How would you implement idempotency for a multi-step workflow?"

Interviewer's Intent: Testing complex scenario handling.

Strong Answer:

"Multi-step workflows need idempotency at both workflow and step level.

Example: Booking flow

  1. Reserve inventory
  2. Charge payment
  3. Confirm booking
  4. Send confirmation email

Approach: State machine with step tracking

class BookingWorkflow:
    STEPS = ['reserve', 'charge', 'confirm', 'notify']
    
    def execute(self, idempotency_key: str, booking_data: dict):
        # Get or create workflow state
        state = self.get_state(idempotency_key)
        
        if state.status == 'completed':
            return state.result
        
        try:
            # Resume from last successful step
            for step in self.STEPS:
                if step not in state.completed_steps:
                    result = self.execute_step(step, booking_data, state)
                    state.completed_steps.append(step)
                    state.step_results[step] = result
                    self.save_state(idempotency_key, state)
            
            state.status = 'completed'
            self.save_state(idempotency_key, state)
            return state.result
            
        except Exception as e:
            state.status = 'failed'
            state.error = str(e)
            self.save_state(idempotency_key, state)
            raise
    
    def execute_step(self, step: str, data: dict, state: WorkflowState):
        step_key = f"{state.idempotency_key}_{step}"
        
        # Each step is also idempotent
        if step == 'reserve':
            return self.inventory.reserve(data['items'], idempotency_key=step_key)
        elif step == 'charge':
            return self.payments.charge(data['amount'], idempotency_key=step_key)
        # ... etc

Key design decisions:

  1. Workflow-level key: Identifies the entire booking
  2. Step-level keys: Derived from workflow key, make each step idempotent
  3. State persistence: Save after each step so we can resume
  4. Compensation: If step 3 fails, we might need to undo steps 1-2

For compensation:

def execute_with_compensation(self, ...):
    completed = []
    try:
        for step in self.STEPS:
            result = self.execute_step(step, ...)
            completed.append((step, result))
    except:
        # Rollback in reverse order
        for step, result in reversed(completed):
            self.compensate(step, result)
        raise

This is basically the Saga pattern, which we'll cover more in Day 4."


12.3 Scenario-Based Questions

Question 6: "User reports being charged twice. How do you investigate?"

Interviewer's Intent: Testing operational skills.

Strong Answer:

"I'd investigate systematically:

Step 1: Gather facts

  • Get user's account ID, approximate time
  • Get any transaction IDs they have (email receipts)

Step 2: Check our records

SELECT * FROM payments 
WHERE user_id = 'xyz' 
AND created_at > '2024-01-01'
ORDER BY created_at;

Look for:

  • Two transactions with same amount, close timestamps
  • Different transaction IDs = genuinely processed twice
  • Same transaction ID = UI bug showing twice

Step 3: Check idempotency records

SELECT * FROM idempotency_records
WHERE key LIKE 'pay_xyz_%'
AND created_at > '2024-01-01';

Look for:

  • Two different keys (user retried with new key = our client bug)
  • Same key, two completions (shouldn't happen, major bug)
  • Key conflict errors (user might have reused key incorrectly)

Step 4: Check bank/payment processor

  • Query their API for transactions
  • Compare their records to ours
  • They're the source of truth for actual charges

Step 5: Root cause

Likely causes:

  1. Client generated new key on retry: Fix client code
  2. Idempotency bypass: Some code path doesn't check idempotency
  3. TTL expired between attempts: Increase TTL
  4. Different API endpoints used: One idempotent, one not

Step 6: Remediation

  • If double charge confirmed: Refund one transaction
  • If our bug: Write postmortem, deploy fix
  • If client bug: Help them fix, document better

Prevention:

  • Add monitoring for duplicate-looking transactions
  • Alert when same user has multiple charges within X minutes
  • Reconciliation job comparing our records to bank"

Question 7: "How would you test idempotency implementation?"

Interviewer's Intent: Testing quality mindset.

Strong Answer:

"I'd test at multiple levels:

Unit tests: Core logic

def test_new_key_is_processed():
    store = IdempotencyStore()
    is_new, cached = store.check_and_set('key1', {'a': 1})
    assert is_new == True
    assert cached is None

def test_existing_completed_returns_cached():
    store = IdempotencyStore()
    store.check_and_set('key1', {'a': 1})
    store.complete('key1', {'result': 'ok'})
    
    is_new, cached = store.check_and_set('key1', {'a': 1})
    assert is_new == False
    assert cached == {'result': 'ok'}

def test_different_request_same_key_raises():
    store = IdempotencyStore()
    store.check_and_set('key1', {'a': 1})
    
    with pytest.raises(IdempotencyConflictError):
        store.check_and_set('key1', {'a': 2})  # Different request!

Integration tests: Full flow

def test_duplicate_payment_deduplicated():
    key = 'pay_test123'
    
    # First request
    resp1 = client.post('/payments', 
        headers={'Idempotency-Key': key},
        json={'amount': 100})
    assert resp1.status == 200
    
    # Second request (same key)
    resp2 = client.post('/payments',
        headers={'Idempotency-Key': key},
        json={'amount': 100})
    assert resp2.status == 200
    assert resp2.json() == resp1.json()
    
    # Verify only one charge
    charges = bank_mock.get_charges()
    assert len(charges) == 1

Concurrency tests

def test_concurrent_same_key():
    key = 'pay_concurrent'
    results = []
    
    def make_request():
        resp = client.post('/payments',
            headers={'Idempotency-Key': key},
            json={'amount': 100})
        results.append(resp)
    
    # Fire 10 requests simultaneously
    threads = [Thread(target=make_request) for _ in range(10)]
    [t.start() for t in threads]
    [t.join() for t in threads]
    
    # All should get same response
    assert len(set(r.json()['id'] for r in results)) == 1
    
    # Only one actual charge
    assert len(bank_mock.get_charges()) == 1

Failure injection tests

def test_timeout_then_retry():
    key = 'pay_timeout'
    
    # Make bank slow on first call
    bank_mock.set_delay(10)  # 10 seconds
    
    with pytest.raises(Timeout):
        client.post('/payments',
            headers={'Idempotency-Key': key},
            json={'amount': 100},
            timeout=1)
    
    # Bank actually processed it (slowly)
    bank_mock.set_delay(0)
    
    # Retry should return the result
    resp = client.post('/payments',
        headers={'Idempotency-Key': key},
        json={'amount': 100})
    
    assert resp.json()['status'] == 'success'
    assert len(bank_mock.get_charges()) == 1

Production verification

  • Deploy to staging, run load test with duplicates
  • Monitor: duplicate detection rate, conflict rate
  • Shadow mode: Log duplicates without deduplicating, compare to expected"

12.4 Deep-Dive Questions

Question 8: "How do systems like Stripe implement idempotency?"

Interviewer's Intent: Testing industry knowledge.

Strong Answer:

"Stripe's approach is well-documented and is the industry standard:

Key format:

  • Client provides Idempotency-Key header
  • Max 255 characters
  • Any string works, but they recommend UUIDs

Behavior:

  • Keys are scoped to API key (different accounts can use same key)
  • 24-hour TTL after first use
  • Replays must have identical request body
  • Returns cached response with original status code

Status tracking:

  • Requests in flight are tracked
  • Concurrent duplicate returns 409 with 'request in progress'
  • After completion, returns cached response

What makes it good:

  1. Request fingerprinting: They hash the request and compare. Same key + different body = error.

  2. Saved for 24 hours: Generous window for retries.

  3. Scoped to API key: Prevents conflicts between merchants.

  4. Works across their whole API: Consistent behavior everywhere.

Implementation insights from their blog:

They store:

  • Key
  • Request hash
  • Response (full, including status code)
  • Created at
  • API key ID

They handle racing requests with database locks, ensuring only one executes while others wait.

They recommend clients:

  • Generate UUID for each logical operation
  • Store key with the operation until confirmed
  • Retry with same key on network errors

This is the pattern I'd implement. It's battle-tested at massive scale."
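
Those client-side recommendations translate into very little code. A minimal sketch, assuming the requests library and a hypothetical endpoint (this is not Stripe's SDK): generate one UUID per logical payment and reuse it on every retry.

import uuid
import requests

def pay_with_retries(amount_cents: int, max_attempts: int = 3):
    key = str(uuid.uuid4())  # one key per logical operation; store it until confirmed
    for attempt in range(max_attempts):
        try:
            resp = requests.post(
                'https://api.example.com/v1/payments',  # hypothetical endpoint
                headers={'Idempotency-Key': key},
                json={'amount': amount_cents},
                timeout=5,
            )
            return resp.json()
        except requests.exceptions.RequestException:
            if attempt == max_attempts - 1:
                raise  # give up; the stored key can still be reused later
            # Ambiguous network failure: retry with the SAME key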


Question 9: "Compare idempotency at the API layer vs the database layer."

Interviewer's Intent: Testing architectural thinking.

Strong Answer:

"Both layers can provide idempotency, but they solve different problems.

API layer idempotency:

Where: Application code, before business logic runs
What it catches: Duplicate API calls (user clicks twice, client retries)
How: Check the idempotency key, return the cached response

@app.post('/payments')
def create_payment(request):
    if idempotency_exists(request.key):
        return cached_response(request.key)
    # ... process payment

Pros: Catches duplicates early, saves processing time
Cons: Doesn't catch bugs in your own code

Database layer idempotency:

Where: Database constraints
What it catches: Duplicate records from any source
How: Unique constraints on business keys

CREATE TABLE payments (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(255) NOT NULL,
    order_id VARCHAR(255) NOT NULL,
    idempotency_key VARCHAR(255) UNIQUE,
    -- OR a constraint on the business key:
    UNIQUE(user_id, order_id)
);

Pros: Last line of defense, catches bugs in your code
Cons: Happens late, after processing work

Best practice: Use both!

Request → API Idempotency Check → Business Logic → Database Constraint
              ↓                                          ↓
        (Fast rejection)                       (Final safety net)

API layer is the primary defense: fast, returns cached response. Database constraint is backup: catches bugs you didn't anticipate.

The database constraint should rarely trigger in production. If it does frequently, that's a sign your API layer idempotency is broken.

For payments specifically, I'd also add:

  • Application-level check of recent transactions
  • Reconciliation with payment processor
  • Alerts on constraint violations

Defense in depth is the right approach for critical flows."
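
A minimal sketch of the database layer acting as that final safety net, assuming the payments table above and asyncpg (a hypothetical helper, not the appendix's reference implementation):

import asyncpg

async def insert_payment(conn: asyncpg.Connection, user_id: str, order_id: str):
    """
    Insert a payment; if the unique constraint fires, return the existing row.
    Assumes no enclosing transaction (a failed statement would abort one).
    """
    try:
        return await conn.fetchrow(
            'INSERT INTO payments (user_id, order_id) VALUES ($1, $2) RETURNING *',
            user_id, order_id,
        )
    except asyncpg.UniqueViolationError:
        # The constraint caught a duplicate that slipped past the API layer.
        # Alert on this (it should be rare) and return the existing record
        # instead of creating a second charge.
        return await conn.fetchrow(
            'SELECT * FROM payments WHERE user_id = $1 AND order_id = $2',
            user_id, order_id,
        )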


Chapter 13: Interview Preparation Checklist

Before your interview, make sure you can:

Concepts

  • Explain idempotency with a real-world example
  • Describe the network timeout problem
  • Compare at-most-once, at-least-once, exactly-once

Implementation

  • Design an idempotency key storage system
  • Handle concurrent requests with same key
  • Recover from ambiguous failures (timeout)

Design Decisions

  • Compare client vs server generated keys
  • Choose appropriate TTL for different scenarios
  • Handle partial failures in multi-step operations

Operations

  • Investigate duplicate charge reports
  • Design monitoring for idempotency layer
  • Write tests for idempotency behavior

Exercises

Exercise 1: Implement Full Idempotency Layer

Build an idempotency system that:

  • Stores keys in Redis with PostgreSQL backup
  • Handles concurrent requests safely
  • Supports status polling for long operations
  • Includes metrics and logging

Exercise 2: Idempotency for Order Creation

Design idempotency for an order system where:

  • Users can place orders
  • Each order has multiple items
  • Inventory must be reserved
  • Payment must be processed
  • User might click "Place Order" twice

Exercise 3: Reconciliation System

Build a reconciliation job that:

  • Compares your payment records to bank records
  • Identifies mismatches (duplicate charges, missing records)
  • Generates alerts for human review
  • Runs daily and produces a report

Further Reading

  • Stripe API Documentation: Idempotent Requests
  • AWS: Building Distributed Applications: Idempotency patterns
  • "Designing Data-Intensive Applications": Chapter on distributed transactions
  • Brandur Leach's Blog: Implementing Stripe-like Idempotency Keys in Postgres

Appendix: Complete Idempotency Implementation

"""
Production-ready idempotency implementation.
Builds on Day 1's payment service, adding retry safety.
"""

import hashlib
import json
import time
import logging
from dataclasses import dataclass, field
from typing import Optional, Dict, Callable, Awaitable
from datetime import datetime, timedelta
from enum import Enum
import redis.asyncio as redis
import asyncpg
from prometheus_client import Counter, Histogram

# =============================================================================
# Metrics
# =============================================================================

idempotency_checks = Counter(
    'idempotency_checks_total',
    'Idempotency check results',
    ['result']  # new, cached, conflict, in_progress
)

idempotency_latency = Histogram(
    'idempotency_check_seconds',
    'Idempotency check latency',
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5]
)

# =============================================================================
# Types
# =============================================================================

class IdempotencyStatus(Enum):
    PROCESSING = 'processing'
    COMPLETED = 'completed'
    FAILED = 'failed'

@dataclass
class IdempotencyRecord:
    key: str
    request_hash: str
    status: IdempotencyStatus
    response: Optional[Dict] = None
    http_status: Optional[int] = None
    created_at: datetime = field(default_factory=datetime.utcnow)
    completed_at: Optional[datetime] = None
    user_id: Optional[str] = None
    
    def to_dict(self) -> Dict:
        return {
            'key': self.key,
            'request_hash': self.request_hash,
            'status': self.status.value,
            'response': self.response,
            'http_status': self.http_status,
            'created_at': self.created_at.isoformat(),
            'completed_at': self.completed_at.isoformat() if self.completed_at else None,
            'user_id': self.user_id,
        }
    
    @classmethod
    def from_dict(cls, data: Dict) -> 'IdempotencyRecord':
        return cls(
            key=data['key'],
            request_hash=data['request_hash'],
            status=IdempotencyStatus(data['status']),
            response=data.get('response'),
            http_status=data.get('http_status'),
            created_at=datetime.fromisoformat(data['created_at']),
            completed_at=datetime.fromisoformat(data['completed_at']) if data.get('completed_at') else None,
            user_id=data.get('user_id'),
        )

# =============================================================================
# Exceptions
# =============================================================================

class IdempotencyError(Exception):
    pass

class IdempotencyConflictError(IdempotencyError):
    """Key reused with different request."""
    pass

class IdempotencyInProgressError(IdempotencyError):
    """Previous request still processing."""
    pass

# =============================================================================
# Main Implementation
# =============================================================================

class IdempotencyStore:
    """
    Production idempotency store with Redis cache and PostgreSQL persistence.
    
    Features:
    - Fast path through Redis for cache hits
    - PostgreSQL for durability and recovery
    - Atomic claim using database locks
    - Automatic TTL management
    """
    
    def __init__(
        self,
        redis_client: redis.Redis,
        pg_pool: asyncpg.Pool,
        cache_ttl: timedelta = timedelta(hours=1),
        record_ttl: timedelta = timedelta(days=7),
    ):
        self.redis = redis_client
        self.pg = pg_pool
        self.cache_ttl = cache_ttl
        self.record_ttl = record_ttl
        self.logger = logging.getLogger('idempotency')
    
    def _hash_request(self, request_body: Dict) -> str:
        """Create deterministic hash of request."""
        serialized = json.dumps(request_body, sort_keys=True, default=str)
        return hashlib.sha256(serialized.encode()).hexdigest()
    
    def _redis_key(self, key: str) -> str:
        return f"idempotency:{key}"
    
    async def check_and_claim(
        self,
        key: str,
        request_body: Dict,
        user_id: Optional[str] = None
    ) -> tuple[bool, Optional[IdempotencyRecord]]:
        """
        Check if key exists and claim it if not.
        
        Returns:
            (is_new, record)
            - (True, None): Key is new, claimed for processing
            - (False, record): Key exists, return cached record
            
        Raises:
            IdempotencyConflictError: Key exists with different request
            IdempotencyInProgressError: Previous request still processing
        """
        start = time.time()
        request_hash = self._hash_request(request_body)
        
        try:
            # Fast path: check Redis cache
            cached = await self._check_cache(key, request_hash)
            if cached:
                idempotency_checks.labels(result='cached').inc()
                return (False, cached)
            
            # Slow path: check and claim in database
            return await self._check_and_claim_db(key, request_hash, user_id)
        
        finally:
            idempotency_latency.observe(time.time() - start)
    
    async def _check_cache(self, key: str, request_hash: str) -> Optional[IdempotencyRecord]:
        """Check Redis cache for completed request."""
        redis_key = self._redis_key(key)
        cached = await self.redis.get(redis_key)
        
        if not cached:
            return None
        
        record = IdempotencyRecord.from_dict(json.loads(cached))
        
        # Validate request hash
        if record.request_hash != request_hash:
            raise IdempotencyConflictError(
                f"Key '{key}' was used with a different request"
            )
        
        if record.status == IdempotencyStatus.COMPLETED:
            return record
        
        if record.status == IdempotencyStatus.PROCESSING:
            raise IdempotencyInProgressError(
                f"Request '{key}' is still being processed"
            )
        
        # Failed - allow retry (will go to DB path)
        return None
    
    async def _check_and_claim_db(
        self,
        key: str,
        request_hash: str,
        user_id: Optional[str]
    ) -> tuple[bool, Optional[IdempotencyRecord]]:
        """Check and claim in database with locking."""
        
        async with self.pg.acquire() as conn:
            async with conn.transaction():
                # Try to get existing record with lock
                existing = await conn.fetchrow("""
                    SELECT * FROM idempotency_records
                    WHERE key = $1
                    FOR UPDATE
                """, key)
                
                if existing:
                    record = self._row_to_record(existing)
                    
                    # Validate request hash
                    if record.request_hash != request_hash:
                        idempotency_checks.labels(result='conflict').inc()
                        raise IdempotencyConflictError(
                            f"Key '{key}' was used with a different request"
                        )
                    
                    if record.status == IdempotencyStatus.COMPLETED:
                        # Cache for future fast-path
                        await self._cache_record(record)
                        idempotency_checks.labels(result='cached').inc()
                        return (False, record)
                    
                    if record.status == IdempotencyStatus.PROCESSING:
                        # Check if stuck (processing for too long)
                        if self._is_stuck(record):
                            # Allow retry - reset to processing
                            await self._reset_record(conn, key)
                            idempotency_checks.labels(result='new').inc()
                            return (True, None)
                        
                        idempotency_checks.labels(result='in_progress').inc()
                        raise IdempotencyInProgressError(
                            f"Request '{key}' is still being processed"
                        )
                    
                    # Failed - allow retry
                    await conn.execute("""
                        UPDATE idempotency_records
                        SET status = 'processing', completed_at = NULL
                        WHERE key = $1
                    """, key)
                    idempotency_checks.labels(result='new').inc()
                    return (True, None)
                
                # New key - insert. ON CONFLICT guards against a concurrent
                # insert of the same key racing past the SELECT FOR UPDATE above.
                result = await conn.execute("""
                    INSERT INTO idempotency_records
                    (key, request_hash, status, created_at, user_id, expires_at)
                    VALUES ($1, $2, 'processing', $3, $4, $5)
                    ON CONFLICT (key) DO NOTHING
                """, key, request_hash, datetime.utcnow(), user_id,
                     datetime.utcnow() + self.record_ttl)
                
                if result == 'INSERT 0 0':
                    # Lost the race: another request claimed this key first
                    idempotency_checks.labels(result='in_progress').inc()
                    raise IdempotencyInProgressError(
                        f"Request '{key}' is still being processed"
                    )
                
                idempotency_checks.labels(result='new').inc()
                return (True, None)
    
    def _is_stuck(self, record: IdempotencyRecord) -> bool:
        """Check if a processing record is stuck (> 5 minutes)."""
        stuck_threshold = timedelta(minutes=5)
        return datetime.utcnow() - record.created_at > stuck_threshold
    
    async def _reset_record(self, conn, key: str):
        """Reset a stuck record for retry."""
        await conn.execute("""
            UPDATE idempotency_records
            SET status = 'processing',
                created_at = $1,
                completed_at = NULL
            WHERE key = $2
        """, datetime.utcnow(), key)
    
    async def complete(
        self,
        key: str,
        response: Dict,
        http_status: int = 200
    ):
        """Mark request as completed with response."""
        now = datetime.utcnow()
        
        # Update database
        async with self.pg.acquire() as conn:
            await conn.execute("""
                UPDATE idempotency_records
                SET status = 'completed',
                    response = $1,
                    http_status = $2,
                    completed_at = $3
                WHERE key = $4
            """, json.dumps(response), http_status, now, key)
            
            # Get full record for caching
            row = await conn.fetchrow(
                "SELECT * FROM idempotency_records WHERE key = $1", key
            )
        
        # Update cache
        if row:
            record = self._row_to_record(row)
            await self._cache_record(record)
    
    async def fail(self, key: str, error: str):
        """Mark request as failed."""
        async with self.pg.acquire() as conn:
            await conn.execute("""
                UPDATE idempotency_records
                SET status = 'failed',
                    response = $1,
                    completed_at = $2
                WHERE key = $3
            """, json.dumps({'error': error}), datetime.utcnow(), key)
        
        # Remove from cache (allow retry)
        await self.redis.delete(self._redis_key(key))
    
    async def _cache_record(self, record: IdempotencyRecord):
        """Cache completed record in Redis."""
        redis_key = self._redis_key(record.key)
        await self.redis.setex(
            redis_key,
            int(self.cache_ttl.total_seconds()),
            json.dumps(record.to_dict())
        )
    
    def _row_to_record(self, row) -> IdempotencyRecord:
        """Convert database row to record."""
        return IdempotencyRecord(
            key=row['key'],
            request_hash=row['request_hash'],
            status=IdempotencyStatus(row['status']),
            response=json.loads(row['response']) if row['response'] else None,
            http_status=row['http_status'],
            created_at=row['created_at'],
            completed_at=row['completed_at'],
            user_id=row['user_id'],
        )


# =============================================================================
# High-Level Wrapper
# =============================================================================

async def with_idempotency(
    store: IdempotencyStore,
    key: str,
    request_body: Dict,
    processor: Callable[[], Awaitable[Dict]],
    user_id: Optional[str] = None
) -> tuple[Dict, int]:
    """
    Execute processor with idempotency protection.
    
    Returns: (response_body, http_status)
    """
    # Check/claim
    is_new, record = await store.check_and_claim(key, request_body, user_id)
    
    if not is_new:
        # Return cached response
        return (record.response, record.http_status)
    
    # Process
    try:
        response = await processor()
        await store.complete(key, response, http_status=200)
        return (response, 200)
    
    except Exception as e:
        await store.fail(key, str(e))
        raise


# =============================================================================
# Database Schema
# =============================================================================

SCHEMA = """
CREATE TABLE IF NOT EXISTS idempotency_records (
    key VARCHAR(256) PRIMARY KEY,
    request_hash CHAR(64) NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'processing',
    response JSONB,
    http_status INTEGER,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMP,
    user_id VARCHAR(256),
    expires_at TIMESTAMP NOT NULL,
    
    CHECK (status IN ('processing', 'completed', 'failed'))
);

CREATE INDEX IF NOT EXISTS idx_idempotency_expires 
ON idempotency_records(expires_at);

CREATE INDEX IF NOT EXISTS idx_idempotency_user 
ON idempotency_records(user_id);

-- Cleanup job (run daily)
-- DELETE FROM idempotency_records WHERE expires_at < NOW();
"""


# =============================================================================
# Example Usage with Payment Service
# =============================================================================

async def example_usage():
    """Example of using idempotency with payment service."""
    
    # Setup
    redis_client = redis.from_url('redis://localhost')
    pg_pool = await asyncpg.create_pool('postgresql://localhost/payments')
    
    store = IdempotencyStore(redis_client, pg_pool)
    payment_service = PaymentService(...)  # From Day 1
    
    # Handle payment request (Response stands in for your web framework's response class)
    async def handle_payment(request):
        key = request.headers.get('Idempotency-Key')
        
        if not key:
            return Response(status=400, body={'error': 'Idempotency-Key required'})
        
        async def process():
            result = await payment_service.process_payment(
                user_id=request.json['user_id'],
                amount=request.json['amount'],
                idempotency_key=key
            )
            return result.to_dict()
        
        try:
            response, status = await with_idempotency(
                store=store,
                key=key,
                request_body=request.json,
                processor=process,
                user_id=request.json.get('user_id')
            )
            return Response(status=status, body=response)
        
        except IdempotencyConflictError:
            return Response(status=422, body={
                'error': 'Idempotency key was used with different request parameters'
            })
        
        except IdempotencyInProgressError:
            return Response(status=409, body={
                'error': 'Request is still being processed. Please retry shortly.'
            })

End of Day 2: Idempotency in Practice

Tomorrow: Day 3 — Circuit Breakers. We've solved the timeout problem (Day 1) and the duplicate problem (Day 2). But what happens when a downstream service is failing repeatedly? We don't want to keep trying and waiting. Circuit breakers let us fail fast and give struggling services time to recover.