Week 2 — Day 2: Idempotency in Practice
System Design Mastery Series
Preface
Yesterday, we built a payment service with proper timeouts. We solved the "waiting forever" problem. But we left a dangerous hole:
User clicks "Pay $99"
→ Your server calls bank API
→ Bank API times out after 3.5 seconds
→ You return: "Payment processing timed out"
Question: Did the bank charge the user or not?
Answer: You don't know.
The bank might have:
- Never received your request (network died before it arrived)
- Received it, processed it, but the response got lost
- Received it, still processing, will complete in 1 second
If you tell the user "try again" and they do, you might charge them twice.
This is the problem idempotency solves.
Today, we make our payment system safe to retry. No matter how many times a user clicks "Pay," they'll be charged exactly once.
Part I: Foundations
Chapter 1: What Is Idempotency?
1.1 The Simple Definition
An operation is idempotent if doing it multiple times has the same effect as doing it once.
Idempotent operations (safe to repeat):
"Set the thermostat to 72°F"
→ Do it once: temperature is 72°F
→ Do it twice: temperature is still 72°F
→ Do it 100 times: temperature is still 72°F ✓
"Delete file X"
→ Do it once: file is deleted
→ Do it twice: file is still deleted (no-op second time)
→ No harm in repeating ✓
Non-idempotent operations (dangerous to repeat):
"Add $10 to account"
→ Do it once: +$10
→ Do it twice: +$20 (double the intended effect!)
→ Repeating causes harm ✗
"Send email to user"
→ Do it once: 1 email
→ Do it twice: 2 emails (spam!)
→ Repeating causes harm ✗
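The difference is easy to see in code. A toy sketch (the account dict and function names are purely illustrative):

```python
account = {"balance": 50}

def set_balance(acct: dict, value: int) -> None:
    """Idempotent: repeating yields the same final state."""
    acct["balance"] = value

def add_funds(acct: dict, amount: int) -> None:
    """Not idempotent: every repeat changes the state again."""
    acct["balance"] += amount

set_balance(account, 72)
set_balance(account, 72)         # accidental retry: harmless
assert account["balance"] == 72

add_funds(account, 10)
add_funds(account, 10)           # accidental retry: double effect!
assert account["balance"] == 92  # the user only wanted 82
```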
1.2 The Everyday Analogy: The Light Switch
Think about pressing a light switch:
Toggle switch (non-idempotent):
Press once: Light turns ON
Press again: Light turns OFF
Press again: Light turns ON
Each press changes the state. Dangerous if you're not sure
how many times you pressed!
ON/OFF buttons (idempotent):
Press ON: Light is ON
Press ON again: Light is still ON
Press ON 10 times: Light is still ON
No matter how many times you press, you get the same result.
We want our payment system to behave like ON/OFF buttons, not toggle switches.
1.3 Why Distributed Systems Need Idempotency
In a perfect world:
- Client sends request
- Server processes it
- Server sends response
- Client receives response
But networks fail:
Scenario 1: Request lost
Client → [request lost] → Server
Server never sees request, nothing happens.
Client times out. Safe to retry. ✓
Scenario 2: Response lost
Client → Request → Server
Server processes successfully
Client ← [response lost] ← Server
Client times out.
From client's view: Same as Scenario 1!
But server already processed it!
If client retries: DOUBLE PROCESSING ✗
The client cannot tell the difference between "request lost" and "response lost."
This is fundamental. No amount of timeout tuning fixes it. The only solution is making operations safe to retry.
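A toy simulation makes the danger concrete. Here the server does the work, but the first response is "lost" before the client sees it, so a naive retry double-charges (the `Server` class and its behavior are illustrative, not real network code):

```python
class Server:
    """Fake server: processes every request, but the first response is lost."""
    def __init__(self):
        self.charges = []
        self.calls = 0

    def charge(self, amount: float) -> str:
        self.calls += 1
        self.charges.append(amount)              # the work happens...
        if self.calls == 1:
            raise TimeoutError("response lost")  # ...but the client never hears back
        return "ok"

server = Server()

# Naive client: on timeout, just retry.
try:
    server.charge(99)
except TimeoutError:
    server.charge(99)  # looks safe from the client's side...

assert len(server.charges) == 2  # ...but the user was charged twice
```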
1.4 HTTP Methods and Idempotency
HTTP defines some methods as idempotent by design:
| Method | Idempotent? | Why |
|---|---|---|
| GET | Yes | Reading doesn't change state |
| PUT | Yes | "Set resource to this value" is repeatable |
| DELETE | Yes | "Delete resource" — deleting twice = still deleted |
| HEAD | Yes | Same as GET, no body |
| OPTIONS | Yes | Just asking about capabilities |
| POST | No | "Create new resource" — creates duplicates! |
| PATCH | Depends | Could be "add $10" (no) or "set price to $10" (yes) |
POST is where the danger lives. And most APIs use POST for important operations like payments.
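One common remedy is to let the client choose the resource ID and use PUT, turning "create" into an idempotent "set". A toy in-memory sketch (the endpoints and helper names are illustrative):

```python
orders: dict[str, dict] = {}

def put_order(order_id: str, body: dict) -> dict:
    """PUT /orders/{order_id}: create-or-replace, safe to repeat."""
    orders[order_id] = body
    return orders[order_id]

def post_order(body: dict) -> str:
    """POST /orders: the server assigns a new ID, so repeats create duplicates."""
    new_id = f"order_{len(orders) + 1}"
    orders[new_id] = body
    return new_id

put_order("order_abc", {"amount": 99})
put_order("order_abc", {"amount": 99})  # retry: still one order
assert len(orders) == 1

post_order({"amount": 99})
post_order({"amount": 99})              # retry: now a duplicate
assert len(orders) == 3
```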
Chapter 2: The Idempotency Key Pattern
2.1 The Core Idea
The client generates a unique identifier for each logical operation. The server remembers this identifier and its result.
First request:
Client → POST /payments
Idempotency-Key: pay_abc123
{amount: 99.00}
Server: "I've never seen pay_abc123 before"
Process payment
Store: pay_abc123 → {status: success, id: txn_789}
Server → 200 OK {status: success, id: txn_789}
Second request (retry):
Client → POST /payments
Idempotency-Key: pay_abc123 (same key!)
{amount: 99.00}
Server: "I've seen pay_abc123 before!"
Look up stored result
Server → 200 OK {status: success, id: txn_789} (same response!)
The payment only happened once, but the client got the confirmation it needed.
2.2 The Mental Model: The Coat Check
Think of a coat check at a restaurant:
You arrive at restaurant:
You: "Here's my coat" (request)
Attendant: Takes coat, gives you ticket #42 (idempotency key)
Later, you're not sure if you got a ticket:
You: "Here's my coat again" + ticket #42
Attendant: "I already have your coat for #42, here's the same ticket"
You don't end up checking your coat twice.
The ticket is proof of the operation.
2.3 Implementation: Basic Version
```python
import hashlib
import json
import time
import redis
from typing import Optional, Tuple
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class IdempotencyRecord:
    key: str
    request_hash: str   # To detect conflicting requests
    status: str         # 'processing', 'completed', 'failed'
    response: Optional[dict]
    created_at: float


class IdempotencyStore:
    """
    Stores idempotency keys and their results.

    Uses Redis for fast lookups and automatic expiration.
    """

    def __init__(self, redis_client: redis.Redis, ttl_hours: int = 24):
        self.redis = redis_client
        self.ttl = timedelta(hours=ttl_hours)

    def _make_redis_key(self, idempotency_key: str) -> str:
        return f"idempotency:{idempotency_key}"

    def _hash_request(self, request_body: dict) -> str:
        """Create a hash of the request to detect conflicting retries."""
        serialized = json.dumps(request_body, sort_keys=True)
        return hashlib.sha256(serialized.encode()).hexdigest()[:16]

    def check_and_set(
        self,
        idempotency_key: str,
        request_body: dict
    ) -> Tuple[bool, Optional[dict]]:
        """
        Check if the key exists. If not, claim it.

        Returns:
            (is_new, existing_response)
            - (True, None) if this is a new key
            - (False, response) if the key exists with a completed response

        Raises:
            IdempotencyConflictError if the key exists with a different request.
            IdempotencyInProgressError if a previous request is still running.
        """
        redis_key = self._make_redis_key(idempotency_key)
        request_hash = self._hash_request(request_body)

        # Try to get an existing record
        existing = self.redis.get(redis_key)
        if existing:
            record = json.loads(existing)

            # Check that the request matches
            if record['request_hash'] != request_hash:
                raise IdempotencyConflictError(
                    f"Key {idempotency_key} used with different request"
                )

            # If completed, return the cached response
            if record['status'] == 'completed':
                return (False, record['response'])

            # If still processing, the client should wait
            if record['status'] == 'processing':
                raise IdempotencyInProgressError(
                    f"Request {idempotency_key} is still being processed"
                )

            # If failed, allow a retry
            if record['status'] == 'failed':
                record['status'] = 'processing'
                self.redis.setex(redis_key, self.ttl, json.dumps(record))
                return (True, None)

        # New key - claim it
        record = {
            'key': idempotency_key,
            'request_hash': request_hash,
            'status': 'processing',
            'response': None,
            'created_at': time.time()
        }

        # Use SETNX for an atomic claim (only set if not exists)
        claimed = self.redis.setnx(redis_key, json.dumps(record))
        if not claimed:
            # Race condition - another request claimed the key first.
            # Recurse to handle the now-existing record.
            return self.check_and_set(idempotency_key, request_body)

        # Set the TTL
        self.redis.expire(redis_key, self.ttl)
        return (True, None)

    def complete(self, idempotency_key: str, response: dict):
        """Mark the request as completed with its response."""
        redis_key = self._make_redis_key(idempotency_key)
        existing = self.redis.get(redis_key)
        if not existing:
            raise ValueError(f"No record for key {idempotency_key}")
        record = json.loads(existing)
        record['status'] = 'completed'
        record['response'] = response
        self.redis.setex(redis_key, self.ttl, json.dumps(record))

    def fail(self, idempotency_key: str, error: str):
        """Mark the request as failed."""
        redis_key = self._make_redis_key(idempotency_key)
        existing = self.redis.get(redis_key)
        if not existing:
            return
        record = json.loads(existing)
        record['status'] = 'failed'
        record['response'] = {'error': error}
        self.redis.setex(redis_key, self.ttl, json.dumps(record))


class IdempotencyConflictError(Exception):
    """Raised when an idempotency key is reused with a different request."""
    pass


class IdempotencyInProgressError(Exception):
    """Raised when a request is still being processed."""
    pass
```
2.4 Using the Idempotency Store
```python
from typing import Callable


def with_idempotency(
    idempotency_store: IdempotencyStore,
    idempotency_key: str,
    request_body: dict,
    process_func: Callable[[], dict]
) -> dict:
    """
    Execute a function with idempotency protection.

    If the key was seen before with the same request, return the cached
    response. If the key is new, execute the function and cache the result.
    """
    # Check if we've seen this request
    is_new, cached_response = idempotency_store.check_and_set(
        idempotency_key,
        request_body
    )

    if not is_new:
        # Return the cached response (this is a retry)
        return cached_response

    # Process the request
    try:
        response = process_func()
        idempotency_store.complete(idempotency_key, response)
        return response
    except Exception as e:
        idempotency_store.fail(idempotency_key, str(e))
        raise


# Usage in a payment handler
@app.post('/payments')
def create_payment(request):
    idempotency_key = request.headers.get('Idempotency-Key')
    if not idempotency_key:
        return Response(status=400, body="Idempotency-Key header required")

    def process():
        return payment_service.process_payment(
            user_id=request.json['user_id'],
            amount=request.json['amount']
        )

    try:
        result = with_idempotency(
            idempotency_store,
            idempotency_key,
            request.json,
            process
        )
        return Response(status=200, body=result)
    except IdempotencyConflictError:
        return Response(status=422, body="Idempotency key reused with different request")
    except IdempotencyInProgressError:
        return Response(status=409, body="Request still processing, please wait")
```
Chapter 3: Client-Generated vs Server-Generated Keys
3.1 Client-Generated Keys
The client creates the idempotency key and sends it with the request.
```python
# Client code
import uuid
import requests

def pay(amount: float):
    idempotency_key = f"pay_{uuid.uuid4()}"

    def send():
        return requests.post(
            '/payments',
            headers={'Idempotency-Key': idempotency_key},  # Same key every attempt!
            json={'amount': amount},
            timeout=5,
        )

    try:
        response = send()
        if response.status_code in (500, 502, 503, 504):
            # Server error - safe to retry with the same key
            response = send()
    except requests.exceptions.Timeout:
        # Timed out - also safe to retry with the same key
        response = send()

    return response
```
Pros:
- Client controls retry behavior
- Works across client restarts (if key is persisted)
- No server state needed before first request
Cons:
- Client might generate poor keys (duplicates, predictable)
- Client might forget the key and retry with a new one (defeats purpose)
- Requires client-side state management
3.2 Server-Generated Keys
Server provides a key that client uses for subsequent operations.
```python
# Step 1: Client requests a payment intent
response = requests.post('/payment-intents', json={'amount': 99.00})
payment_intent_id = response.json()['id']  # "pi_abc123"

# Step 2: Client confirms the payment (can retry safely)
response = requests.post(
    f'/payment-intents/{payment_intent_id}/confirm',
    json={'payment_method': 'card_xyz'}
)

# If step 2 times out, the client can retry with the same payment_intent_id.
# The server knows whether the confirmation already happened.
```
Pros:
- Server controls key format and uniqueness
- Natural fit for multi-step workflows
- Client can't generate bad keys
Cons:
- Requires extra round-trip to get key
- Server must store intent before payment
3.3 Hybrid: Deterministic Keys from Request
Derive the key from request content:
```python
import hashlib

def generate_idempotency_key(user_id: str, order_id: str, amount: float) -> str:
    """Generate a deterministic key from request parameters."""
    content = f"{user_id}:{order_id}:{amount}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]

# The same order always generates the same key
key = generate_idempotency_key("user_123", "order_456", 99.00)
# → "a7b3c9d2e1f0..."
```
Pros:
- No state needed on client
- Same logical operation always uses same key
- Natural deduplication
Cons:
- Can't distinguish intentional duplicates
- User buying same item twice = same key = blocked!
Solution: Add a timestamp bucket
```python
import hashlib
from datetime import datetime

def generate_idempotency_key(user_id: str, order_id: str, amount: float) -> str:
    # Include the hour so the same purchase is allowed again later
    hour = datetime.now().strftime("%Y%m%d%H")
    content = f"{user_id}:{order_id}:{amount}:{hour}"
    return hashlib.sha256(content.encode()).hexdigest()[:32]
```
3.4 Which to Choose?
| Scenario | Recommendation |
|---|---|
| Simple API with savvy clients | Client-generated UUID |
| Public API with unknown clients | Server-generated (payment intents) |
| Internal microservices | Deterministic from request |
| User-facing buttons (buy, submit) | Server-generated or deterministic |
Chapter 4: TTL and Deduplication Windows
4.1 How Long to Remember?
Idempotency records can't live forever:
- Storage cost grows unbounded
- Old keys might collide with new ones
- You need to allow legitimate re-submission eventually
But too short a TTL defeats the purpose:
- Request times out at 30 seconds
- TTL is 10 seconds
- Client retries at 35 seconds → key expired → double charge!
Timeline:
0s: Client sends request
3.5s: Request times out (from Day 1: our bank timeout)
5s: Client retries → should be deduplicated
30s: Client gives up
31s: Original request finally completes at bank (very slow!)
If TTL = 60s:
- 5s retry: deduplicated ✓
- 31s completion: recorded ✓
- User refreshes page at 45s: sees success ✓
If TTL = 10s:
- 5s retry: deduplicated ✓
- 31s completion: but record is gone!
- Orphan transaction, no record ✗
4.2 TTL Guidelines
```python
from datetime import timedelta

# Conservative recommendations
IDEMPOTENCY_TTL = {
    # For synchronous APIs (user waiting)
    'payment': timedelta(hours=24),

    # For async operations (processed later)
    'batch_job': timedelta(days=7),

    # For webhooks (external systems retry for days)
    'webhook': timedelta(days=3),

    # For internal services (known retry behavior)
    'internal_rpc': timedelta(hours=1),
}
```
Rule of thumb: TTL should be at least 10× your maximum retry window.
4.3 What to Store
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class IdempotencyRecord:
    # Required fields
    key: str                          # The idempotency key
    status: str                       # processing, completed, failed
    response: Optional[dict]          # Cached response to return
    created_at: datetime              # When first seen

    # Recommended fields
    request_hash: str                 # To detect conflicting requests
    completed_at: Optional[datetime]  # When completed

    # Optional but useful
    user_id: Optional[str]            # Who made the request
    request_path: str                 # Which endpoint
    request_body: dict                # Original request (for debugging)
    processing_time_ms: int           # How long it took
```
4.4 Storage Options
| Store | Pros | Cons |
|---|---|---|
| Redis | Fast, built-in TTL | Memory cost, persistence concerns |
| PostgreSQL | Durable, familiar | Slower, manual TTL cleanup |
| DynamoDB | Managed, TTL support | Cost at high volume |
| In-memory | Fastest | Lost on restart, not distributed |
For payments: Use durable storage (PostgreSQL/DynamoDB) with Redis cache in front.
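That layered setup amounts to a read-through lookup: check the cache first, fall back to the durable store, and repopulate the cache on a miss. A minimal sketch using plain dicts to stand in for Redis and PostgreSQL (the class and method names are assumptions, not a real API):

```python
from typing import Optional

class CachedIdempotencyLookup:
    def __init__(self, cache: dict, db: dict):
        self.cache = cache  # e.g. Redis in production
        self.db = db        # e.g. PostgreSQL in production

    def get(self, key: str) -> Optional[dict]:
        record = self.cache.get(key)      # fast path
        if record is None:
            record = self.db.get(key)     # durable fallback
            if record is not None:
                self.cache[key] = record  # repopulate the cache
        return record

    def put(self, key: str, record: dict) -> None:
        self.db[key] = record             # write durably first...
        self.cache[key] = record          # ...then cache

lookup = CachedIdempotencyLookup(cache={}, db={})
lookup.put("pay_abc123", {"status": "completed"})
lookup.cache.clear()  # simulate a cache restart
assert lookup.get("pay_abc123") == {"status": "completed"}
```

The write order matters: persisting to the durable store before caching means a crash between the two steps loses only cache speed, never the record itself.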
Chapter 5: The "Network Timeout" Problem
5.1 Yesterday's Unsolved Problem
From Day 1, our payment service:
```python
async def _charge_bank(self, budget, user_id, amount):
    try:
        response = await client.post(BANK_URL, json={...}, timeout=timeout)
        return PaymentResult(status=SUCCESS, transaction_id=response['id'])
    except TimeoutError:
        # ← This is the problem!
        # Did the bank charge the user or not?
        return PaymentResult(
            status=ERROR,
            error_message='Payment processing timed out'
        )
```
The user sees "timed out" and might retry. The bank might have charged them.
5.2 The Solution: Pre-Register Before Calling Bank
New flow:
1. Receive payment request with idempotency key
2. Check if key exists → return cached result if so
3. Create idempotency record with status='processing'
4. Call bank API
5. Whether success or failure, update idempotency record
6. Return result
On timeout:
- Idempotency record exists with status='processing'
- We don't know the outcome
- Client retries:
- We see status='processing'
- We check with bank: "Did transaction X complete?"
- Update record based on bank's answer
- Return consistent result
5.3 Implementation
```python
class PaymentServiceWithIdempotency:
    """
    Payment service from Day 1, now with idempotency.
    """

    def __init__(self, config: PaymentConfig, idempotency_store: IdempotencyStore):
        self.config = config
        self.idempotency = idempotency_store
        self.logger = logging.getLogger('payment_service')

    async def process_payment(
        self,
        user_id: str,
        amount: float,
        idempotency_key: str
    ) -> PaymentResult:
        """
        Process a payment with idempotency protection.

        Safe to call multiple times with the same idempotency_key.
        """
        request_body = {'user_id': user_id, 'amount': amount}

        # Step 1: Check idempotency
        try:
            is_new, cached = self.idempotency.check_and_set(
                idempotency_key,
                request_body
            )
            if not is_new:
                self.logger.info(f"Returning cached result for {idempotency_key}")
                return PaymentResult(**cached)
        except IdempotencyInProgressError:
            # A previous request is still processing.
            # Could be: our timeout fired, but the bank is still working.
            return await self._handle_in_progress(idempotency_key, request_body)
        except IdempotencyConflictError as e:
            return PaymentResult(status=PaymentStatus.ERROR, error_message=str(e))

        # Step 2: Process the payment (now safe to proceed)
        try:
            result = await self._do_payment(user_id, amount, idempotency_key)
            # Step 3: Record success
            self.idempotency.complete(idempotency_key, result.__dict__)
            return result
        except Exception as e:
            # Step 3: Record failure
            self.idempotency.fail(idempotency_key, str(e))
            raise

    async def _do_payment(
        self,
        user_id: str,
        amount: float,
        idempotency_key: str
    ) -> PaymentResult:
        """
        Actual payment processing (from Day 1).
        """
        budget = TimeoutBudget(self.config.total_budget_ms)

        # Fraud check
        fraud_result = await self._check_fraud(budget, user_id, amount)
        if fraud_result.status != PaymentStatus.SUCCESS:
            return fraud_result

        # Bank charge (pass the idempotency key to the bank!)
        bank_result = await self._charge_bank(
            budget, user_id, amount, idempotency_key
        )

        if bank_result.status == PaymentStatus.SUCCESS:
            # Non-critical notification
            asyncio.create_task(
                self._send_notification(user_id, amount, bank_result.transaction_id)
            )

        return bank_result

    async def _charge_bank(
        self,
        budget: TimeoutBudget,
        user_id: str,
        amount: float,
        idempotency_key: str
    ) -> PaymentResult:
        """
        Charge the bank with idempotency.

        Even if we time out, we can recover.
        """
        config = self.config.bank_api
        try:
            timeout = budget.get_timeout(config.read_timeout * 1000)

            # Pass our idempotency key to the bank.
            # The bank will also deduplicate on their end.
            response = await self.http_client.post(
                f"{config.url}/charge",
                headers={'Idempotency-Key': idempotency_key},
                json={'user_id': user_id, 'amount': amount},
                timeout=timeout
            )
            return PaymentResult(
                status=PaymentStatus.SUCCESS,
                transaction_id=response.json()['transaction_id']
            )
        except TimeoutError:
            # This is where Day 1 left us stuck!
            # Now we can handle it:
            return await self._recover_from_timeout(idempotency_key, user_id, amount)

    async def _recover_from_timeout(
        self,
        idempotency_key: str,
        user_id: str,
        amount: float
    ) -> PaymentResult:
        """
        Handle a bank timeout by checking transaction status.
        """
        self.logger.warning(f"Bank timeout for {idempotency_key}, checking status")

        try:
            # Ask the bank: "Did this transaction complete?"
            status_response = await self.http_client.get(
                f"{self.config.bank_api.url}/transactions",
                params={'idempotency_key': idempotency_key},
                timeout=5.0  # Short timeout for the status check
            )
            data = status_response.json()

            if data.get('found'):
                # The transaction exists at the bank
                if data['status'] == 'completed':
                    return PaymentResult(
                        status=PaymentStatus.SUCCESS,
                        transaction_id=data['transaction_id']
                    )
                elif data['status'] == 'failed':
                    return PaymentResult(
                        status=PaymentStatus.ERROR,
                        error_message=data.get('error', 'Bank declined')
                    )
                else:
                    # Still processing
                    return PaymentResult(
                        status=PaymentStatus.ERROR,
                        error_message='Payment is still processing. Please check back shortly.'
                    )
            else:
                # Transaction not found - the request never reached the bank.
                # Safe to return an error and allow a retry.
                return PaymentResult(
                    status=PaymentStatus.ERROR,
                    error_message='Payment could not be processed. Please try again.'
                )
        except Exception as e:
            # Can't even check status - be honest with the user
            self.logger.error(f"Status check failed: {e}")
            return PaymentResult(
                status=PaymentStatus.ERROR,
                error_message='Unable to confirm payment status. Please check your statement before retrying.'
            )

    async def _handle_in_progress(
        self,
        idempotency_key: str,
        request_body: dict
    ) -> PaymentResult:
        """
        Handle the case where a previous request is still processing.

        This happens when our timeout fired but the bank is still working.
        """
        self.logger.info(f"Request {idempotency_key} in progress, waiting")

        # Poll for completion. Note: check_and_set raises
        # IdempotencyInProgressError while the record is 'processing',
        # so we catch it and keep polling.
        for _ in range(10):  # Try for 10 seconds
            await asyncio.sleep(1)
            try:
                is_new, cached = self.idempotency.check_and_set(
                    idempotency_key,
                    request_body
                )
            except IdempotencyInProgressError:
                continue
            if not is_new and cached:
                return PaymentResult(**cached)

        # Still processing after 10 seconds
        return PaymentResult(
            status=PaymentStatus.ERROR,
            error_message='Payment is taking longer than expected. Please check back shortly.'
        )
```
Part II: The Design Challenge
Chapter 6: The Double-Click Problem
6.1 The Scenario
User experience:
1. User fills out payment form
2. User clicks "Pay Now"
3. Spinner appears
4. After 2 seconds, nothing visible happens
5. User thinks: "Did it work?" → clicks "Pay Now" again
6. System receives two payment requests within milliseconds
Without idempotency:
Request 1: Starts processing
Request 2: Also starts processing
Result: User charged twice!
With idempotency:
Request 1: Creates idempotency record, starts processing
Request 2: Sees idempotency record → returns "in progress" or waits
Result: User charged once ✓
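The two outcomes above can be demonstrated with a toy in-memory store and two threads standing in for the double-click (illustrative only; a real store would live in Redis or a database, as in Chapter 2):

```python
import threading

class InMemoryIdempotency:
    """Toy store: the first claim on a key wins."""
    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def claim(self, key: str) -> bool:
        with self._lock:  # the lock makes check-and-claim atomic
            if key in self._seen:
                return False
            self._seen.add(key)
            return True

charges = []
store = InMemoryIdempotency()

def handle_click(key: str, amount: float):
    if store.claim(key):
        charges.append(amount)  # only the first claimant charges

# Simulate a double-click: two near-simultaneous requests, same key
t1 = threading.Thread(target=handle_click, args=("order_123_pay_1", 99.0))
t2 = threading.Thread(target=handle_click, args=("order_123_pay_1", 99.0))
t1.start(); t2.start(); t1.join(); t2.join()

assert charges == [99.0]  # charged exactly once
```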
6.2 The Full Timeline
Let's trace through every scenario:
Timeline with idempotency:
T=0.000s: User clicks "Pay Now"
Client generates: idempotency_key = "order_123_pay_1"
Request 1 sent
T=0.001s: Server receives Request 1
Check idempotency store: not found
Create record: {key: "order_123_pay_1", status: "processing"}
Start fraud check
T=2.000s: User clicks "Pay Now" again (impatient)
Client sends Request 2 with SAME idempotency_key
T=2.001s: Server receives Request 2
Check idempotency store: found, status = "processing"
Return: "Request in progress, please wait"
Client shows: "Still processing..."
T=3.000s: Request 1 completes fraud check, calls bank
T=3.500s: Bank responds: success!
Update record: {status: "completed", response: {...}}
Request 1 returns: success
T=3.600s: Client (from Request 2) retries
Check idempotency store: found, status = "completed"
Return: cached response (success)
T=3.700s: User sees: "Payment successful!"
Only charged once ✓
6.3 The Response-Lost Scenario
Timeline when response is lost:
T=0.000s: Client sends Request 1
T=0.001s: Server creates idempotency record
T=3.000s: Server completes payment, sends response
T=3.001s: Response packet is lost in network
T=5.000s: Client timeout fires → shows error to user
T=5.100s: User clicks "Try Again"
T=5.200s: Client sends Request 2 (same idempotency key)
T=5.201s: Server checks idempotency store: found, status = "completed"
Return: cached response (success!)
T=5.300s: User sees: "Payment successful!"
Only charged once, despite the "error" ✓
6.4 The Server-Crash Scenario
Timeline when server crashes:
T=0.000s: Client sends Request 1
T=0.001s: Server creates idempotency record: status = "processing"
T=1.000s: Server crashes mid-processing!
T=5.000s: Client timeout → user retries
T=5.100s: Request 2 hits DIFFERENT server
T=5.101s: Check idempotency store: found, status = "processing"
But it's been 5 seconds... something is wrong
Two options:
Option A: Strict (reject)
Return: "Previous request may still be processing"
User must check statement manually
Option B: Recovery (check and retry)
Check bank: "Did this transaction complete?"
Bank: "No record of it"
Reset idempotency record, process new request
Option B gives better UX, but it requires the ability to query the bank for a transaction's status.
Chapter 7: Edge Cases That Will Bite You
7.1 Edge Case 1: Key Reuse with Different Request
```
# Dangerous: Same key, different amount

# Request 1
POST /payments
Idempotency-Key: key_abc
{amount: 100}

# Request 2 (bug or malicious)
POST /payments
Idempotency-Key: key_abc
{amount: 1000000}  # Much larger!

# What happens?
```
Solution: Hash the request body and compare
```python
def check_and_set(self, key: str, request_body: dict):
    request_hash = hash_request(request_body)
    existing = self.get(key)

    if existing:
        if existing.request_hash != request_hash:
            raise IdempotencyConflictError(
                "Key reused with different request parameters"
            )
        # ... rest of logic
```
7.2 Edge Case 2: Key Expiration Race
T=0: Request starts, idempotency record created
T=23h59m: Request finally completes (extreme case)
T=24h: TTL expires, record deleted
T=24h30s: Response sent to client
T=24h35s: Client receives timeout, retries
T=24h36s: No idempotency record found → processes again!
User charged twice, 24 hours apart.
Solution: Extend TTL on completion
```python
def complete(self, key: str, response: dict):
    record = self.get(key)
    record.status = 'completed'
    record.response = response

    # Extend the TTL so completed transactions keep a long-lived record
    extended_ttl = max(self.ttl, timedelta(hours=24))
    self.store(key, record, ttl=extended_ttl)
```
7.3 Edge Case 3: Concurrent Requests with Same Key
T=0.000s: Request A creates record, starts processing
T=0.001s: Request B checks record, sees "processing"
T=0.002s: Request A fails, marks record as "failed"
T=0.003s: Request B... what should it do?
Option A: Return the failure
But Request B didn't fail - it never ran!
Option B: Allow Request B to retry
Better - Request B gets a chance
Solution: Only completed requests block retries
```python
def check_and_set(self, key: str, request_body: dict):
    existing = self.get(key)

    if existing:
        if existing.status == 'completed':
            # Definitely don't retry - return the cached success
            return (False, existing.response)
        elif existing.status == 'processing':
            # Still running - make the caller wait
            raise IdempotencyInProgressError()
        elif existing.status == 'failed':
            # Previous attempt failed - allow a retry
            existing.status = 'processing'
            self.store(key, existing)
            return (True, None)

    # New key
    self.create(key, request_body)
    return (True, None)
```
7.4 Edge Case 4: Partial Completion
Payment flow:
1. Charge customer ✓
2. Credit merchant ← fails!
3. Send receipt
What's the idempotency status?
- "completed" would be wrong (merchant not credited)
- "failed" would allow retry (double charge customer!)
Solution: Atomic operations or saga pattern
```python
async def process_payment(self, ...):
    # Begin transaction
    async with self.db.transaction():
        # Charge the customer
        charge_result = await self.charge_customer(...)
        # Credit the merchant
        credit_result = await self.credit_merchant(...)
    # Both succeed or both fail (atomic)
```
Or with the saga pattern (Day 4 webhook topic):
```python
async def process_payment(self, ...):
    charge_result = await self.charge_customer(...)
    try:
        credit_result = await self.credit_merchant(...)
    except Exception:
        # Compensating action
        await self.refund_customer(charge_result.id)
        raise
```
Chapter 8: Idempotency Key Design
8.1 Key Format Recommendations
```python
# Good key formats:

# UUID (random, unique)
"550e8400-e29b-41d4-a716-446655440000"

# Prefixed UUID (self-documenting)
"pay_550e8400-e29b-41d4-a716-446655440000"

# Entity-based (deterministic)
"user_123_order_456_v1"

# Timestamp-bucketed (allows intentional retries after the window)
"user_123_order_456_2024010112"  # Includes the hour

# Bad key formats:

# Sequential (predictable, vulnerable)
"1", "2", "3"

# User-controlled (can be malicious)
"../../../etc/passwd"

# Too short (collision risk)
"abc"

# Timestamp only (same user, same millisecond = collision)
"1704067200000"
```
8.2 Key Generation Best Practices
```python
import hashlib
import uuid
from datetime import datetime, timezone

class IdempotencyKeyGenerator:
    """Generate idempotency keys with various strategies."""

    @staticmethod
    def random() -> str:
        """Random UUID - use when the client controls retries."""
        return f"key_{uuid.uuid4()}"

    @staticmethod
    def deterministic(user_id: str, action: str, entity_id: str) -> str:
        """Deterministic key - same input = same key."""
        content = f"{user_id}:{action}:{entity_id}"
        return hashlib.sha256(content.encode()).hexdigest()[:32]

    @staticmethod
    def time_bucketed(
        user_id: str,
        action: str,
        entity_id: str,
        bucket_minutes: int = 60
    ) -> str:
        """Allow the same action again after a time window."""
        # Floor the current time to the start of its bucket. Working in
        # whole minutes since the epoch handles buckets longer than an
        # hour (e.g. 1440 for daily) correctly.
        now = datetime.now(timezone.utc)
        epoch_minutes = int(now.timestamp() // 60)
        bucket = (epoch_minutes // bucket_minutes) * bucket_minutes
        content = f"{user_id}:{action}:{entity_id}:{bucket}"
        return hashlib.sha256(content.encode()).hexdigest()[:32]

    @staticmethod
    def composite(prefix: str, *parts: str) -> str:
        """Readable composite key."""
        safe_parts = [p.replace('_', '-') for p in parts]
        return f"{prefix}_{'_'.join(safe_parts)}"


# Usage examples:

# For a user-initiated payment (use random)
key = IdempotencyKeyGenerator.random()
# → "key_550e8400-e29b-41d4-a716-446655440000"

# For an order payment (use deterministic)
key = IdempotencyKeyGenerator.deterministic("user_123", "pay", "order_456")
# → "a1b2c3d4e5f6..."

# For a daily charge (use time-bucketed - 1440 minutes)
key = IdempotencyKeyGenerator.time_bucketed("user_123", "subscription", "sub_789", 1440)
# → "x1y2z3..." (changes each day)

# For a webhook (use composite for readability)
key = IdempotencyKeyGenerator.composite("webhook", "order_created", "order_456")
# → "webhook_order-created_order-456"
```
8.3 Key Validation
```python
import re

class IdempotencyKeyValidator:
    """Validate idempotency keys."""

    MIN_LENGTH = 16
    MAX_LENGTH = 256
    PATTERN = re.compile(r'^[a-zA-Z0-9_\-]+$')

    @classmethod
    def validate(cls, key: str) -> tuple[bool, str]:
        """
        Validate the key format.

        Returns (is_valid, error_message).
        """
        if not key:
            return False, "Idempotency key is required"
        if len(key) < cls.MIN_LENGTH:
            return False, f"Key too short (min {cls.MIN_LENGTH} chars)"
        if len(key) > cls.MAX_LENGTH:
            return False, f"Key too long (max {cls.MAX_LENGTH} chars)"
        if not cls.PATTERN.match(key):
            return False, "Key contains invalid characters"
        return True, ""


# Usage in middleware
@app.middleware
def validate_idempotency_key(request, call_next):
    key = request.headers.get('Idempotency-Key')

    if request.method == 'POST':  # Only required for non-idempotent methods
        if not key:
            return Response(status=400, body="Idempotency-Key header required")

        is_valid, error = IdempotencyKeyValidator.validate(key)
        if not is_valid:
            return Response(status=400, body=error)

    return call_next(request)
```
Part III: The Document Challenge
Chapter 9: Your Idempotency Strategy Document
Here's a template for documenting your idempotency strategy:
```markdown
# Payment Idempotency Strategy

## Overview

All payment operations MUST be idempotent. Users can safely retry any
payment operation without risk of duplicate charges.

## Idempotency Key Format

### Client-Generated Keys (API Consumers)
- Format: `pay_<uuid4>`
- Example: `pay_550e8400-e29b-41d4-a716-446655440000`
- Required header: `Idempotency-Key: <key>`
- Validation: 16-256 alphanumeric characters plus `-` and `_`

### Server-Generated Keys (Internal)
- Format: `<action>_<entity>_<hash>`
- Example: `charge_order_abc123_v1`
- Generated from: user_id + order_id + amount

## Storage

- Primary: PostgreSQL `idempotency_keys` table
- Cache: Redis with 1-hour TTL
- Record TTL: 7 days for completed, 24 hours for failed

## Deduplication Window

| Scenario | Window |
|----------|--------|
| User retry (network timeout) | Immediate |
| Accidental double-click | Immediate |
| Page refresh | 1 hour |
| Intentional re-purchase | After 24 hours |

## Edge Cases

### 1. Same Key, Different Request
- Detection: Hash request body, compare on lookup
- Response: HTTP 422 "Idempotency key reused with different parameters"

### 2. Request In Progress
- Detection: Record exists with status='processing'
- Response: HTTP 409 "Request is being processed"
- Client action: Retry with exponential backoff

### 3. Previous Request Failed
- Detection: Record exists with status='failed'
- Response: Allow new attempt (reset status to 'processing')

### 4. Key Expired
- Detection: Record not found (TTL expired)
- Response: Process as new request
- Risk: Potential duplicate if original eventually completes
- Mitigation: Check bank API for existing transaction

## Reconciliation

Daily job to detect inconsistencies:

1. Find payments where our record differs from the bank's
2. Find orphaned transactions (bank charged, we have no record)
3. Alert on any discrepancies for manual review

## Monitoring

### Alerts
- Duplicate attempt rate > 5% (user experience issue)
- Conflict rate > 0.1% (potential bug or attack)
- Orphaned transaction detected (immediate page)

### Metrics
- `idempotency_cache_hit_rate` - Should be > 95%
- `idempotency_duplicate_prevented_total` - Growing is good
- `idempotency_conflict_total` - Should be near zero
```
Part IV: Discussion and Trade-offs
Chapter 10: The Hard Questions
10.1 "User clicks pay twice. Network timeout on first request—did it go through?"
Strong Answer (connecting to Day 1):
"This is exactly why we need idempotency, and why timeouts alone don't solve the problem.
When the first request times out, we don't know if the bank processed it. From Day 1, we learned that a timeout could mean:
- Request never reached bank
- Bank processed it, response was lost
- Bank is still processing
With idempotency:
Before calling bank:
- Client sends idempotency key: pay_abc123
- Server records: {key: pay_abc123, status: processing}
Timeout occurs:
- Server doesn't know outcome
- Returns: 'Payment processing timed out. Please check your statement.'
User clicks again (same key):
- Server sees: status = 'processing'
- Server asks bank: 'What happened to pay_abc123?'
- Bank says: 'Completed' or 'Not found'
- Server updates record and returns appropriate response
The key insight is: we pass our idempotency key TO the bank. The bank also deduplicates. So even if we call twice, they only charge once. And we can query by that key to find out what happened."
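The status-check recovery described above can be sketched as follows. The `lookup`-by-key endpoint and the record/transaction shapes are hypothetical stand-ins for a real processor's API:

```python
def resolve_ambiguous_payment(record: dict, bank_api) -> dict:
    """Given a record stuck in 'processing', ask the bank what happened."""
    bank_txn = bank_api.lookup(record["key"])  # query bank by idempotency key
    if bank_txn is None:
        # Bank never saw the request: safe to retry with the same key
        record["status"] = "retryable"
    elif bank_txn["state"] == "completed":
        # Charge went through but the response was lost: cache and return it
        record["status"] = "completed"
        record["response"] = bank_txn
    else:
        # Bank is still working on it: keep waiting, do not re-submit
        record["status"] = "processing"
    return record

class StubBank:
    """Test double simulating 'charge completed, response was lost'."""
    def lookup(self, key):
        return {"state": "completed", "key": key, "amount": 99}

record = resolve_ambiguous_payment(
    {"key": "pay_abc123", "status": "processing"}, StubBank()
)
```

The important property: every branch either resolves the ambiguity or leaves the record untouched; none of them blindly re-submits the charge.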
10.2 "How long do you keep idempotency records?"
Strong Answer:
"It depends on the retry pattern and compliance requirements.
Minimum: At least 10× your maximum retry window. If your timeout is 30 seconds and clients retry 3 times, the retry window is about 90 seconds, so keep records for at least 15 minutes.
Practical default: 24 hours covers most user retry scenarios:
- Immediate double-click: covered
- Refresh page after 5 minutes: covered
- Come back next day and retry: NOT covered (intentional)
For financial transactions: Often 7+ days for compliance. You need to prove no duplicates occurred.
Storage strategy:
Redis (hot): 1 hour TTL - fast lookups for recent requests
PostgreSQL (cold): 30 day retention - audit trail and reconciliation
Archive (frozen): 7 years - regulatory compliance
The TTL tradeoff:
- Too short: Legitimate retries create duplicates
- Too long: Storage costs, stale data, potential key collisions
I'd rather err on the side of too long—storage is cheap, duplicate charges are expensive."
10.3 "Client-generated vs server-generated idempotency keys—which is better?"
Strong Answer:
"Neither is universally better. The right choice depends on the client and use case.
Client-generated (UUID):
- Best for: API clients you trust (internal services, partner integrations)
- Pros: Client controls retry logic, no extra round-trip
- Cons: Client might generate weak keys or lose the key
Server-generated (payment intents):
- Best for: Public APIs, web/mobile frontends
- Pros: Server ensures key quality, natural multi-step flow
- Cons: Extra round-trip, more server state
Deterministic (from request):
- Best for: Internal microservices, known request shapes
- Pros: No state to manage, same request = same key automatically
- Cons: Can't distinguish intentional duplicates
For a public payment API like Stripe, I'd use server-generated (payment intents pattern). Create the intent, get an ID, then confirm with that ID. The ID serves as both the idempotency key and the resource reference.
For internal service-to-service calls, I'd use deterministic keys derived from the request. The calling service doesn't need to manage state, and duplicates are naturally deduplicated."
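The deterministic option can be sketched in a few lines. The field choices and the 32-character SHA-256 truncation are illustrative, not a fixed standard:

```python
import hashlib

def deterministic_key(action: str, user_id: str, order_id: str,
                      amount_cents: int) -> str:
    """Derive an idempotency key from the request itself.

    The same logical request always yields the same key, so the caller
    manages no state. An intentional re-purchase must change one of the
    inputs (e.g. a fresh order_id).
    """
    payload = f"{user_id}:{order_id}:{amount_cents}"
    digest = hashlib.sha256(payload.encode()).hexdigest()[:32]
    return f"{action}_{digest}"

k1 = deterministic_key("charge", "user_42", "order_7", 9900)
k2 = deterministic_key("charge", "user_42", "order_7", 9900)  # identical request
k3 = deterministic_key("charge", "user_42", "order_8", 9900)  # different order
```

This also illustrates the stated con: because `k1 == k2` by construction, the scheme cannot express "charge the same order twice on purpose" without changing an input.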
Chapter 11: Session Summary
What You Should Know Now
After this session, you should be able to:
- Explain why idempotency matters — Networks lose responses, users double-click
- Implement idempotency keys — Store, check, and return cached responses
- Handle the timeout problem — Status checks, recovery flows
- Design key strategies — Client vs server generated, TTL choices
- Document your strategy — Edge cases, monitoring, reconciliation
Connection to Day 1
Yesterday: "How do I stop waiting forever?" → Timeouts
Today: "What if my timeout fires but the operation succeeded?" → Idempotency
Together, they solve the reliability problem:
- Timeouts prevent resource exhaustion
- Idempotency prevents duplicate effects
- Your system is both responsive AND correct
Key Trade-offs
| Decision | Trade-off |
|---|---|
| Shorter TTL | Less storage vs Risk of duplicates after expiry |
| Client keys | Simpler API vs Trust client implementation |
| Server keys | More control vs Extra round-trip |
| Strict matching | Detect conflicts vs Reject legitimate retries |
Questions to Ask in Every Design
- What operations are non-idempotent? (Usually writes)
- How will clients retry? (Same key or new key?)
- What's the deduplication window? (Minutes? Hours? Days?)
- How do we recover from ambiguous failures? (Status check APIs?)
- How do we reconcile mismatches? (Audit logs, daily jobs?)
Part V: Interview Questions and Answers
Chapter 12: Real-World Interview Scenarios
12.1 Conceptual Questions
Question 1: "What is idempotency and why is it important?"
Interviewer's Intent: Testing foundational understanding.
Strong Answer:
"Idempotency means an operation can be performed multiple times with the same effect as performing it once. It's crucial in distributed systems because networks are unreliable.
Consider a payment: the user clicks 'Pay', the bank charges them, but the response is lost. From the user's perspective, it failed. They click again. Without idempotency, they're charged twice.
The fundamental problem is that a client cannot distinguish between 'request never arrived' and 'response was lost.' Both look like a timeout. Idempotency makes retries safe by remembering what we've already processed.
HTTP defines GET, PUT, and DELETE as idempotent methods. POST typically is not, which is why we add idempotency keys to POST endpoints for critical operations like payments."
Question 2: "Explain the difference between at-most-once, at-least-once, and exactly-once delivery."
Interviewer's Intent: Testing distributed systems knowledge.
Strong Answer:
"These describe message delivery guarantees:
At-most-once: Fire and forget. Send message, don't check if received. Message may be lost. Example: UDP, fire-and-forget events.
At-least-once: Retry until acknowledged. Message delivered one or more times. Duplicates possible. Example: most message queues with retry.
Exactly-once: Message delivered exactly one time. No loss, no duplicates.
Here's the key insight: exactly-once is impossible in a distributed system without cooperation from the receiver. What we actually implement is:
'At-least-once delivery with idempotent processing'
The sender retries until acknowledged (at-least-once). The receiver deduplicates (idempotent). The combination behaves like exactly-once from the application's perspective.
So when someone says their system supports exactly-once, they really mean: we handle duplicates for you."
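The "at-least-once delivery plus idempotent processing" combination can be shown with a toy consumer. The in-memory seen-set stands in for what would be Redis or a database table in production:

```python
class IdempotentConsumer:
    """Deduplicates redelivered messages by message_id."""

    def __init__(self):
        self._seen: set[str] = set()
        self.processed: list[dict] = []

    def handle(self, message: dict) -> bool:
        """Process a message at most once; return True if it was new."""
        msg_id = message["message_id"]
        if msg_id in self._seen:
            return False  # duplicate delivery: acknowledge and skip
        self._seen.add(msg_id)
        self.processed.append(message)  # the actual side effect
        return True

consumer = IdempotentConsumer()
m1 = {"message_id": "m1", "amount": 10}
consumer.handle(m1)
consumer.handle(m1)  # broker redelivery (at-least-once): deduplicated
```

The sender may deliver `m1` any number of times; the side effect happens exactly once, which is the "exactly-once from the application's perspective" behavior described above.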
Question 3: "How would you handle partial failures in an idempotent operation?"
Interviewer's Intent: Testing understanding of complex scenarios.
Strong Answer:
"Partial failures are tricky because the idempotency record might not reflect the true state.
Example: Payment flow
- Charge customer's card ✓
- Credit merchant's account ✗ (fails)
- Send receipt
The card was charged but merchant wasn't credited. What's our idempotency status?
Three approaches:
Approach 1: Atomic transactions
Wrap all operations in a database transaction. Either all succeed or all fail. The idempotency record is only written on full success.
with db.transaction():
charge_customer()
credit_merchant()
idempotency.complete(key) # Only reached if both succeed
Approach 2: Saga with compensation
If step 2 fails, undo step 1 before marking failed:
charge_result = charge_customer()
try:
credit_merchant()
except:
refund_customer(charge_result) # Compensating action
idempotency.fail(key)
raise
idempotency.complete(key)
Approach 3: State machine
Track which steps completed. On retry, resume from the last successful step:
state = idempotency.get_state(key)
if state.step < 1:
charge_customer()
idempotency.set_step(key, 1)
if state.step < 2:
credit_merchant()
idempotency.set_step(key, 2)
idempotency.complete(key)
The best approach depends on whether operations can be undone and whether they're in the same database."
12.2 Design Questions
Question 4: "Design the idempotency layer for a payment API."
Interviewer's Intent: Testing end-to-end design.
Strong Answer:
"I'll design this layer by layer.
API Contract:
POST /v1/payments
Headers:
Idempotency-Key: <required, string, 16-256 chars>
Body:
{amount, currency, customer_id, payment_method}
Response:
{id, status, amount, created_at}
Key validation:
- Required for all POST requests
- 16-256 alphanumeric plus - and _
- Reject reserved prefixes (system_, test_)
Storage design:
CREATE TABLE idempotency_records (
key VARCHAR(256) PRIMARY KEY,
request_hash CHAR(64),
status VARCHAR(20), -- processing, completed, failed
response JSONB,
created_at TIMESTAMP,
completed_at TIMESTAMP,
expires_at TIMESTAMP
);
CREATE INDEX idx_expires ON idempotency_records(expires_at);
Plus Redis cache for hot lookups.
Processing flow:
def process_payment(request):
key = request.headers['Idempotency-Key']
# 1. Check cache
cached = redis.get(f"idem:{key}")
if cached and cached.status == 'completed':
return cached.response
# 2. Check/claim in database
with db.transaction():
existing = db.select_for_update(key)
if existing:
if existing.status == 'completed':
return existing.response
if existing.status == 'processing':
raise InProgressError()
# status == 'failed': allow retry
db.upsert(key, status='processing', request_hash=hash(request))
# 3. Process payment
try:
result = bank_api.charge(...)
# 4. Record success
db.update(key, status='completed', response=result)
redis.setex(f"idem:{key}", 3600, result)
return result
except Exception as e:
db.update(key, status='failed')
raise
Edge cases handled:
- Double click: Second request sees 'processing', waits or returns 409
- Retry after error: 'failed' status allows retry
- Different request same key: Compare request_hash, return 422
- TTL expiry: Cleanup job, but also check bank for existing transaction
Monitoring:
- Cache hit rate: Should be high
- Duplicate rate: Healthy system sees some (expected retries)
- Conflict rate: Should be near zero (indicates client bugs)"
Question 5: "How would you implement idempotency for a multi-step workflow?"
Interviewer's Intent: Testing complex scenario handling.
Strong Answer:
"Multi-step workflows need idempotency at both workflow and step level.
Example: Booking flow
- Reserve inventory
- Charge payment
- Confirm booking
- Send confirmation email
Approach: State machine with step tracking
class BookingWorkflow:
STEPS = ['reserve', 'charge', 'confirm', 'notify']
def execute(self, idempotency_key: str, booking_data: dict):
# Get or create workflow state
state = self.get_state(idempotency_key)
if state.status == 'completed':
return state.result
try:
# Resume from last successful step
for step in self.STEPS:
if step not in state.completed_steps:
result = self.execute_step(step, booking_data, state)
state.completed_steps.append(step)
state.step_results[step] = result
self.save_state(idempotency_key, state)
state.status = 'completed'
self.save_state(idempotency_key, state)
return state.result
except Exception as e:
state.status = 'failed'
state.error = str(e)
self.save_state(idempotency_key, state)
raise
def execute_step(self, step: str, data: dict, state: WorkflowState):
step_key = f"{state.idempotency_key}_{step}"
# Each step is also idempotent
if step == 'reserve':
return self.inventory.reserve(data['items'], idempotency_key=step_key)
elif step == 'charge':
return self.payments.charge(data['amount'], idempotency_key=step_key)
# ... etc
Key design decisions:
- Workflow-level key: Identifies the entire booking
- Step-level keys: Derived from workflow key, make each step idempotent
- State persistence: Save after each step so we can resume
- Compensation: If step 3 fails, we might need to undo steps 1-2
For compensation:
def execute_with_compensation(self, ...):
completed = []
try:
for step in self.STEPS:
result = self.execute_step(step, ...)
completed.append((step, result))
except:
# Rollback in reverse order
for step, result in reversed(completed):
self.compensate(step, result)
raise
This is basically the Saga pattern, which we'll cover more in Day 4."
12.3 Scenario-Based Questions
Question 6: "User reports being charged twice. How do you investigate?"
Interviewer's Intent: Testing operational skills.
Strong Answer:
"I'd investigate systematically:
Step 1: Gather facts
- Get user's account ID, approximate time
- Get any transaction IDs they have (email receipts)
Step 2: Check our records
SELECT * FROM payments
WHERE user_id = 'xyz'
AND created_at > '2024-01-01'
ORDER BY created_at;
Look for:
- Two transactions with same amount, close timestamps
- Different transaction IDs = genuinely processed twice
- Same transaction ID = UI bug showing twice
Step 3: Check idempotency records
SELECT * FROM idempotency_records
WHERE key LIKE 'pay_xyz_%'
AND created_at > '2024-01-01';
Look for:
- Two different keys (user retried with new key = our client bug)
- Same key, two completions (shouldn't happen, major bug)
- Key conflict errors (user might have reused key incorrectly)
Step 4: Check bank/payment processor
- Query their API for transactions
- Compare their records to ours
- They're the source of truth for actual charges
Step 5: Root cause
Likely causes:
- Client generated new key on retry: Fix client code
- Idempotency bypass: Some code path doesn't check idempotency
- TTL expired between attempts: Increase TTL
- Different API endpoints used: One idempotent, one not
Step 6: Remediation
- If double charge confirmed: Refund one transaction
- If our bug: Write postmortem, deploy fix
- If client bug: Help them fix, document better
Prevention:
- Add monitoring for duplicate-looking transactions
- Alert when same user has multiple charges within X minutes
- Reconciliation job comparing our records to bank"
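The duplicate-alerting idea in the prevention list can be sketched as a small scan over recent charges. Field names and the five-minute window are illustrative:

```python
from datetime import datetime, timedelta

def find_suspect_duplicates(charges: list[dict],
                            window: timedelta = timedelta(minutes=5)) -> list[tuple]:
    """Flag adjacent charges with the same user and amount within `window`.

    A monitoring job would run this over recent transactions and alert
    on any hits for manual review.
    """
    suspects = []
    ordered = sorted(charges, key=lambda c: (c["user_id"], c["created_at"]))
    for prev, cur in zip(ordered, ordered[1:]):
        if (prev["user_id"] == cur["user_id"]
                and prev["amount"] == cur["amount"]
                and cur["created_at"] - prev["created_at"] <= window):
            suspects.append((prev["id"], cur["id"]))
    return suspects

charges = [
    {"id": "t1", "user_id": "u1", "amount": 99, "created_at": datetime(2024, 1, 1, 12, 0)},
    {"id": "t2", "user_id": "u1", "amount": 99, "created_at": datetime(2024, 1, 1, 12, 2)},
    {"id": "t3", "user_id": "u2", "amount": 99, "created_at": datetime(2024, 1, 1, 12, 3)},
]
dupes = find_suspect_duplicates(charges)
```

A hit is not proof of a bug (a user may legitimately buy the same thing twice), which is why this feeds review rather than automated refunds.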
Question 7: "How would you test idempotency implementation?"
Interviewer's Intent: Testing quality mindset.
Strong Answer:
"I'd test at multiple levels:
Unit tests: Core logic
def test_new_key_is_processed():
store = IdempotencyStore()
is_new, cached = store.check_and_set('key1', {'a': 1})
assert is_new == True
assert cached is None
def test_existing_completed_returns_cached():
store = IdempotencyStore()
store.check_and_set('key1', {'a': 1})
store.complete('key1', {'result': 'ok'})
is_new, cached = store.check_and_set('key1', {'a': 1})
assert is_new == False
assert cached == {'result': 'ok'}
def test_different_request_same_key_raises():
store = IdempotencyStore()
store.check_and_set('key1', {'a': 1})
with pytest.raises(IdempotencyConflictError):
store.check_and_set('key1', {'a': 2}) # Different request!
Integration tests: Full flow
def test_duplicate_payment_deduplicated():
key = 'pay_test123'
# First request
resp1 = client.post('/payments',
headers={'Idempotency-Key': key},
json={'amount': 100})
assert resp1.status == 200
# Second request (same key)
resp2 = client.post('/payments',
headers={'Idempotency-Key': key},
json={'amount': 100})
assert resp2.status == 200
assert resp2.json() == resp1.json()
# Verify only one charge
charges = bank_mock.get_charges()
assert len(charges) == 1
Concurrency tests
def test_concurrent_same_key():
key = 'pay_concurrent'
results = []
def make_request():
resp = client.post('/payments',
headers={'Idempotency-Key': key},
json={'amount': 100})
results.append(resp)
# Fire 10 requests simultaneously
threads = [Thread(target=make_request) for _ in range(10)]
[t.start() for t in threads]
[t.join() for t in threads]
# All should get same response
assert len(set(r.json()['id'] for r in results)) == 1
# Only one actual charge
assert len(bank_mock.get_charges()) == 1
Failure injection tests
def test_timeout_then_retry():
key = 'pay_timeout'
# Make bank slow on first call
bank_mock.set_delay(10) # 10 seconds
with pytest.raises(Timeout):
client.post('/payments',
headers={'Idempotency-Key': key},
json={'amount': 100},
timeout=1)
# Bank actually processed it (slowly)
bank_mock.set_delay(0)
# Retry should return the result
resp = client.post('/payments',
headers={'Idempotency-Key': key},
json={'amount': 100})
assert resp.json()['status'] == 'success'
assert len(bank_mock.get_charges()) == 1
Production verification
- Deploy to staging, run load test with duplicates
- Monitor: duplicate detection rate, conflict rate
- Shadow mode: Log duplicates without deduplicating, compare to expected"
12.4 Deep-Dive Questions
Question 8: "How do systems like Stripe implement idempotency?"
Interviewer's Intent: Testing industry knowledge.
Strong Answer:
"Stripe's approach is well-documented and is the industry standard:
Key format:
- Client provides `Idempotency-Key` header
- Max 255 characters
- Any string works, but they recommend UUIDs
Behavior:
- Keys are scoped to API key (different accounts can use same key)
- 24-hour TTL after first use
- Replays must have identical request body
- Returns cached response with original status code
Status tracking:
- Requests in flight are tracked
- Concurrent duplicate returns 409 with 'request in progress'
- After completion, returns cached response
What makes it good:
- Request fingerprinting: They hash the request and compare. Same key + different body = error.
- Saved for 24 hours: Generous window for retries.
- Scoped to API key: Prevents conflicts between merchants.
- Works across their whole API: Consistent behavior everywhere.
Implementation insights from their blog:
They store:
- Key
- Request hash
- Response (full, including status code)
- Created at
- API key ID
They handle racing requests with database locks, ensuring only one executes while others wait.
They recommend clients:
- Generate UUID for each logical operation
- Store key with the operation until confirmed
- Retry with same key on network errors
This is the pattern I'd implement. It's battle-tested at massive scale."
Question 9: "Compare idempotency at the API layer vs the database layer."
Interviewer's Intent: Testing architectural thinking.
Strong Answer:
"Both layers can provide idempotency, but they solve different problems.
API layer idempotency:
Where: Application code, before business logic
What it catches: Duplicate API calls (user clicks twice, client retries)
How: Check idempotency key, return cached response
@app.post('/payments')
def create_payment(request):
if idempotency_exists(request.key):
return cached_response(request.key)
# ... process payment
Pros: Catches duplicates early, saves processing time
Cons: Doesn't catch bugs in your own code
Database layer idempotency:
Where: Database constraints
What it catches: Duplicate records from any source
How: Unique constraints on business keys
CREATE TABLE payments (
id SERIAL PRIMARY KEY,
idempotency_key VARCHAR(255) UNIQUE,
-- OR business key constraint
UNIQUE(user_id, order_id)
);
Pros: Last line of defense, catches bugs in your code
Cons: Happens late, after processing work
Best practice: Use both!
Request → API Idempotency Check (fast rejection) → Business Logic → Database Constraint (final safety net)
API layer is the primary defense: fast, returns cached response. Database constraint is backup: catches bugs you didn't anticipate.
The database constraint should rarely trigger in production. If it does frequently, that's a sign your API layer idempotency is broken.
For payments specifically, I'd also add:
- Application-level check of recent transactions
- Reconciliation with payment processor
- Alerts on constraint violations
Defense in depth is the right approach for critical flows."
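The "database constraint as final safety net" point can be demonstrated end to end. SQLite stands in for PostgreSQL here, and the table is a stripped-down version of the payments schema:

```python
import sqlite3

# UNIQUE constraint rejects a duplicate even if every API-layer
# idempotency check is bypassed by a bug.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY,
        idempotency_key TEXT UNIQUE,
        amount INTEGER
    )
""")

def insert_payment(key: str, amount: int) -> bool:
    """Return True if inserted, False if the constraint blocked a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (idempotency_key, amount) VALUES (?, ?)",
                (key, amount),
            )
        return True
    except sqlite3.IntegrityError:
        # Last line of defense fired: this should be rare, so alert on it
        return False

first = insert_payment("pay_abc123", 9900)
second = insert_payment("pay_abc123", 9900)  # duplicate blocked by constraint
```

Exactly as the answer argues: the constraint should almost never fire in production, and when it does, that is a signal worth alerting on rather than silently swallowing.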
Chapter 13: Interview Preparation Checklist
Before your interview, make sure you can:
Concepts
- Explain idempotency with a real-world example
- Describe the network timeout problem
- Compare at-most-once, at-least-once, exactly-once
Implementation
- Design an idempotency key storage system
- Handle concurrent requests with same key
- Recover from ambiguous failures (timeout)
Design Decisions
- Compare client vs server generated keys
- Choose appropriate TTL for different scenarios
- Handle partial failures in multi-step operations
Operations
- Investigate duplicate charge reports
- Design monitoring for idempotency layer
- Write tests for idempotency behavior
Exercises
Exercise 1: Implement Full Idempotency Layer
Build an idempotency system that:
- Stores keys in Redis with PostgreSQL backup
- Handles concurrent requests safely
- Supports status polling for long operations
- Includes metrics and logging
Exercise 2: Idempotency for Order Creation
Design idempotency for an order system where:
- Users can place orders
- Each order has multiple items
- Inventory must be reserved
- Payment must be processed
- User might click "Place Order" twice
Exercise 3: Reconciliation System
Build a reconciliation job that:
- Compares your payment records to bank records
- Identifies mismatches (duplicate charges, missing records)
- Generates alerts for human review
- Runs daily and produces a report
Further Reading
- Stripe API Documentation: Idempotent Requests
- AWS: Building Distributed Applications: Idempotency patterns
- "Designing Data-Intensive Applications": Chapter on distributed transactions
- Brandur Leach's Blog: Implementing Stripe-like Idempotency Keys in Postgres
Appendix: Complete Idempotency Implementation
"""
Production-ready idempotency implementation.
Builds on Day 1's payment service, adding retry safety.
"""
import hashlib
import json
import time
import asyncio
import logging
from dataclasses import dataclass, field, asdict
from typing import Optional, Dict, Any, Callable, Awaitable
from datetime import datetime, timedelta
from enum import Enum
import redis.asyncio as redis
import asyncpg
from prometheus_client import Counter, Histogram
# =============================================================================
# Metrics
# =============================================================================
idempotency_checks = Counter(
'idempotency_checks_total',
'Idempotency check results',
['result'] # new, cached, conflict, in_progress
)
idempotency_latency = Histogram(
'idempotency_check_seconds',
'Idempotency check latency',
buckets=[.001, .005, .01, .025, .05, .1, .25, .5]
)
# =============================================================================
# Types
# =============================================================================
class IdempotencyStatus(Enum):
PROCESSING = 'processing'
COMPLETED = 'completed'
FAILED = 'failed'
@dataclass
class IdempotencyRecord:
key: str
request_hash: str
status: IdempotencyStatus
response: Optional[Dict] = None
http_status: Optional[int] = None
created_at: datetime = field(default_factory=datetime.utcnow)
completed_at: Optional[datetime] = None
user_id: Optional[str] = None
def to_dict(self) -> Dict:
return {
'key': self.key,
'request_hash': self.request_hash,
'status': self.status.value,
'response': self.response,
'http_status': self.http_status,
'created_at': self.created_at.isoformat(),
'completed_at': self.completed_at.isoformat() if self.completed_at else None,
'user_id': self.user_id,
}
@classmethod
def from_dict(cls, data: Dict) -> 'IdempotencyRecord':
return cls(
key=data['key'],
request_hash=data['request_hash'],
status=IdempotencyStatus(data['status']),
response=data.get('response'),
http_status=data.get('http_status'),
created_at=datetime.fromisoformat(data['created_at']),
completed_at=datetime.fromisoformat(data['completed_at']) if data.get('completed_at') else None,
user_id=data.get('user_id'),
)
# =============================================================================
# Exceptions
# =============================================================================
class IdempotencyError(Exception):
pass
class IdempotencyConflictError(IdempotencyError):
"""Key reused with different request."""
pass
class IdempotencyInProgressError(IdempotencyError):
"""Previous request still processing."""
pass
# =============================================================================
# Main Implementation
# =============================================================================
class IdempotencyStore:
"""
Production idempotency store with Redis cache and PostgreSQL persistence.
Features:
- Fast path through Redis for cache hits
- PostgreSQL for durability and recovery
- Atomic claim using database locks
- Automatic TTL management
"""
def __init__(
self,
redis_client: redis.Redis,
pg_pool: asyncpg.Pool,
cache_ttl: timedelta = timedelta(hours=1),
record_ttl: timedelta = timedelta(days=7),
):
self.redis = redis_client
self.pg = pg_pool
self.cache_ttl = cache_ttl
self.record_ttl = record_ttl
self.logger = logging.getLogger('idempotency')
def _hash_request(self, request_body: Dict) -> str:
"""Create deterministic hash of request."""
serialized = json.dumps(request_body, sort_keys=True, default=str)
return hashlib.sha256(serialized.encode()).hexdigest()
def _redis_key(self, key: str) -> str:
return f"idempotency:{key}"
async def check_and_claim(
self,
key: str,
request_body: Dict,
user_id: Optional[str] = None
) -> tuple[bool, Optional[IdempotencyRecord]]:
"""
Check if key exists and claim it if not.
Returns:
(is_new, record)
- (True, None): Key is new, claimed for processing
- (False, record): Key exists, return cached record
Raises:
IdempotencyConflictError: Key exists with different request
IdempotencyInProgressError: Previous request still processing
"""
start = time.time()
request_hash = self._hash_request(request_body)
try:
# Fast path: check Redis cache
cached = await self._check_cache(key, request_hash)
if cached:
idempotency_checks.labels(result='cached').inc()
return (False, cached)
# Slow path: check and claim in database
return await self._check_and_claim_db(key, request_hash, user_id)
finally:
idempotency_latency.observe(time.time() - start)
async def _check_cache(self, key: str, request_hash: str) -> Optional[IdempotencyRecord]:
"""Check Redis cache for completed request."""
redis_key = self._redis_key(key)
cached = await self.redis.get(redis_key)
if not cached:
return None
record = IdempotencyRecord.from_dict(json.loads(cached))
# Validate request hash
if record.request_hash != request_hash:
raise IdempotencyConflictError(
f"Key '{key}' was used with a different request"
)
if record.status == IdempotencyStatus.COMPLETED:
return record
if record.status == IdempotencyStatus.PROCESSING:
raise IdempotencyInProgressError(
f"Request '{key}' is still being processed"
)
# Failed - allow retry (will go to DB path)
return None
async def _check_and_claim_db(
self,
key: str,
request_hash: str,
user_id: Optional[str]
) -> tuple[bool, Optional[IdempotencyRecord]]:
"""Check and claim in database with locking."""
async with self.pg.acquire() as conn:
async with conn.transaction():
# Try to get existing record with lock
existing = await conn.fetchrow("""
SELECT * FROM idempotency_records
WHERE key = $1
FOR UPDATE
""", key)
if existing:
record = self._row_to_record(existing)
# Validate request hash
if record.request_hash != request_hash:
idempotency_checks.labels(result='conflict').inc()
raise IdempotencyConflictError(
f"Key '{key}' was used with a different request"
)
if record.status == IdempotencyStatus.COMPLETED:
# Cache for future fast-path
await self._cache_record(record)
idempotency_checks.labels(result='cached').inc()
return (False, record)
if record.status == IdempotencyStatus.PROCESSING:
# Check if stuck (processing for too long)
if self._is_stuck(record):
# Allow retry - reset to processing
await self._reset_record(conn, key)
idempotency_checks.labels(result='new').inc()
return (True, None)
idempotency_checks.labels(result='in_progress').inc()
raise IdempotencyInProgressError(
f"Request '{key}' is still being processed"
)
# Failed - allow retry
await conn.execute("""
UPDATE idempotency_records
SET status = 'processing', completed_at = NULL
WHERE key = $1
""", key)
idempotency_checks.labels(result='new').inc()
return (True, None)
# New key - insert
await conn.execute("""
INSERT INTO idempotency_records
(key, request_hash, status, created_at, user_id, expires_at)
VALUES ($1, $2, 'processing', $3, $4, $5)
""", key, request_hash, datetime.utcnow(), user_id,
datetime.utcnow() + self.record_ttl)
idempotency_checks.labels(result='new').inc()
return (True, None)
def _is_stuck(self, record: IdempotencyRecord) -> bool:
"""Check if a processing record is stuck (> 5 minutes)."""
stuck_threshold = timedelta(minutes=5)
return datetime.utcnow() - record.created_at > stuck_threshold
async def _reset_record(self, conn, key: str):
"""Reset a stuck record for retry."""
await conn.execute("""
UPDATE idempotency_records
SET status = 'processing',
created_at = $1,
completed_at = NULL
WHERE key = $2
""", datetime.utcnow(), key)
async def complete(
self,
key: str,
response: Dict,
http_status: int = 200
):
"""Mark request as completed with response."""
now = datetime.utcnow()
# Update database
async with self.pg.acquire() as conn:
await conn.execute("""
UPDATE idempotency_records
SET status = 'completed',
response = $1,
http_status = $2,
completed_at = $3
WHERE key = $4
""", json.dumps(response), http_status, now, key)
# Get full record for caching
row = await conn.fetchrow(
"SELECT * FROM idempotency_records WHERE key = $1", key
)
# Update cache
if row:
record = self._row_to_record(row)
await self._cache_record(record)
async def fail(self, key: str, error: str):
"""Mark request as failed."""
async with self.pg.acquire() as conn:
await conn.execute("""
UPDATE idempotency_records
SET status = 'failed',
response = $1,
completed_at = $2
WHERE key = $3
""", json.dumps({'error': error}), datetime.utcnow(), key)
# Remove from cache (allow retry)
await self.redis.delete(self._redis_key(key))
async def _cache_record(self, record: IdempotencyRecord):
"""Cache completed record in Redis."""
redis_key = self._redis_key(record.key)
await self.redis.setex(
redis_key,
int(self.cache_ttl.total_seconds()),
json.dumps(record.to_dict())
)
def _row_to_record(self, row) -> IdempotencyRecord:
"""Convert database row to record."""
return IdempotencyRecord(
key=row['key'],
request_hash=row['request_hash'],
status=IdempotencyStatus(row['status']),
response=json.loads(row['response']) if row['response'] else None,
http_status=row['http_status'],
created_at=row['created_at'],
completed_at=row['completed_at'],
user_id=row['user_id'],
)
# =============================================================================
# High-Level Wrapper
# =============================================================================
async def with_idempotency(
store: IdempotencyStore,
key: str,
request_body: Dict,
processor: Callable[[], Awaitable[Dict]],
user_id: Optional[str] = None
) -> tuple[Dict, int]:
"""
Execute processor with idempotency protection.
Returns: (response_body, http_status)
"""
# Check/claim
is_new, record = await store.check_and_claim(key, request_body, user_id)
if not is_new:
# Return cached response
return (record.response, record.http_status)
# Process
try:
response = await processor()
await store.complete(key, response, http_status=200)
return (response, 200)
except Exception as e:
await store.fail(key, str(e))
raise
# =============================================================================
# Database Schema
# =============================================================================
SCHEMA = """
CREATE TABLE IF NOT EXISTS idempotency_records (
key VARCHAR(256) PRIMARY KEY,
request_hash CHAR(64) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'processing',
response JSONB,
http_status INTEGER,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
completed_at TIMESTAMP,
user_id VARCHAR(256),
expires_at TIMESTAMP NOT NULL,
CHECK (status IN ('processing', 'completed', 'failed'))
);
CREATE INDEX IF NOT EXISTS idx_idempotency_expires
ON idempotency_records(expires_at);
CREATE INDEX IF NOT EXISTS idx_idempotency_user
ON idempotency_records(user_id);
-- Cleanup job (run daily)
-- DELETE FROM idempotency_records WHERE expires_at < NOW();
"""
# =============================================================================
# Example Usage with Payment Service
# =============================================================================
async def example_usage():
    """Example of using idempotency with payment service."""
    # Setup
    redis_client = redis.from_url('redis://localhost')
    pg_pool = await asyncpg.create_pool('postgresql://localhost/payments')

    store = IdempotencyStore(redis_client, pg_pool)
    payment_service = PaymentService(...)  # From Day 1

    # Handle payment request
    async def handle_payment(request):
        key = request.headers.get('Idempotency-Key')
        if not key:
            return Response(status=400, body={'error': 'Idempotency-Key required'})

        async def process():
            result = await payment_service.process_payment(
                user_id=request.json['user_id'],
                amount=request.json['amount'],
                idempotency_key=key
            )
            return result.to_dict()

        try:
            response, status = await with_idempotency(
                store=store,
                key=key,
                request_body=request.json,
                processor=process,
                user_id=request.json.get('user_id')
            )
            return Response(status=status, body=response)

        except IdempotencyConflictError:
            return Response(status=422, body={
                'error': 'Idempotency key was used with different request parameters'
            })

        except IdempotencyInProgressError:
            return Response(status=409, body={
                'error': 'Request is still being processed. Please retry shortly.'
            })
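The server side is only half the story: the key pays off only if the client reuses it across retries. A hypothetical client-side sketch (the `flaky_send` function below is a fake transport standing in for a real HTTP call) of a retry loop that generates the key once and resends it on every attempt:

```python
import uuid


def post_payment_with_retries(send, body, max_attempts=3):
    """Retry a payment POST, reusing ONE Idempotency-Key for all attempts.

    `send(headers, body)` stands in for the real HTTP call; it returns a
    (status, response) tuple or raises ConnectionError on network failure.
    """
    key = str(uuid.uuid4())  # generated once, BEFORE the first attempt
    headers = {'Idempotency-Key': key}
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(headers, body)
        except ConnectionError as e:
            last_error = e  # timeout / lost response: safe to retry, same key
    raise last_error


# Fake transport: fails twice, then succeeds, recording the keys it saw.
seen_keys = []
attempts = {'n': 0}


def flaky_send(headers, body):
    seen_keys.append(headers['Idempotency-Key'])
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise ConnectionError('timed out')
    return (200, {'status': 'completed'})


status, resp = post_payment_with_retries(flaky_send, {'amount': 99})
print(status, len(set(seen_keys)))  # 200 1 — three attempts, one key
```

Because all three attempts carry the same key, the server-side store treats them as one logical payment: whichever attempt actually reached the processor is the only one that charges the user.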
End of Day 2: Idempotency in Practice
Tomorrow: Day 3 — Circuit Breakers. We've solved the timeout problem (Day 1) and the duplicate problem (Day 2). But what happens when a downstream service is failing repeatedly? We don't want to keep trying and waiting. Circuit breakers let us fail fast and give struggling services time to recover.