Week 3 — Day 4: Dead Letters and Poison Pills
System Design Mastery Series
Preface
2 AM. PagerDuty goes off. The order processing system is frozen.
Investigation reveals: a single malformed order event has been crashing the consumer for the past 4 hours. Each time the consumer restarts, it fetches the same message, crashes, restarts, crashes again—an infinite loop of suffering.
The Poison Pill Death Spiral:
00:00 - Consumer processes message #4,521,892
00:00 - Message contains invalid JSON in 'items' field
00:00 - Consumer throws exception, crashes
00:01 - Kubernetes restarts consumer
00:01 - Consumer fetches message #4,521,892 (same one!)
00:01 - Crash again
00:02 - Restart again
00:02 - Crash again
...
04:00 - 240 crashes later, on-call engineer wakes up
04:00 - 50,000 messages stuck behind the poison pill
04:00 - Customers can't check out
04:00 - Revenue loss: $847,000
The message that killed a system:
{
"order_id": "ord_123",
"user_id": "usr_456",
"items": "INVALID - should be array, got string", // ← The poison
"total": 99.99
}
Yesterday, we learned about backpressure—handling too many messages. Today, we face a different problem:
What do you do with messages that simply cannot be processed?
These are poison pills—messages that crash your consumers no matter how many times you retry. And the solution is the dead letter queue (DLQ)—a quarantine zone for failed messages.
Today, we'll learn how to implement DLQs, debug poison pills, and build systems that fail gracefully instead of catastrophically.
Part I: Foundations
Chapter 1: What Are Dead Letters and Poison Pills?
1.1 The Simple Definitions
Poison Pill: A message that cannot be successfully processed, no matter how many times you retry. It "poisons" your consumer by causing repeated failures.
Dead Letter Queue (DLQ): A special queue where failed messages go after exceeding retry limits. It's a quarantine zone—messages can be inspected, fixed, and replayed later.
EVERYDAY ANALOGY: The Post Office
Imagine the postal system:
Normal delivery:
Letter → Sorting → Delivery → Recipient ✓
But some letters can't be delivered:
- Wrong address (validation error)
- Recipient moved (transient failure)
- House doesn't exist (permanent failure)
- Explosive device (poison pill!)
What happens to undeliverable mail?
It goes to the DEAD LETTER OFFICE.
There, postal workers:
1. Try to figure out the intended recipient
2. Return to sender if possible
3. Hold for investigation if suspicious
4. Eventually destroy if unsalvageable
Your DLQ serves the same purpose for messages.
1.2 Why Messages Fail
Messages can fail for many reasons. Understanding the failure type determines how to handle it:
FAILURE TAXONOMY
TRANSIENT FAILURES (Retry will help)
─────────────────────────────────────
• Database temporarily unavailable
• Network timeout
• Rate limit exceeded
• Downstream service restarting
→ Solution: Retry with exponential backoff
PERMANENT FAILURES (Retry won't help)
─────────────────────────────────────
• Invalid message format (bad JSON)
• Schema mismatch (missing required field)
• Business rule violation (negative price)
• Referenced entity doesn't exist
→ Solution: Send to DLQ, fix the data or code
POISON PILLS (Crashes the consumer)
───────────────────────────────────
• Causes unhandled exception
• Triggers infinite loop
• Exhausts memory
• Corrupts consumer state
→ Solution: IMMEDIATELY send to DLQ, investigate
CLASSIFICATION CHALLENGE:
The hard part is distinguishing between them:
- Is this timeout transient or permanent?
- Will retrying make it worse?
- Is the downstream service recovering or dead?
Rule of thumb: After N retries, assume permanent.
1.3 The Dead Letter Queue Pattern
MESSAGE FLOW WITH DLQ
Normal path:
Producer → Main Queue → Consumer → Success ✓
Failure path:
Producer → Main Queue → Consumer → Failure!
│
▼
Retry Logic
│
┌─────────┴─────────┐
│ │
Retry < Max Retry >= Max
│ │
▼ ▼
Back to Queue Dead Letter Queue
│ │
▼ │
Consumer │
│ │
Eventually: │
Success ✓ │
or │
DLQ if exhausted │
▼
Manual Investigation
│
┌─────────┴─────────┐
│ │
Fix & Replay Discard
1.4 Key Terminology
| Term | Definition |
|---|---|
| Dead Letter Queue (DLQ) | Queue holding messages that couldn't be processed |
| Poison Pill | Message that crashes or hangs the consumer |
| Retry Count | Number of times processing has been attempted |
| Max Retries | Threshold before sending to DLQ |
| Visibility Timeout | Time message is hidden after being fetched (SQS) |
| Redelivery | Re-sending a message for another processing attempt |
| Replay | Manually reprocessing messages from DLQ |
Chapter 2: DLQ Patterns and Strategies
2.1 Pattern 1: Simple DLQ (Catch-All)
The simplest approach: one DLQ for all failures.
SIMPLE DLQ ARCHITECTURE
Main Queue: orders
│
▼
Consumer
│
├── Success → Commit offset
│
└── Failure → Retry up to 3 times
│
└── Still failing → orders.dlq
DLQ Schema:
{
"original_message": { ... },
"error": "ValueError: Invalid item quantity",
"failed_at": "2024-01-15T10:30:00Z",
"retry_count": 3,
"consumer_id": "orders-consumer-1",
"stack_trace": "..."
}
Pros:
✓ Simple to implement
✓ All failures in one place
✓ Easy to monitor (one queue depth metric)
Cons:
✗ Different failure types mixed together
✗ Replay logic must handle all error types
✗ Can't prioritize critical vs non-critical failures
2.2 Pattern 2: Categorized DLQs
Separate DLQs based on failure type or severity.
CATEGORIZED DLQ ARCHITECTURE
Main Queue: orders
│
▼
Consumer
│
├── Success → Commit
│
└── Failure → Classify error
│
┌──────────────┼──────────────┐
│ │ │
Validation Transient Unknown
Error Error Error
│ │ │
▼ ▼ ▼
orders.dlq orders.dlq orders.dlq
.validation .transient .unknown
Benefits of categorization:
Validation DLQ:
- Contains fixable data issues
- Can build UI for manual correction
- Auto-fix common issues (trim whitespace, etc.)
Transient DLQ:
- Automatic retry after delay
- Often self-resolves
- Alert if persists > 1 hour
Unknown DLQ:
- Requires engineering investigation
- Highest priority alerts
- May indicate bugs in consumer code
2.3 Pattern 3: Delayed Retry Queue
Instead of immediate DLQ, use delayed reprocessing.
DELAYED RETRY ARCHITECTURE
Main Queue → Consumer → Failure!
│
▼
Retry Counter
│
┌───────────────┼───────────────┐
│ │ │
Retry 1 Retry 2 Retry 3+
Wait 1s Wait 10s Wait 60s
│ │ │
▼ ▼ ▼
Delay Queue Delay Queue Delay Queue
(1 second) (10 seconds) (60 seconds)
│ │ │
└───────────────┴───────┬───────┘
│
▼
Back to Main Queue
│
▼
Still failing after 5 retries?
│
▼
DLQ
Implementation with Kafka:
Retry topics:
orders.retry.1 (1 second delay)
orders.retry.2 (10 second delay)
orders.retry.3 (60 second delay)
Each retry topic has consumer that:
1. Sleeps for the delay duration
2. Republishes to main topic
3. Or sends to DLQ if max retries exceeded
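A minimal sketch of one such retry-topic consumer, assuming the same injected consumer/producer style used in the implementations later in this series; the header names (retry_count, enqueued_at), topic names, and MAX_RETRIES value are illustrative assumptions, not a prescribed setup.
# Sketch of a single retry-topic consumer (e.g. for orders.retry.1).
# Assumes an aiokafka-style async consumer/producer interface.
import asyncio
import time

MAIN_TOPIC = "orders"
DLQ_TOPIC = "orders.dlq"
MAX_RETRIES = 3

class RetryTopicConsumer:
    def __init__(self, consumer, producer, delay_seconds: float):
        self.consumer = consumer            # Subscribed to one retry topic
        self.producer = producer
        self.delay_seconds = delay_seconds  # 1s, 10s, or 60s depending on the topic

    async def run(self):
        async for msg in self.consumer:
            headers = dict(msg.headers) if msg.headers else {}
            retry_count = int(headers.get("retry_count", 0))
            enqueued_at = float(headers.get("enqueued_at", time.time()))

            # 1. Sleep out whatever remains of this tier's delay
            remaining = self.delay_seconds - (time.time() - enqueued_at)
            if remaining > 0:
                await asyncio.sleep(remaining)

            # 2. Republish to the main topic, or 3. DLQ if retries are exhausted
            target = MAIN_TOPIC if retry_count < MAX_RETRIES else DLQ_TOPIC
            await self.producer.send(
                target,
                value=msg.value,
                key=msg.key,
                headers=list(headers.items()),
            )
            await self.consumer.commit()
The failing consumer would publish to the appropriate retry topic with retry_count incremented and enqueued_at set to the current time.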
2.4 Pattern 4: Error-Specific Handling
Different error types get different treatment.
# Error Classification and Routing
from enum import Enum
from typing import Optional
import json
class ErrorCategory(Enum):
VALIDATION = "validation" # Bad data, won't retry
TRANSIENT = "transient" # Might succeed later
DEPENDENCY = "dependency" # External service issue
POISON = "poison" # Crashes consumer
UNKNOWN = "unknown" # Unclassified
ERROR_ROUTING = {
ErrorCategory.VALIDATION: {
"retry": False,
"dlq": "orders.dlq.validation",
"alert": False, # Expected errors
},
ErrorCategory.TRANSIENT: {
"retry": True,
"max_retries": 5,
"backoff": "exponential",
"dlq": "orders.dlq.transient",
"alert": True, # Alert if DLQ'd
},
ErrorCategory.DEPENDENCY: {
"retry": True,
"max_retries": 10,
"backoff": "linear",
"dlq": "orders.dlq.dependency",
"alert": True,
},
ErrorCategory.POISON: {
"retry": False, # Never retry poison
"dlq": "orders.dlq.poison",
"alert": True, # Immediate alert
"page_oncall": True, # Wake someone up
},
ErrorCategory.UNKNOWN: {
"retry": True,
"max_retries": 3,
"dlq": "orders.dlq.unknown",
"alert": True,
},
}
def classify_error(error: Exception) -> ErrorCategory:
"""Classify an exception into an error category."""
# Validation errors
if isinstance(error, (ValueError, json.JSONDecodeError, KeyError)):
return ErrorCategory.VALIDATION
# Transient errors
if isinstance(error, (TimeoutError, ConnectionError)):
return ErrorCategory.TRANSIENT
# Dependency errors
if "503" in str(error) or "rate limit" in str(error).lower():
return ErrorCategory.DEPENDENCY
# Poison indicators
if isinstance(error, (MemoryError, RecursionError)):
return ErrorCategory.POISON
return ErrorCategory.UNKNOWN
2.5 Pattern Comparison
| Pattern | Complexity | Best For | Drawback |
|---|---|---|---|
| Simple DLQ | Low | Small systems, getting started | All errors mixed |
| Categorized | Medium | Mature systems | More queues to manage |
| Delayed Retry | Medium | High transient failure rate | Delay adds latency |
| Error-Specific | High | Complex systems | Requires good classification |
Chapter 3: Trade-offs and Considerations
3.1 Retry Count Trade-offs
HOW MANY RETRIES?
Too few retries (1-2):
- Transient errors go to DLQ unnecessarily
- DLQ fills up with recoverable messages
- More manual work to replay
Too many retries (10+):
- Poison pills block processing longer
- Resource waste on hopeless retries
- Delays detection of real problems
Sweet spot depends on:
- Your transient failure rate
- Time to recover from transient issues
- Cost of DLQ investigation
COMMON CONFIGURATIONS:
Low-latency systems: 2-3 retries, fast backoff
High-reliability: 5-7 retries, longer backoff
Batch processing: 10+ retries, very long backoff
ADAPTIVE RETRIES:
class AdaptiveRetryPolicy:
def get_max_retries(self, error_category: ErrorCategory) -> int:
return {
ErrorCategory.VALIDATION: 0, # Never retry
ErrorCategory.TRANSIENT: 5,
ErrorCategory.DEPENDENCY: 10,
ErrorCategory.POISON: 0, # Never retry
ErrorCategory.UNKNOWN: 3,
}[error_category]
3.2 DLQ Retention Trade-offs
HOW LONG TO KEEP DLQ MESSAGES?
Short retention (1-7 days):
✓ Less storage cost
✓ Forces prompt investigation
✗ May lose data if team is slow
✗ No historical analysis
Long retention (30-90 days):
✓ Time to investigate complex issues
✓ Historical pattern analysis
✗ Storage costs
✗ Stale messages may not replay correctly
Infinite retention:
✓ Never lose data
✗ Growing storage forever
✗ Old messages become irrelevant
RECOMMENDED APPROACH:
1. Keep hot DLQ for 7 days (fast access)
2. Archive to cold storage (S3) for 90 days
3. Delete after 90 days unless flagged
Schema evolution consideration:
Messages from 90 days ago may not match current schema.
Include schema version in DLQ messages.
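A sketch of the hot/cold split described above. The dlq_store and cold_storage objects are assumed interfaces (not real client libraries), and the failed_at field follows the DLQ schema from Chapter 2.
# Sketch: move DLQ messages older than the hot window (7 days) to cold storage.
import json
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=7)

async def archive_old_dlq_messages(dlq_store, cold_storage, dlq_topic: str) -> int:
    cutoff = datetime.now(timezone.utc) - HOT_RETENTION
    archived = 0

    async for dead_letter in dlq_store.iter_messages(dlq_topic):
        failed_at = datetime.fromisoformat(
            dead_letter["failed_at"].replace("Z", "+00:00")
        )
        if failed_at > cutoff:
            continue  # Still inside the hot window

        # Carry a schema version so 90-day-old messages remain interpretable
        payload = dict(dead_letter, schema_version=dead_letter.get("schema_version", 1))
        key = f"dlq-archive/{dlq_topic}/{failed_at:%Y/%m/%d}/{archived}.json"
        await cold_storage.put(key, json.dumps(payload).encode("utf-8"))
        await dlq_store.delete(dlq_topic, dead_letter)
        archived += 1

    return archived
Deletion after 90 days can then be handled by a lifecycle rule on the cold store (for example an S3 lifecycle policy), with flagged messages excluded.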
3.3 When NOT to Use a DLQ
SKIP THE DLQ WHEN:
1. Messages are fully idempotent AND retriable
Just retry forever with backoff
Eventually it will work or alert
2. Data loss is acceptable
Metrics, logs, analytics
Drop and move on
3. Real-time requirements
If it fails, it's too late anyway
Log the error, move on
4. You have a better recovery mechanism
Event sourcing: rebuild from event log
CQRS: reproject from source
USE DLQ WHEN:
1. Every message must eventually succeed
Orders, payments, critical workflows
2. You need visibility into failures
Debugging, compliance, auditing
3. Manual intervention is expected
Data correction, customer support
4. Replay is part of your recovery plan
Fix bug, replay failed messages
3.4 Decision Framework
SHOULD THIS MESSAGE GO TO DLQ?
Start: Message processing failed
│
▼
Is it a poison pill (crashes consumer)?
│
├── YES → DLQ immediately, alert on-call
│
└── NO → Is it a validation error?
│
├── YES → Can it be auto-fixed?
│ │
│ ┌────┴────┐
│ YES NO
│ │ │
│ Fix & DLQ
│ Retry (validation)
│
└── NO → Retry count < max?
│
┌────┴────┐
YES NO
│ │
Retry DLQ
w/backoff (exhausted)
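The same decision tree expressed as a small routing function; PoisonPillError, ValidationError, and the auto_fix hook are placeholder names for whatever your consumer defines, not library types.
# The decision framework above as code.
from typing import Callable, Optional

MAX_RETRIES = 3

class PoisonPillError(Exception):
    pass

class ValidationError(Exception):
    pass

def decide_action(error: Exception,
                  retry_count: int,
                  auto_fix: Optional[Callable[[dict], dict]] = None) -> str:
    """Return one of: dlq_poison, fix_and_retry, dlq_validation, retry, dlq_exhausted."""
    if isinstance(error, PoisonPillError):
        return "dlq_poison"         # DLQ immediately, alert on-call
    if isinstance(error, ValidationError):
        if auto_fix is not None:
            return "fix_and_retry"  # Auto-fix available: fix, then retry once
        return "dlq_validation"
    if retry_count < MAX_RETRIES:
        return "retry"              # Retry with backoff
    return "dlq_exhausted"          # Out of retries: DLQ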
Part II: Implementation
Chapter 4: Basic Implementation
4.1 Simple DLQ Consumer
# Basic DLQ Implementation
# WARNING: Not production-ready - for learning only
import asyncio
import json
from datetime import datetime
from typing import Any, Dict, Optional
from dataclasses import dataclass, asdict
@dataclass
class DeadLetter:
"""A message that failed processing."""
original_message: Dict[str, Any]
error_type: str
error_message: str
stack_trace: str
failed_at: str
retry_count: int
consumer_id: str
topic: str
partition: Optional[int] = None
offset: Optional[int] = None
class SimpleDLQConsumer:
"""
Consumer with basic dead letter queue support.
"""
def __init__(
self,
kafka_consumer,
kafka_producer,
dlq_topic: str,
max_retries: int = 3,
consumer_id: str = "consumer-1"
):
self.consumer = kafka_consumer
self.producer = kafka_producer
self.dlq_topic = dlq_topic
self.max_retries = max_retries
self.consumer_id = consumer_id
async def process_message(self, message: Dict[str, Any]) -> None:
"""Override this to implement your processing logic."""
raise NotImplementedError
async def run(self):
"""Main consumer loop with DLQ handling."""
async for msg in self.consumer:
            headers = dict(msg.headers) if msg.headers else {}
            retry_count = int(headers.get('retry_count', 0))  # Header values arrive as str/bytes
try:
# Parse message
value = json.loads(msg.value)
# Process
await self.process_message(value)
# Commit on success
await self.consumer.commit()
except Exception as e:
await self._handle_failure(msg, e, retry_count)
async def _handle_failure(self, msg, error: Exception, retry_count: int):
"""Handle a processing failure."""
if retry_count < self.max_retries:
# Retry: republish with incremented count
print(f"Retrying message (attempt {retry_count + 1}/{self.max_retries})")
await self._retry_message(msg, retry_count + 1)
else:
# Max retries exceeded: send to DLQ
print(f"Max retries exceeded, sending to DLQ: {error}")
await self._send_to_dlq(msg, error, retry_count)
# Commit to move past this message
await self.consumer.commit()
async def _retry_message(self, msg, new_retry_count: int):
"""Republish message for retry."""
headers = dict(msg.headers) if msg.headers else {}
        headers['retry_count'] = str(new_retry_count).encode()  # Kafka header values are bytes
await self.producer.send(
msg.topic,
value=msg.value,
key=msg.key,
headers=list(headers.items())
)
async def _send_to_dlq(self, msg, error: Exception, retry_count: int):
"""Send failed message to dead letter queue."""
import traceback
dead_letter = DeadLetter(
original_message=json.loads(msg.value),
error_type=type(error).__name__,
error_message=str(error),
stack_trace=traceback.format_exc(),
failed_at=datetime.utcnow().isoformat(),
retry_count=retry_count,
consumer_id=self.consumer_id,
topic=msg.topic,
partition=msg.partition,
offset=msg.offset
)
await self.producer.send(
self.dlq_topic,
value=json.dumps(asdict(dead_letter)).encode(),
key=msg.key
)
# Usage example
class OrderConsumer(SimpleDLQConsumer):
async def process_message(self, message: Dict[str, Any]) -> None:
order_id = message['order_id']
items = message['items'] # May raise KeyError
if not isinstance(items, list):
raise ValueError(f"Items must be a list, got {type(items)}")
# Process order...
print(f"Processing order {order_id} with {len(items)} items")
4.2 Understanding the Flow
DLQ MESSAGE LIFECYCLE
1. Producer sends message
Topic: orders
Message: {"order_id": "123", "items": "invalid"}
2. Consumer receives message
Attempt 1: ValueError - items not a list
3. Consumer retries (message republished)
Attempt 2: ValueError - same error
4. Consumer retries again
Attempt 3: ValueError - still failing
5. Max retries exceeded → DLQ
Topic: orders.dlq
Message: {
"original_message": {"order_id": "123", "items": "invalid"},
"error_type": "ValueError",
"error_message": "Items must be a list, got <class 'str'>",
"stack_trace": "...",
"failed_at": "2024-01-15T10:30:00Z",
"retry_count": 3
}
6. Manual investigation
Engineer sees message in DLQ
Identifies bug in producer
Fixes producer
7. Replay from DLQ
Fix the message format
Republish to orders topic
Message processes successfully
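As a concrete (hypothetical) version of step 7: if the producer bug turns out to be that it JSON-encoded the items list into a string, the fix-and-replay step might look like this. The field name and topic follow the example above.
# Hypothetical fix for the lifecycle above: the producer serialized `items`
# as a JSON string instead of a list. Unfixable messages stay in the DLQ.
import json

def fix_items_field(original_message: dict) -> dict:
    items = original_message.get("items")
    if isinstance(items, str):
        try:
            original_message["items"] = json.loads(items)  # '["sku_1"]' -> ["sku_1"]
        except json.JSONDecodeError:
            raise ValueError("items cannot be repaired; leave in DLQ for manual review")
    return original_message

async def replay_dead_letter(dead_letter: dict, producer) -> None:
    fixed = fix_items_field(dead_letter["original_message"])
    await producer.send("orders", value=json.dumps(fixed).encode("utf-8"))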
Chapter 5: Production-Ready Implementation
5.1 Requirements for Production
- Error classification — Different errors need different handling
- Exponential backoff — Don't hammer failing systems
- Circuit breaker integration — Stop retrying during outages
- Metrics and alerting — Know when DLQ is growing
- Replay tooling — Easy to investigate and replay
- Poison pill detection — Catch crashes before they loop
5.2 Complete Production Implementation
# Production-Ready Dead Letter Queue Implementation
import asyncio
import json
import traceback
import hashlib
from dataclasses import dataclass, field, asdict
from datetime import datetime, timedelta
from typing import Optional, Dict, Any, List, Callable, Set
from enum import Enum
import logging
import time
logger = logging.getLogger(__name__)
# =============================================================================
# Configuration
# =============================================================================
@dataclass
class DLQConfig:
"""Configuration for dead letter queue handling."""
# Retry configuration
max_retries: int = 5
base_backoff_seconds: float = 1.0
max_backoff_seconds: float = 300.0
backoff_multiplier: float = 2.0
# DLQ topics
dlq_topic_suffix: str = ".dlq"
validation_dlq_suffix: str = ".dlq.validation"
poison_dlq_suffix: str = ".dlq.poison"
# Poison pill detection
poison_detection_enabled: bool = True
poison_timeout_seconds: float = 30.0
# Alerting thresholds
dlq_alert_threshold: int = 100
dlq_critical_threshold: int = 1000
class ErrorCategory(Enum):
VALIDATION = "validation"
TRANSIENT = "transient"
DEPENDENCY = "dependency"
POISON = "poison"
UNKNOWN = "unknown"
@dataclass
class DeadLetter:
"""A message that failed processing."""
id: str
original_topic: str
original_partition: int
original_offset: int
original_key: Optional[str]
original_value: Dict[str, Any]
original_headers: Dict[str, str]
error_category: str
error_type: str
error_message: str
stack_trace: str
first_failed_at: str
last_failed_at: str
retry_count: int
consumer_id: str
consumer_version: str
# For replay tracking
replayed: bool = False
replayed_at: Optional[str] = None
replay_result: Optional[str] = None
# =============================================================================
# Error Classification
# =============================================================================
class ErrorClassifier:
"""Classifies exceptions into error categories."""
# Exceptions that indicate validation errors
VALIDATION_ERRORS = (
ValueError,
KeyError,
TypeError,
json.JSONDecodeError,
)
# Exceptions that indicate transient errors
TRANSIENT_ERRORS = (
TimeoutError,
ConnectionError,
ConnectionResetError,
)
# Error message patterns
TRANSIENT_PATTERNS = [
"timeout",
"connection refused",
"temporarily unavailable",
"try again",
"rate limit",
"503",
"504",
]
POISON_PATTERNS = [
"out of memory",
"recursion",
"stack overflow",
"segmentation fault",
]
@classmethod
def classify(cls, error: Exception) -> ErrorCategory:
"""Classify an exception."""
error_str = str(error).lower()
        # Check for poison indicators (PoisonPillError is defined later in this
        # module, so match it by name rather than by a forward reference)
        if isinstance(error, (MemoryError, RecursionError)) or \
                type(error).__name__ == "PoisonPillError":
            return ErrorCategory.POISON
for pattern in cls.POISON_PATTERNS:
if pattern in error_str:
return ErrorCategory.POISON
# Check for validation errors
if isinstance(error, cls.VALIDATION_ERRORS):
return ErrorCategory.VALIDATION
# Check for transient errors
if isinstance(error, cls.TRANSIENT_ERRORS):
return ErrorCategory.TRANSIENT
for pattern in cls.TRANSIENT_PATTERNS:
if pattern in error_str:
return ErrorCategory.TRANSIENT
return ErrorCategory.UNKNOWN
# =============================================================================
# Core DLQ Handler
# =============================================================================
class DLQHandler:
"""
Production-ready dead letter queue handler.
Features:
- Error classification and routing
- Exponential backoff
- Poison pill detection
- Metrics and alerting
- Replay support
"""
def __init__(
self,
config: DLQConfig,
kafka_producer,
metrics_client,
consumer_id: str,
consumer_version: str = "1.0.0"
):
self.config = config
self.producer = kafka_producer
self.metrics = metrics_client
self.consumer_id = consumer_id
self.consumer_version = consumer_version
# Track retry attempts per message
self._retry_counts: Dict[str, int] = {}
self._first_failures: Dict[str, datetime] = {}
# Metrics
self.messages_retried = 0
self.messages_dlqd = 0
self.poison_pills_detected = 0
def get_message_id(self, topic: str, partition: int, offset: int) -> str:
"""Generate unique ID for a message."""
content = f"{topic}:{partition}:{offset}"
return hashlib.md5(content.encode()).hexdigest()[:16]
async def handle_failure(
self,
topic: str,
partition: int,
offset: int,
key: Optional[bytes],
value: bytes,
headers: Dict[str, str],
error: Exception
) -> None:
"""
Handle a message processing failure.
Decides whether to retry or send to DLQ based on error type
and retry count.
"""
msg_id = self.get_message_id(topic, partition, offset)
# Classify error
category = ErrorClassifier.classify(error)
# Track first failure time
if msg_id not in self._first_failures:
self._first_failures[msg_id] = datetime.utcnow()
# Get current retry count
retry_count = self._retry_counts.get(msg_id, 0)
# Determine action based on category
if category == ErrorCategory.POISON:
# Never retry poison pills
logger.error(f"Poison pill detected: {error}")
self.poison_pills_detected += 1
await self._send_to_dlq(
topic, partition, offset, key, value, headers,
error, category, retry_count,
dlq_suffix=self.config.poison_dlq_suffix
)
self._cleanup_tracking(msg_id)
elif category == ErrorCategory.VALIDATION:
# Don't retry validation errors
logger.warning(f"Validation error, sending to DLQ: {error}")
await self._send_to_dlq(
topic, partition, offset, key, value, headers,
error, category, retry_count,
dlq_suffix=self.config.validation_dlq_suffix
)
self._cleanup_tracking(msg_id)
elif retry_count < self.config.max_retries:
# Retry with backoff
backoff = self._calculate_backoff(retry_count)
logger.info(
f"Retrying message (attempt {retry_count + 1}/{self.config.max_retries}), "
f"backoff: {backoff:.1f}s"
)
self._retry_counts[msg_id] = retry_count + 1
self.messages_retried += 1
# Wait for backoff
await asyncio.sleep(backoff)
# Raise to trigger reprocessing
raise RetryableError(f"Retry attempt {retry_count + 1}")
else:
# Max retries exceeded
logger.warning(f"Max retries exceeded, sending to DLQ: {error}")
await self._send_to_dlq(
topic, partition, offset, key, value, headers,
error, category, retry_count
)
self._cleanup_tracking(msg_id)
def _calculate_backoff(self, retry_count: int) -> float:
"""Calculate exponential backoff duration."""
backoff = self.config.base_backoff_seconds * (
self.config.backoff_multiplier ** retry_count
)
return min(backoff, self.config.max_backoff_seconds)
async def _send_to_dlq(
self,
topic: str,
partition: int,
offset: int,
key: Optional[bytes],
value: bytes,
headers: Dict[str, str],
error: Exception,
category: ErrorCategory,
retry_count: int,
dlq_suffix: Optional[str] = None
) -> None:
"""Send a failed message to the dead letter queue."""
# Determine DLQ topic
suffix = dlq_suffix or self.config.dlq_topic_suffix
dlq_topic = f"{topic}{suffix}"
# Parse original value
try:
original_value = json.loads(value.decode('utf-8'))
        except (json.JSONDecodeError, UnicodeDecodeError):
original_value = {"raw": value.decode('utf-8', errors='replace')}
# Build dead letter
msg_id = self.get_message_id(topic, partition, offset)
first_failed = self._first_failures.get(msg_id, datetime.utcnow())
dead_letter = DeadLetter(
id=msg_id,
original_topic=topic,
original_partition=partition,
original_offset=offset,
original_key=key.decode('utf-8') if key else None,
original_value=original_value,
original_headers=headers,
error_category=category.value,
error_type=type(error).__name__,
error_message=str(error)[:1000], # Truncate long messages
stack_trace=traceback.format_exc()[:5000], # Truncate
first_failed_at=first_failed.isoformat(),
last_failed_at=datetime.utcnow().isoformat(),
retry_count=retry_count,
consumer_id=self.consumer_id,
consumer_version=self.consumer_version
)
# Send to DLQ
await self.producer.send(
dlq_topic,
value=json.dumps(asdict(dead_letter)).encode('utf-8'),
key=key
)
# Update metrics
self.messages_dlqd += 1
self.metrics.increment('dlq.messages.total', tags={'topic': topic, 'category': category.value})
logger.info(f"Message sent to DLQ: {dlq_topic}")
def _cleanup_tracking(self, msg_id: str) -> None:
"""Clean up tracking state for a message."""
self._retry_counts.pop(msg_id, None)
self._first_failures.pop(msg_id, None)
class RetryableError(Exception):
"""Raised to trigger message reprocessing."""
pass
# =============================================================================
# Consumer with DLQ Support
# =============================================================================
class DLQConsumer:
"""
Kafka consumer with integrated dead letter queue support.
"""
def __init__(
self,
kafka_consumer,
kafka_producer,
dlq_handler: DLQHandler,
process_fn: Callable[[Dict[str, Any]], Any],
config: DLQConfig
):
self.consumer = kafka_consumer
self.producer = kafka_producer
self.dlq_handler = dlq_handler
self.process_fn = process_fn
self.config = config
self._running = False
async def start(self):
"""Start consuming messages."""
self._running = True
logger.info("DLQ consumer started")
async for msg in self.consumer:
if not self._running:
break
await self._process_message(msg)
async def stop(self):
"""Stop the consumer."""
self._running = False
async def _process_message(self, msg) -> None:
"""Process a single message with DLQ handling."""
headers = dict(msg.headers) if msg.headers else {}
try:
# Parse message
value = json.loads(msg.value.decode('utf-8'))
# Detect poison pills with timeout
if self.config.poison_detection_enabled:
try:
await asyncio.wait_for(
self._process_with_timeout(value),
timeout=self.config.poison_timeout_seconds
)
except asyncio.TimeoutError:
raise PoisonPillError(
f"Processing timed out after {self.config.poison_timeout_seconds}s"
)
else:
await self.process_fn(value)
# Success - commit
await self.consumer.commit()
except RetryableError:
# Don't commit, let consumer refetch
pass
except Exception as e:
# Handle failure
try:
await self.dlq_handler.handle_failure(
topic=msg.topic,
partition=msg.partition,
offset=msg.offset,
key=msg.key,
value=msg.value,
headers=headers,
error=e
)
# Commit to move past this message
await self.consumer.commit()
            except RetryableError:
                # Don't commit; a real consumer would seek back to this offset
                # so the failed message is actually fetched and retried
                pass
async def _process_with_timeout(self, value: Dict[str, Any]) -> None:
"""Process with timeout for poison pill detection."""
await self.process_fn(value)
class PoisonPillError(Exception):
"""Indicates a poison pill message."""
pass
# =============================================================================
# DLQ Replay Tool
# =============================================================================
class DLQReplayTool:
"""
Tool for investigating and replaying DLQ messages.
"""
def __init__(self, kafka_consumer, kafka_producer, db_client=None):
self.consumer = kafka_consumer
self.producer = kafka_producer
self.db = db_client # Optional: for tracking replays
async def list_messages(
self,
dlq_topic: str,
limit: int = 100,
error_category: Optional[str] = None
) -> List[DeadLetter]:
"""List messages in the DLQ."""
messages = []
async for msg in self.consumer:
if len(messages) >= limit:
break
dead_letter = DeadLetter(**json.loads(msg.value))
if error_category and dead_letter.error_category != error_category:
continue
messages.append(dead_letter)
return messages
async def replay_message(
self,
dead_letter: DeadLetter,
fix_fn: Optional[Callable[[Dict], Dict]] = None
) -> bool:
"""
Replay a single message from the DLQ.
Args:
dead_letter: The dead letter to replay
fix_fn: Optional function to fix the message before replay
"""
value = dead_letter.original_value
# Apply fix if provided
if fix_fn:
try:
value = fix_fn(value)
except Exception as e:
logger.error(f"Fix function failed: {e}")
return False
# Republish to original topic
try:
await self.producer.send(
dead_letter.original_topic,
value=json.dumps(value).encode('utf-8'),
key=dead_letter.original_key.encode() if dead_letter.original_key else None
)
logger.info(f"Replayed message {dead_letter.id} to {dead_letter.original_topic}")
return True
except Exception as e:
logger.error(f"Replay failed: {e}")
return False
async def replay_batch(
self,
dead_letters: List[DeadLetter],
fix_fn: Optional[Callable[[Dict], Dict]] = None,
dry_run: bool = False
) -> Dict[str, int]:
"""Replay multiple messages."""
results = {"success": 0, "failed": 0, "skipped": 0}
for dl in dead_letters:
if dry_run:
logger.info(f"[DRY RUN] Would replay: {dl.id}")
results["skipped"] += 1
continue
success = await self.replay_message(dl, fix_fn)
if success:
results["success"] += 1
else:
results["failed"] += 1
return results
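A usage sketch for DLQReplayTool, assuming the consumer is already subscribed to the validation DLQ topic; fix_missing_currency and the USD default are hypothetical.
# Usage sketch: list validation failures, dry-run, then replay with a fix.
async def triage_validation_dlq(dlq_consumer, producer):
    tool = DLQReplayTool(dlq_consumer, producer)

    # 1. Inspect what is sitting in the validation DLQ
    dead_letters = await tool.list_messages(
        dlq_topic="orders.dlq.validation",
        limit=50,
        error_category="validation",
    )

    def fix_missing_currency(msg: dict) -> dict:
        msg.setdefault("currency", "USD")  # Assumed business default
        return msg

    # 2. Dry run first to see what would be replayed
    preview = await tool.replay_batch(dead_letters, fix_fn=fix_missing_currency, dry_run=True)

    # 3. Replay for real once the preview looks right
    results = await tool.replay_batch(dead_letters, fix_fn=fix_missing_currency)
    return preview, results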
5.3 Monitoring and Alerting
# DLQ Monitoring
class DLQMonitor:
"""
Monitors dead letter queue health and triggers alerts.
"""
def __init__(
self,
metrics_client,
alerting_client,
dlq_topics: List[str],
config: DLQConfig
):
self.metrics = metrics_client
self.alerting = alerting_client
self.dlq_topics = dlq_topics
self.config = config
async def check_health(self) -> Dict[str, Any]:
"""Check DLQ health across all topics."""
results = {}
for topic in self.dlq_topics:
depth = await self.metrics.get_queue_depth(topic)
growth_rate = await self.metrics.get_growth_rate(topic)
results[topic] = {
"depth": depth,
"growth_rate": growth_rate,
"status": self._determine_status(depth)
}
# Trigger alerts
if depth >= self.config.dlq_critical_threshold:
await self.alerting.send_alert(
severity="critical",
title=f"DLQ Critical: {topic}",
message=f"DLQ depth is {depth:,}, threshold is {self.config.dlq_critical_threshold:,}",
tags={"topic": topic}
)
elif depth >= self.config.dlq_alert_threshold:
await self.alerting.send_alert(
severity="warning",
title=f"DLQ Warning: {topic}",
message=f"DLQ depth is {depth:,}, threshold is {self.config.dlq_alert_threshold:,}",
tags={"topic": topic}
)
return results
def _determine_status(self, depth: int) -> str:
if depth >= self.config.dlq_critical_threshold:
return "critical"
elif depth >= self.config.dlq_alert_threshold:
return "warning"
return "healthy"
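A small driver loop for the monitor above; the 60-second interval is an assumption, and the logger is the module-level one defined earlier.
# Sketch: run the DLQ health check on a fixed interval until cancelled.
import asyncio

async def run_dlq_monitor(monitor: DLQMonitor, interval_seconds: int = 60) -> None:
    while True:
        try:
            results = await monitor.check_health()
            logger.info("DLQ health: %s", results)
        except Exception:
            # Never let a failed check kill the monitoring loop
            logger.exception("DLQ health check failed")
        await asyncio.sleep(interval_seconds)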
Chapter 6: Edge Cases and Error Handling
6.1 Edge Case 1: DLQ Consumer Also Fails
SCENARIO: Message sent to DLQ, but DLQ consumer fails
Main Queue: orders → Consumer → Failure → DLQ: orders.dlq
Then:
DLQ: orders.dlq → DLQ Consumer → Also Fails!
What now? DLQ for the DLQ?
SOLUTIONS:
1. DLQ CONSUMERS SHOULD BE SIMPLER
Don't process—just store and alert
async def dlq_consumer(msg):
# Just save to database, don't process
await db.insert("dlq_messages", msg)
await alert("New DLQ message", msg)
2. MANUAL INTERVENTION FLAG
If DLQ consumer fails, stop and alert
Require human to investigate
3. FALLBACK TO FILE STORAGE
If all else fails, write to local disk
try:
await send_to_dlq(msg)
except:
with open("emergency_dlq.jsonl", "a") as f:
f.write(json.dumps(msg) + "\n")
alert_critical("DLQ send failed!")
6.2 Edge Case 2: Replay Causes New Errors
SCENARIO:
1. Message fails with bug A
2. Bug A is fixed
3. Message replayed
4. Message fails with bug B!
TIMELINE:
Original: {"items": [{"qty": "5"}]} ← qty should be int
Bug A: Consumer expects qty to be int
Fix A: Consumer now converts string to int
Replay: Still fails!
Bug B: Consumer expects "price" field (new requirement)
Message format evolved, old messages no longer valid.
SOLUTIONS:
1. VERSIONED PROCESSING
Check message version, apply appropriate logic
if msg.get('version', 1) < 2:
# Old format processing
process_v1(msg)
else:
# New format processing
process_v2(msg)
2. MESSAGE MIGRATION ON REPLAY
Transform old messages to new format
def migrate_message(msg: dict) -> dict:
if 'price' not in msg:
msg['price'] = calculate_legacy_price(msg)
return msg
3. DISCARD STALE MESSAGES
After N days, messages may be too old to replay
Archive and close them out
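Combining solutions 2 and 3: a sketch of a replay path that migrates old messages to the current format and closes out anything too stale to replay. The 30-day cutoff, calculate_legacy_price, and archive_store are illustrative assumptions.
# Sketch combining migration (solution 2) with stale-message discard (solution 3).
import json
from datetime import datetime, timedelta, timezone

MAX_REPLAY_AGE = timedelta(days=30)

def migrate_message(msg: dict) -> dict:
    if "price" not in msg:
        msg["price"] = calculate_legacy_price(msg)  # Hypothetical backfill helper
    msg["version"] = 2
    return msg

async def replay_or_close(dead_letter: dict, producer, archive_store) -> str:
    failed_at = datetime.fromisoformat(
        dead_letter["failed_at"].replace("Z", "+00:00")
    )
    if datetime.now(timezone.utc) - failed_at > MAX_REPLAY_AGE:
        await archive_store.close_out(dead_letter, reason="too stale to replay")
        return "discarded"

    fixed = migrate_message(dead_letter["original_message"])
    await producer.send("orders", value=json.dumps(fixed).encode("utf-8"))
    return "replayed"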
6.3 Edge Case 3: Infinite Retry Loop
SCENARIO: Message keeps retrying forever
Consumer A receives message
→ Failure, retry
→ Republish to queue
Consumer B receives same message
→ Failure, retry
→ Republish to queue
Consumer A receives same message again...
ROOT CAUSE: Retry count in headers not being preserved
WRONG:
# Each consumer starts retry count at 0
retry_count = 0
RIGHT:
# Read retry count from message headers
retry_count = int(msg.headers.get('retry_count', 0))
ALSO CHECK:
- Headers being preserved on republish
- Retry count being incremented
- Max retries actually being enforced
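A republish helper that satisfies all three checks above, assuming Kafka's convention of (str key, bytes value) headers; the max-retry constant is illustrative.
# Sketch: republish a failed message while preserving and incrementing
# retry_count, and enforcing the limit at the same time.
MAX_RETRIES = 5

async def republish_for_retry(msg, producer, dlq_topic: str) -> None:
    headers = dict(msg.headers) if msg.headers else {}

    # Header values arrive as bytes in Kafka clients; tolerate str/int too
    raw = headers.get("retry_count", b"0")
    retry_count = int(raw if isinstance(raw, (int, str)) else raw.decode())

    if retry_count + 1 > MAX_RETRIES:
        # Enforce the limit here as well, not only in the consumer
        await producer.send(dlq_topic, value=msg.value, key=msg.key)
        return

    headers["retry_count"] = str(retry_count + 1).encode()
    await producer.send(
        msg.topic,
        value=msg.value,
        key=msg.key,
        headers=list(headers.items()),
    )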
6.4 Edge Case 4: Duplicate DLQ Messages
SCENARIO: Same message appears in DLQ multiple times
Consumer crashes AFTER sending to DLQ but BEFORE committing.
Consumer restarts, reprocesses same message.
Sends to DLQ again.
SOLUTION: Idempotent DLQ writes
Use message ID as deduplication key:
async def send_to_dlq(msg):
msg_id = get_message_id(msg.topic, msg.partition, msg.offset)
# Check if already in DLQ
if await dlq_store.exists(msg_id):
logger.info(f"Message {msg_id} already in DLQ, skipping")
return
# Send to DLQ
await producer.send(dlq_topic, msg)
# Record in DLQ store
await dlq_store.add(msg_id, ttl=timedelta(days=7))
6.5 Error Handling Matrix
| Error | Retry? | DLQ? | Alert? | Action |
|---|---|---|---|---|
| JSON parse error | No | Yes (validation) | No | Fix producer |
| Missing field | No | Yes (validation) | No | Fix producer/schema |
| Database timeout | Yes (3x) | Yes if exhausted | If DLQ'd | Wait for DB |
| Rate limit (429) | Yes (10x) | Yes if exhausted | Yes | Check quotas |
| Consumer crash | No | Yes (poison) | Yes (page) | Investigate |
| Out of memory | No | Yes (poison) | Yes (page) | Fix consumer |
| Unknown error | Yes (3x) | Yes if exhausted | Yes | Investigate |
Part III: Real-World Application
Chapter 7: How Big Tech Does It
7.1 Case Study: Uber — DLQ for Trip Events
UBER TRIP EVENT DLQ SYSTEM
Scale:
- Millions of trips per day
- Events: trip_started, trip_updated, trip_completed
- Each event updates pricing, driver earnings, rider charges
Challenge:
- Trip events must never be lost (financial implications)
- But system must stay responsive during spikes
- Some events have data issues from client bugs
Architecture:
Trip Events → Kafka → Trip Processor
│
┌────┴────┐
│ │
Success Failure
│ │
▼ ▼
Commit Classify
│
┌─────────┼─────────┐
│ │ │
Retryable Validation Poison
│ │ │
▼ ▼ ▼
Retry DLQ with Immediate
Topic Auto-fix Alert
│ Attempt
│ │
▼ ▼
Main Fixed?
Topic │
┌────┴────┐
YES NO
│ │
Replay Manual
Review
Key Innovations:
1. AUTO-FIX LAYER
Common validation errors have automatic fixes
- Missing timezone → default to trip city timezone
- Invalid currency → lookup from trip region
- Negative amount → take absolute value (with flag)
2. FINANCIAL RECONCILIATION
DLQ'd trip events trigger reconciliation job
Compares actual charges vs expected
Ensures no money is lost
3. SLA-BASED ALERTING
Trip must be processed within 15 minutes
DLQ messages > 15 min old → page on-call
Results:
- 99.99% of events processed normally
- 0.01% go to DLQ
- 80% of DLQ auto-fixed and replayed
- No lost trip data in production
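A hypothetical auto-fix layer in the spirit of the one described above (this is not Uber's code): each fixer targets one known validation error, and anything it cannot repair falls through to the DLQ.
# Hypothetical auto-fix layer: try registered fixers before DLQ'ing a
# validation failure. Fixer names and event fields are illustrative only.
from typing import Callable, Dict, Optional

def fix_missing_timezone(event: dict) -> Optional[dict]:
    if not event.get("timezone") and event.get("city_timezone"):
        return {**event, "timezone": event["city_timezone"], "auto_fixed": True}
    return None

def fix_negative_amount(event: dict) -> Optional[dict]:
    amount = event.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        return {**event, "amount": abs(amount), "auto_fixed": True}
    return None

AUTO_FIXERS: Dict[str, Callable[[dict], Optional[dict]]] = {
    "missing_timezone": fix_missing_timezone,
    "negative_amount": fix_negative_amount,
}

def try_auto_fix(event: dict) -> Optional[dict]:
    """Return a repaired copy of the event, or None if no fixer applies."""
    for fixer in AUTO_FIXERS.values():
        fixed = fixer(event)
        if fixed is not None:
            return fixed
    return None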
7.2 Case Study: Stripe — Webhook DLQ
STRIPE WEBHOOK DELIVERY
Challenge:
- Millions of webhook deliveries per day
- Merchant endpoints vary wildly in reliability
- Must deliver OR tell merchant they missed events
DLQ Strategy:
Event Created → Webhook Queue → Delivery Attempt
│
┌────┴────┐
│ │
Success Failure
│ │
▼ ▼
Done Retry Queue
│
Exponential backoff:
1m, 5m, 30m, 2h, 8h, 24h
│
Still failing?
│
▼
Disable endpoint
Email merchant
Store in DLQ
Key Design Decisions:
1. LONG RETRY WINDOW
Retry for up to 3 days
Merchant might fix their server
2. SMART FAILURE CLASSIFICATION
- 4xx errors: Stop retrying (merchant's fault)
- 5xx errors: Keep retrying (might recover)
- Timeout: Keep retrying with backoff
3. ENDPOINT HEALTH TRACKING
Track success rate per endpoint
Temporarily disable unhealthy endpoints
Notify merchant to fix
4. EVENT REPLAY API
Merchants can request replay of missed events
Self-service recovery via dashboard
5. NO TRUE DLQ FOR MERCHANTS
After 3 days of failures:
- Events stored in "missed events" log
- Merchant can retrieve via API
- Not automatically retried
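A sketch of the schedule-driven retry decision described above (not Stripe's actual implementation): 4xx responses stop retries immediately, while 5xx responses and timeouts walk the backoff schedule until it is exhausted.
# Hypothetical webhook retry scheduler modeled on the schedule above.
from datetime import timedelta
from typing import Optional

RETRY_SCHEDULE = [
    timedelta(minutes=1), timedelta(minutes=5), timedelta(minutes=30),
    timedelta(hours=2), timedelta(hours=8), timedelta(hours=24),
]

def next_retry_delay(status_code: Optional[int], attempt: int) -> Optional[timedelta]:
    """Return how long to wait before the next attempt, or None to stop retrying."""
    if status_code is not None and 400 <= status_code < 500:
        return None                     # Merchant-side error: stop, record as missed event
    if attempt >= len(RETRY_SCHEDULE):
        return None                     # Schedule exhausted: disable endpoint, notify merchant
    return RETRY_SCHEDULE[attempt]      # 5xx or timeout (status_code is None): keep trying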
7.3 Case Study: Netflix — DLQ for Encoding Jobs
NETFLIX ENCODING PIPELINE
Challenge:
- Some videos fail to encode (corrupted source, unsupported format)
- Can't retry indefinitely (expensive compute)
- Need human review for edge cases
Architecture:
Upload → Validation → Encoding Queue → Encoder
│
┌────┴────┐
│ │
Success Failure
│ │
▼ ▼
Complete Classify
│
┌──────────────────┼──────────────┐
│ │ │
Transient Permanent Unknown
(retry) (skip) (review)
│ │ │
▼ ▼ ▼
Retry Queue Skip & Alert Review Queue
│
▼
Human Review
│
┌─────────┼─────────┐
│ │ │
Fix & Accept & Report
Retry Skip Bug
DLQ Categories:
1. TRANSIENT FAILURES
- Encoder crashed
- Resource contention
- Network issues
→ Automatic retry
2. PERMANENT FAILURES
- Unsupported codec (expected for some content)
- Corrupted source file
→ Skip, use lower quality fallback
3. UNKNOWN FAILURES
- New error types
- Unexpected behaviors
→ Human review queue
Operational Excellence:
Daily DLQ review meeting:
- Review new error types
- Categorize unknowns → transient or permanent
- Update classification rules
- Track trends
7.4 Summary: Industry Patterns
| Company | System | DLQ Strategy | Key Innovation |
|---|---|---|---|
| Uber | Trip events | Auto-fix layer | 80% auto-recovery |
| Stripe | Webhooks | Long retry + disable | 3-day retry window |
| Netflix | Encoding | Categorized + human review | Daily triage meetings |
| Shopify | Orders | Priority DLQ | VIP merchant fast lane |
| Airbnb | Bookings | Replay API | Self-service recovery |
|  | Messages | Sampling DLQ | Only DLQ 1% for analysis |
Chapter 8: Common Mistakes to Avoid
8.1 Mistake 1: No DLQ at All
❌ WRONG: Ignoring failures
async def process(msg):
try:
await do_work(msg)
except Exception as e:
logger.error(f"Failed: {e}")
# Message is lost forever!
Problems:
- No way to recover failed messages
- No visibility into failure patterns
- Silent data loss
✅ CORRECT: Always have a DLQ
async def process(msg):
try:
await do_work(msg)
except Exception as e:
logger.error(f"Failed: {e}")
await send_to_dlq(msg, e) # Preserve for recovery
8.2 Mistake 2: Retrying Poison Pills
❌ WRONG: Unlimited retries
while True:
try:
process(msg)
break
except:
time.sleep(1)
# Retry forever!
If msg causes crash or infinite loop:
- Consumer stuck on one message
- Queue backs up
- System fails
✅ CORRECT: Detect and quarantine poison pills
MAX_RETRIES = 3
MAX_PROCESSING_TIME = 30
async def safe_process(msg):
retry_count = get_retry_count(msg)
if retry_count >= MAX_RETRIES:
await send_to_dlq(msg, "Max retries exceeded")
return
try:
await asyncio.wait_for(
process(msg),
timeout=MAX_PROCESSING_TIME
)
except asyncio.TimeoutError:
await send_to_dlq(msg, "Processing timeout - possible poison pill")
except Exception as e:
await retry_or_dlq(msg, e, retry_count)
8.3 Mistake 3: Not Monitoring DLQ
❌ WRONG: DLQ as a black hole
# Send to DLQ and forget
await producer.send("orders.dlq", msg)
# No monitoring, no alerts
# DLQ grows to 1M messages unnoticed
✅ CORRECT: Monitor and alert on DLQ
# Send to DLQ with metrics
await producer.send("orders.dlq", msg)
metrics.increment("dlq.messages.total", tags={"topic": "orders"})
# Alert configuration
alerts:
- name: "DLQ Warning"
condition: dlq.depth > 100
severity: warning
- name: "DLQ Critical"
condition: dlq.depth > 1000
severity: critical
page_oncall: true
8.4 Mistake 4: No Replay Strategy
❌ WRONG: DLQ with no way to replay
Messages in DLQ, but:
- No tooling to view them
- No way to fix and replay
- No process for handling them
Result:
- DLQ becomes graveyard
- Messages rot forever
- Data effectively lost
✅ CORRECT: Build replay tooling from day one
class DLQTools:
async def list_messages(self, topic, limit=100):
"""View messages in DLQ."""
async def replay_single(self, msg_id):
"""Replay one message."""
async def replay_batch(self, filter, fix_fn=None):
"""Replay multiple messages with optional fix."""
async def discard(self, msg_id, reason):
"""Mark message as unrecoverable."""
async def export(self, topic, format="json"):
"""Export DLQ for analysis."""
8.5 Mistake 5: Same Processing for Retries
❌ WRONG: Retry with same logic that failed
First attempt:
call_external_api(msg) # Timeout
Second attempt:
call_external_api(msg) # Same timeout!
Third attempt:
call_external_api(msg) # Still timing out!
# Learned nothing, just wasted time
✅ CORRECT: Adaptive retry behavior
class AdaptiveProcessor:
async def process(self, msg):
retry_count = get_retry_count(msg)
if retry_count == 0:
# First attempt: normal timeout
timeout = 5.0
elif retry_count == 1:
# Second attempt: longer timeout
timeout = 30.0
else:
# Third+ attempt: very long timeout, simplified processing
timeout = 120.0
msg = simplify_message(msg) # Remove optional fields
await asyncio.wait_for(
self._do_process(msg),
timeout=timeout
)
8.6 Mistake Checklist
Before deploying DLQ handling, verify:
- DLQ exists — Every queue has a corresponding DLQ
- Retry limits set — Max retries defined and enforced
- Poison detection — Timeout on processing to catch hangs
- Error classification — Different errors handled differently
- Monitoring active — Alerts on DLQ depth and growth
- Replay tooling built — Can view, fix, and replay messages
- Runbook written — Steps for handling DLQ alerts
- Retention policy set — Know how long to keep DLQ messages
Part IV: Interview Preparation
Chapter 9: Interview Tips and Phrases
9.1 When to Bring Up DLQs
Bring up dead letter queues when:
- Designing any queue-based system
- Discussing error handling and resilience
- Interviewer asks "what if processing fails?"
- System has critical messages that can't be lost
- Discussing operational aspects of async systems
9.2 Key Phrases to Use
INTRODUCING DLQs:
"For error handling, I'd implement a dead letter queue pattern.
Messages that fail processing after N retries get moved to a
separate queue for investigation. This prevents one bad message
from blocking the entire pipeline."
EXPLAINING CLASSIFICATION:
"Not all failures are equal. I'd classify errors into categories:
validation errors go straight to DLQ since retrying won't help,
transient errors get retried with exponential backoff, and
anything that crashes or hangs the consumer—a poison pill—gets
immediately quarantined."
DISCUSSING RETRY STRATEGY:
"For retries, I'd use exponential backoff starting at 1 second,
doubling each time up to a max of 5 minutes. After 5 retries,
the message goes to the DLQ. The key is distinguishing between
transient failures worth retrying and permanent failures that
should go straight to DLQ."
ADDRESSING OPERATIONS:
"The DLQ isn't just a dumping ground—it's an operational tool.
We need monitoring on DLQ depth, alerting when it grows, and
tooling to inspect, fix, and replay messages. I'd also have
runbooks for common DLQ scenarios."
EXPLAINING POISON PILLS:
"A poison pill is a message that crashes the consumer every time
it's processed. Without detection, it creates an infinite loop:
process, crash, restart, process same message, crash again. I'd
add processing timeouts and crash detection to immediately
quarantine these before they block the queue."
9.3 Questions to Ask Interviewer
- "What's the criticality of messages? Can any be lost, or must all be processed?"
- "Are there different message types with different failure characteristics?"
- "What's the expected failure rate? That affects DLQ sizing and monitoring thresholds."
- "Is there a team to investigate DLQ messages, or should we automate recovery?"
9.4 Common Follow-up Questions
| Question | Good Answer |
|---|---|
| "How do you prevent poison pills?" | "Processing timeout—if a message takes more than 30 seconds, it's likely stuck. Also track crash patterns; if the same message causes repeated restarts, quarantine it immediately." |
| "How do you decide retry count?" | "Balance between giving transient failures time to recover and not wasting resources. Usually 3-5 retries with exponential backoff. Validation errors get 0 retries—they go straight to DLQ." |
| "What's in a DLQ message?" | "The original message, error details including stack trace, timestamp of first and last failure, retry count, and consumer ID. Enough context to debug without accessing other systems." |
| "How do you handle DLQ growth?" | "Monitor depth, alert at thresholds. Daily triage to categorize errors. Build auto-fix for common issues. Clear SLA: DLQ messages addressed within 24 hours." |
Chapter 10: Practice Problems
Problem 1: E-commerce Order Processing
Setup: You're designing the order processing system. Orders flow through: validation → inventory reservation → payment → fulfillment.
Requirements:
- No order can be lost
- System handles 1000 orders/minute
- Some orders have invalid data (wrong address format, etc.)
- Payment gateway occasionally times out
Questions:
- Where would you place DLQs in this pipeline?
- How would you handle validation vs payment failures differently?
- What's your replay strategy for failed orders?
Hints:
- Different stages fail for different reasons
- Validation failures are permanent, payment failures are transient
- Consider: should a failed order ever be auto-replayed?
DLQ Placement:
Orders → Validation → Inventory → Payment → Fulfillment
│ │ │ │
▼ ▼ ▼ ▼
validation.dlq inventory.dlq payment.dlq fulfillment.dlq
Error Handling by Stage:
1. Validation failures
   - No retry (bad data won't fix itself)
   - DLQ immediately
   - Notify customer of invalid order
   - Agent can fix and resubmit
2. Inventory failures
   - Retry 3x (might be race condition)
   - If item truly out of stock → notify customer
   - Don't retry indefinitely (item won't reappear)
3. Payment failures
   - Retry 5x with longer backoff (gateway might recover)
   - Check error code: 4xx = don't retry, 5xx = retry
   - After DLQ: manual review, contact customer
4. Fulfillment failures
   - Retry aggressively (critical to complete)
   - Alert if DLQ'd (money already charged!)
   - Manual fulfillment as backup
Replay Strategy:
- Validation DLQ: Customer support fixes, resubmits new order
- Payment DLQ: Retry after gateway recovery, or refund
- Never auto-replay without human approval (avoid double-charging)
Problem 2: Real-time Notification System
Setup: You're building a notification system that sends push notifications, emails, and SMS.
Requirements:
- 100K notifications/hour
- External providers (Twilio, SendGrid) have rate limits and outages
- Some notifications are time-sensitive (2FA codes)
Questions:
- How do you handle provider failures?
- What's your DLQ strategy for time-sensitive vs marketing notifications?
- How do you prevent one provider's issues from blocking others?
Hints:
- Time-sensitive notifications need fast failure, not long retry
- Provider-specific DLQs prevent cross-contamination
- Consider fallback providers
Architecture:
Notifications → Priority Router
│
┌───────────┼───────────┐
│ │ │
Critical Normal Bulk
(2FA) (transactional) (marketing)
│ │ │
Fast fail Standard Long retry
+ fallback retry + DLQ
Provider Isolation:
Email Queue → Email Consumer → SendGrid
│
└── Failure → email.dlq
SMS Queue → SMS Consumer → Twilio
│
└── Failure → sms.dlq
Push Queue → Push Consumer → Firebase
│
└── Failure → push.dlq
Time-Sensitive Handling:
- 2FA codes: 1 retry, then try fallback provider, then DLQ
- Don't retry for minutes—the code expires
- DLQ for logging, not replay
Bulk Notifications:
- Long retry window (24 hours)
- Rate limit retries to not overwhelm provider
- DLQ only after extended failure
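A sketch of the fast-fail path for 2FA codes: one retry on the primary provider, one attempt on a fallback, then a log-only DLQ entry. The provider clients and topic name are assumptions.
# Sketch: time-sensitive (2FA) delivery with one retry, a fallback provider,
# then a DLQ entry for visibility only. Provider clients are assumed interfaces.
import json

async def send_2fa_code(notification: dict, primary, fallback, dlq_producer) -> bool:
    last_error: Exception = RuntimeError("no attempt made")

    for provider in (primary, primary, fallback):  # 1 retry on primary, then fallback
        try:
            await provider.send_sms(notification["phone"], notification["body"])
            return True
        except Exception as error:
            last_error = error

    # The code will have expired by now, so this DLQ is never replayed
    await dlq_producer.send(
        "notifications.dlq.critical",
        value=json.dumps({"notification": notification, "error": str(last_error)}).encode(),
    )
    return False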
Problem 3: Financial Transaction Processing
Setup: You're processing financial transactions that must be applied exactly once.
Requirements:
- Zero tolerance for lost transactions
- Transactions must be idempotent (no double-processing)
- Regulatory requirement: all failures must be auditable
Questions:
- How do you ensure no transaction is lost?
- How do you prevent double-processing on replay?
- What audit trail do you maintain?
Hints:
- Idempotency keys are critical
- DLQ must be durable
- Consider: what if DLQ write fails?
Zero-Loss Guarantee:
Transaction → Process → Success: Commit
│
└── Failure → DLQ (with acknowledgment)
│
└── DLQ write fails?
│
▼
Fallback to database table
Alert immediately
Idempotency:
async def process_transaction(txn):
# Check if already processed
if await db.exists("processed_txns", txn.id):
logger.info(f"Transaction {txn.id} already processed, skipping")
return
# Process
result = await apply_transaction(txn)
# Mark as processed
await db.insert("processed_txns", {
"txn_id": txn.id,
"processed_at": now(),
"result": result
})
Audit Trail:
CREATE TABLE dlq_audit (
id SERIAL PRIMARY KEY,
txn_id VARCHAR(64) NOT NULL,
original_message JSONB NOT NULL,
error_type VARCHAR(100) NOT NULL,
error_message TEXT,
failed_at TIMESTAMP NOT NULL,
retry_count INT NOT NULL,
replayed_at TIMESTAMP,
replayed_by VARCHAR(100),
replay_result VARCHAR(50),
CONSTRAINT dlq_audit_txn_unique UNIQUE (txn_id, failed_at)
);
Every DLQ message is:
- Written to audit table
- Kept for 7 years (regulatory)
- Includes who replayed and when
Chapter 11: Mock Interview Dialogue
Scenario: Design DLQ for Payment Processing
Interviewer: "We're building a payment processing system. How would you handle failures and ensure no payment is lost?"
You: "Great question. Let me first understand the failure modes we need to handle, then design the DLQ strategy.
For payment processing, I see three categories of failures:
First, validation failures — invalid card number, expired card, missing required fields. These won't succeed on retry, so they go directly to the DLQ with no retries. We'd notify the customer to fix their payment method.
Second, transient failures — network timeouts, rate limits from the payment processor, temporary unavailability. These are worth retrying because they often self-resolve.
Third, poison pills — malformed messages that crash our processor, infinite loops, memory exhaustion. These need immediate quarantine.
For the architecture, I'd set it up like this:
Payment Request → Validation → Payment Queue → Processor
│ │
▼ ┌────┴────┐
validation.dlq Success Failure
│ │
▼ ▼
Done Classify
│
┌────────────┼────────────┐
│ │ │
Transient Permanent Poison
│ │ │
▼ ▼ ▼
Retry payment Immediate
Queue .dlq Alert
For transient failures, I'd use exponential backoff: 1 second, then 2, 4, 8, up to 60 seconds max. After 5 retries over about 2 minutes total, we'd give up and DLQ the message."
Interviewer: "What information would you store in the DLQ message?"
You: "The DLQ message needs enough context to debug and potentially replay without accessing other systems.
I'd include:
- Original payment request — full message including amount, customer, card token
- Error details — exception type, message, full stack trace
- Processing context — which processor attempted it, retry count, timestamps
- Kafka metadata — original topic, partition, offset for traceability
Something like:
{
"id": "dlq_abc123",
"original_message": {
"payment_id": "pay_789",
"amount": 99.99,
"customer_id": "cust_456",
"card_token": "tok_xxx"
},
"error": {
"type": "TimeoutError",
"message": "Payment processor timeout after 30s",
"stack_trace": "..."
},
"context": {
"processor_id": "processor-3",
"retry_count": 5,
"first_failed_at": "2024-01-15T10:00:00Z",
"last_failed_at": "2024-01-15T10:02:15Z"
}
}
This lets someone investigating the DLQ understand what happened without querying multiple systems."
Interviewer: "How would you handle replaying these failed payments?"
You: "Replay is tricky for payments because we absolutely cannot double-charge customers.
First, idempotency is critical. Every payment request has a unique payment_id. Before processing any replay, we check if that payment_id was already successfully processed. If yes, skip it.
Second, we'd have explicit replay tooling, not just re-publishing to the queue. Something like:
async def replay_payment(dlq_message):
payment_id = dlq_message.original_message['payment_id']
# Check if already succeeded
if await payment_db.is_processed(payment_id):
return "Already processed, skipping"
# Check if customer still wants this payment
if dlq_message.age > timedelta(hours=24):
return "Too old, requires manual review"
# Replay with fresh attempt
return await process_payment(dlq_message.original_message)
Third, for payments specifically, older DLQ messages need human review. If a payment failed 3 days ago, maybe the customer already bought elsewhere. We wouldn't auto-replay—we'd flag for customer support to reach out.
Finally, every replay is logged. Who replayed, when, what was the result. This gives us an audit trail for compliance."
Interviewer: "What metrics would you monitor?"
You: "For DLQ health, I'd track:
- DLQ depth — number of messages waiting. Alert at 100 (warning) and 1000 (critical).
- DLQ growth rate — messages added per minute. Spike detection for sudden failures.
- DLQ age — oldest message in queue. If > 24 hours, we're not investigating fast enough.
- Error category distribution — ratio of validation vs transient vs poison. Spike in poison pills means we have a bug.
- Replay success rate — when we replay, how often does it succeed? Low success rate means our retry strategy is wrong.
Dashboard would show these per payment type, per processor, and aggregate. On-call gets paged for any poison pill or if DLQ depth exceeds critical threshold."
Summary
DAY 4 KEY TAKEAWAYS
CORE CONCEPT:
• Poison pill: message that can never be processed
• Dead letter queue: quarantine for failed messages
• Goal: graceful failure, not infinite retry loops
ERROR CLASSIFICATION:
• Validation errors → No retry, DLQ immediately
• Transient errors → Retry with backoff, then DLQ
• Poison pills → Never retry, immediate quarantine
IMPLEMENTATION:
• Exponential backoff: 1s, 2s, 4s, 8s... up to max
• Processing timeout to detect poison pills
• Preserve full context in DLQ messages
• Build replay tooling from day one
OPERATIONS:
• Monitor DLQ depth and growth rate
• Alert before DLQ becomes a crisis
• Daily triage of new error patterns
• Clear SLA for DLQ investigation
INTERVIEW TIPS:
• Mention DLQ when designing any async system
• Classify errors: validation vs transient vs poison
• Always discuss replay and idempotency together
• Don't forget monitoring and operational aspects
DEFAULT CHOICE:
• 3-5 retries with exponential backoff
• Categorized DLQs (validation, transient, poison)
• Process timeout = 2x normal processing time
📚 Further Reading
Official Documentation
- AWS SQS Dead-Letter Queues: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/dlx.html
- Kafka Error Handling: https://kafka.apache.org/documentation/#consumerconfigs
Engineering Blogs
- Uber Engineering: "Reliable Reprocessing" — https://eng.uber.com/reliable-reprocessing/
- Stripe Engineering: "Designing Robust Webhook Systems" — https://stripe.com/blog/webhooks
- Netflix Tech Blog: "Evolution of the Netflix Data Pipeline" — https://netflixtechblog.com/
Books
- "Release It!" by Michael Nygard — Chapters on stability patterns
- "Enterprise Integration Patterns" by Hohpe & Woolf — Dead Letter Channel pattern
Tools
- Kafka Connect DLQ: Built-in dead letter queue support
- Spring Retry: Retry and DLQ patterns for Java
- Celery: Task queue with retry and dead letter support
End of Day 4: Dead Letters and Poison Pills
Tomorrow: Day 5 — Audit Log System. We've learned how to handle message failures with DLQs. Tomorrow we'll design a complete audit log system—capturing every action, storing immutably, and querying efficiently. We'll combine everything from this week: queues, transactional outbox, backpressure, and error handling into one cohesive system.