Week 2 Preview: Failure-First Design
System Design Mastery Series
Welcome to Week 2
Last week, you learned how to build distributed systems—partitioning data, replicating for availability, protecting against overload, and handling skewed traffic. You built a session store for 10 million users.
This week, we assume everything you built will fail.
Networks will drop packets. Services will crash mid-request. Databases will time out. Users will click buttons twice. And when failures cascade, your entire system can collapse in seconds.
Week 2 teaches you to design systems where failure is the normal case, not the exception.
The Week at a Glance
| Day | Topic | Core Question |
|---|---|---|
| 1 | Timeout Hell | How do you set timeouts when you call 3 services, each with different latency? |
| 2 | Idempotency in Practice | User clicks "Pay" twice—did you charge them twice? |
| 3 | Circuit Breakers | Your payment provider is down—do you fail fast or retry forever? |
| 4 | Webhook Delivery | Receiver is down for 2 hours—how do you guarantee delivery? |
| 5 | Distributed Cron | Leader dies mid-job—does the job run twice, once, or never? |
Week 2 Theme: Designing for Failure
The Mindset Shift
Week 1 thinking: "How do I make this system work?"
Week 2 thinking: "How does this system fail, and what happens when it does?"
Every design decision this week starts with the failure case:
- Before choosing a timeout: "What happens when this times out?"
- Before adding a retry: "What if the first request actually succeeded?"
- Before implementing a circuit breaker: "What's the user experience when the circuit opens?"
Why This Matters
The systems that survive at scale are not the ones that never fail—they're the ones that fail gracefully. Consider these real-world incidents:
- Amazon (2017): S3 outage took down half the internet because services had no fallback for "S3 unavailable"
- Knight Capital (2012): Software bug + no circuit breaker = $440 million loss in 45 minutes
- Stripe: Processes billions in payments by making every operation idempotent—double-clicks never double-charge
This week, you'll learn the patterns that make the difference.
The Three Systems You'll Design
System 1: Payment Processing Pipeline (Days 1-3)
The highest-stakes system in most companies. A bug here means real money lost, double charges, or failed transactions.
┌─────────────────────────────────────────────────────────────────────────┐
│ Payment Processing Pipeline │
│ │
│ User clicks ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ "Pay $99" ───▶ │ Fraud │───▶│ Bank │───▶│ Notify │───▶ Done │
│ │ Check │ │ API │ │ Service │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ 200ms 2000ms 100ms │
│ │
│ What can go wrong? │
│ • Fraud check times out │
│ • Bank API returns "maybe" (network died mid-response) │
│ • User clicks Pay again while waiting │
│ • Notification fails—does payment fail too? │
│ • Bank API is degraded on Black Friday │
└─────────────────────────────────────────────────────────────────────────┘
Day 1: Set timeouts correctly. Handle the "bank is slow today" scenario.
Day 2: Make it idempotent. Handle the "user clicked twice" scenario.
Day 3: Add circuit breakers. Handle the "bank is down on Black Friday" scenario.
System 2: Webhook Delivery System (Day 4)
Deliver events reliably to external systems that you don't control.
┌─────────────────────────────────────────────────────────────────────────┐
│ Webhook Delivery System │
│ │
│ Events: ┌──────────┐ │
│ • order.created │ │ ┌─────────────────────────┐ │
│ • payment.success │ Queue │───▶│ Customer Endpoints │ │
│ • refund.issued │ │ │ │ │
│ └──────────┘ │ • https://customer1/wh │ │
│ │ │ • https://customer2/wh │ │
│ │ │ • https://customer3/wh │ (down) │
│ │ └─────────────────────────┘ │
│ │ │
│ ┌──────────┐ │
│ │ DLQ │ "Customer 3 has failed 50 times" │
│ │ (Dead │ "Oldest undelivered: 2 hours ago" │
│ │ Letter) │ │
│ └──────────┘ │
│ │
│ Challenges: │
│ • Customer endpoint is slow (10s response time) │
│ • Customer endpoint is down for maintenance │
│ • Customer endpoint returns 200 but didn't process │
│ • Same event delivered twice—customer must handle it │
│ • 1M webhooks/hour throughput requirement │
└─────────────────────────────────────────────────────────────────────────┘
Goal: Guarantee at-least-once delivery with clear visibility into failures.
System 3: Distributed Cron / Job Scheduler (Day 5)
Run scheduled jobs exactly once, even when servers crash mid-execution.
┌─────────────────────────────────────────────────────────────────────────┐
│ Distributed Job Scheduler │
│ │
│ Jobs: │
│ • "Send daily digest" — every day at 9am │
│ • "Generate reports" — every hour │
│ • "Cleanup expired sessions" — every 15 minutes │
│ │
│ Scheduler Nodes: │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ (Leader) │ │(Follower)│ │(Follower)│ │
│ └────┬─────┘ └──────────┘ └──────────┘ │
│ │ │
│ │ "9:00 AM — trigger daily_digest job" │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Worker Pool │ │
│ │ Worker 1: Running daily_digest (50%) │ │
│ │ Worker 2: Idle │ │
│ │ Worker 3: Idle │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Failure scenarios: │
│ • Leader dies before triggering 9am job — job missed? │
│ • Leader dies after triggering but before marking complete — runs 2x? │
│ • Worker dies mid-job — restart from scratch or resume? │
│ • Clock skew between nodes — job triggers at wrong time? │
│ • Deploy happens at 9am — job runs on both old and new version? │
└─────────────────────────────────────────────────────────────────────────┘
Goal: Jobs run exactly once, on schedule, even through failures and deployments.
Key Concepts for Week 2
Concept 1: Failure Modes
Not all failures are the same. Understanding the type of failure changes your response.
| Failure Mode | Description | Example | How to Handle |
|---|---|---|---|
| Crash | Process dies suddenly | OOM kill, hardware failure | Restart, failover |
| Omission | Message lost, no response | Network partition, packet drop | Timeout + retry |
| Timing | Response too slow | Overloaded service, GC pause | Timeout + fallback |
| Byzantine | Incorrect behavior | Bug, data corruption | Harder—validation, checksums |
This week focuses on Omission and Timing failures—the most common in distributed systems.
Concept 2: Timeouts
Timeouts seem simple until you have to choose actual numbers.
Your service has 5 seconds to respond to users.
You call:
- Service A: P99 = 200ms
- Service B: P99 = 500ms
- Service C: P99 = 100ms
Question: What timeouts do you set?
❌ Bad answer: "5 seconds for each"
→ If A is slow, you have no time left for B and C
❌ Bad answer: "Use their P99 as timeout"
→ P99 means 1% of requests are slower—you'll timeout constantly
✅ Good answer: "Timeout budget with headroom"
→ Total budget: 4.5s (leave 500ms buffer)
→ A: 500ms (2.5x P99), B: 2s (4x P99), C: 300ms (3x P99)
→ Parallel where possible to save time
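As a rough sketch, here is how that budget might be enforced in code. The numbers mirror the example above, and call_service is a stand-in for your actual clients, so treat this as an illustration rather than a prescription.

import time

# Per-call caps from the budget above (seconds); illustrative values only.
PER_CALL_TIMEOUT = {"fraud_check": 0.5, "bank_api": 2.0, "notify": 0.3}

def call_with_budget(name, call_service, deadline):
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError(f"budget exhausted before calling {name}")
    # Never wait longer than the per-call cap OR the time left in the overall budget.
    return call_service(timeout=min(PER_CALL_TIMEOUT[name], remaining))

def handle_payment(fraud_check, bank_api, notify):
    deadline = time.monotonic() + 4.5   # 5s SLA minus a 500ms buffer
    call_with_budget("fraud_check", fraud_check, deadline)
    call_with_budget("bank_api", bank_api, deadline)
    call_with_budget("notify", notify, deadline)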
Day 1 deep-dives into timeout strategies, cascading failures, and adaptive timeouts.
Concept 3: Retry Strategies
Retrying failed requests seems helpful, but naive retries cause stampedes.
10 servers each retry 3 times with no delay:
Request fails at 10:00:00.000
Server 1 retries at 10:00:00.001
Server 2 retries at 10:00:00.001
Server 3 retries at 10:00:00.001
... (30 requests hit at once)
This multiplies the load by 4x (each original request plus 3 retries) and can take down an already struggling service.
Retry best practices:
- Exponential backoff: Wait longer between each retry
- Jitter: Add randomness so retries don't synchronize
- Retry budgets: Limit total retries across all clients
- Idempotency: Ensure retries don't cause duplicate effects
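Put together, a retry wrapper might look like this sketch. It assumes the caller raises a specific exception type for retryable failures; RetryableError is just a placeholder name.

import random
import time

class RetryableError(Exception):
    """Placeholder for errors worth retrying (timeouts, 503s)."""

def call_with_retries(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise                                     # out of attempts, surface the error
            delay = min(0.1 * (2 ** attempt), 30.0)       # exponential backoff, capped
            time.sleep(delay + random.uniform(0, delay * 0.1))  # jitter de-synchronizes clients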
Concept 4: Idempotency
An operation is idempotent if performing it multiple times has the same effect as performing it once.
Idempotent:
DELETE /users/123 — Deleting twice = user still deleted
PUT /users/123 {"name": "Alice"} — Setting twice = name is Alice
NOT Idempotent:
POST /orders — Creating twice = two orders
POST /payments — Paying twice = charged twice!
Making non-idempotent operations idempotent:
# Client sends idempotency key
POST /payments
Idempotency-Key: user123-order456-attempt1
{"amount": 99.00}
# Server logic:
if idempotency_key in processed_requests:
    return cached_response  # Don't charge again
else:
    process_payment()
    cache_response(idempotency_key, response)
    return response
Day 2 covers idempotency key strategies, deduplication windows, and edge cases.
Concept 5: Circuit Breakers
A circuit breaker stops calling a failing service to prevent cascade failures.
States:
┌─────────────────────────────────────────────────────────────────┐
│ │
│ ┌──────────┐ failures > threshold ┌──────────┐ │
│ │ CLOSED │ ──────────────────────────▶ │ OPEN │ │
│ │(normal) │ │(failing) │ │
│ └────┬─────┘ └─────┬────┘ │
│ │ │ │
│ │ success timeout │ │
│ │ ▼ │
│ │ ┌─────────────────┐ │
│ │ │ HALF-OPEN │ │
│ │ │ (testing) │ │
│ │ └────────┬───────┘ │
│ │ │ │
│ │ success failure │ │
│ ◀────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
CLOSED: Normal operation, requests flow through
OPEN: Fail immediately, don't call downstream service
HALF-OPEN: Allow one request through to test if service recovered
When circuit breakers help: Prevent cascade failures, fail fast, give struggling services time to recover.
When they hurt: A breaker can open during a legitimate traffic spike, or block every user when only a subset of requests is failing.
Day 3 covers circuit breaker implementation, tuning, and alternatives.
Concept 6: Delivery Guarantees
When delivering messages, you can choose your guarantee:
| Guarantee | Meaning | Implementation |
|---|---|---|
| At-most-once | Message may be lost, never duplicated | Fire and forget |
| At-least-once | Message delivered 1+ times, may duplicate | Retry until ACK |
| Exactly-once | Message delivered exactly once | At-least-once + idempotency |
The truth about exactly-once: It's impossible in distributed systems without receiver cooperation. What we actually implement is "at-least-once delivery with idempotent receivers."
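To make that concrete, an idempotent receiver typically dedupes on an event ID before doing any work. A minimal sketch, assuming the sender attaches a stable "id" field and handle() stands in for your business logic:

processed = set()   # in production: a durable store keyed by event ID, not memory

def handle(event):
    ...   # your actual business logic

def receive_webhook(event):
    event_id = event["id"]        # assumes the sender attaches a stable, unique ID
    if event_id in processed:
        return 200                # duplicate delivery: acknowledge and do nothing
    handle(event)
    processed.add(event_id)
    return 200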
Day 4 covers how to build reliable webhook delivery with these guarantees.
Concept 7: Leader Election
When multiple nodes could do a job, how do you ensure only one does it?
Without leader election:
Node 1: "It's 9 AM, run daily_digest"
Node 2: "It's 9 AM, run daily_digest"
Node 3: "It's 9 AM, run daily_digest"
→ Job runs 3 times!
With leader election:
Node 1 (Leader): "It's 9 AM, run daily_digest"
Node 2 (Follower): "Node 1 is leader, I'll wait"
Node 3 (Follower): "Node 1 is leader, I'll wait"
→ Job runs once
Leader election mechanisms:
- Consensus algorithms (Raft, Paxos)
- Lease-based (acquire lock with TTL)
- External coordinator (ZooKeeper, etcd, Redis)
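A minimal lease-based sketch (the second mechanism above), using Redis SET with NX and EX. The key name and TTL are arbitrary choices for illustration, and a production version would check ownership and renew atomically (for example with a Lua script):

import redis

r = redis.Redis()
LEASE_KEY, LEASE_TTL = "scheduler:leader", 15   # seconds; illustrative values

def try_become_leader(node_id: str) -> bool:
    # Succeeds only if no one currently holds the lease (NX), and the lease
    # expires automatically (EX) so a dead leader doesn't block forever.
    return bool(r.set(LEASE_KEY, node_id, nx=True, ex=LEASE_TTL))

def renew_lease(node_id: str) -> bool:
    # Non-atomic check-then-renew, shown for clarity only.
    if r.get(LEASE_KEY) == node_id.encode():
        return bool(r.expire(LEASE_KEY, LEASE_TTL))
    return False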
Day 5 covers leader election, fencing tokens, and building reliable schedulers.
Daily Breakdown
Day 1: Timeout Hell
Morning Concept (10 min)
- Timeout budgets: Dividing time across dependent services
- Cascading timeout failures: How one slow service takes down everything
- Adaptive timeouts: Adjusting based on observed latency
Design Challenge (35 min)
Design a payment service calling:
- Fraud check (P99 = 200ms)
- Bank API (P99 = 2s)
- Notification service (P99 = 100ms)
Challenge questions:
- What timeout do you set for each service?
- Bank API is slow today (P99 = 5s)—what happens to your users?
- How do you prevent notification failure from failing the whole payment?
Discussion (15 min)
- Would you use adaptive timeouts? What's the risk?
- How do you detect a slow downstream before users notice?
- What's the difference between timeout and deadline propagation?
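On that last question: a timeout is local ("wait at most 2s for this call"), while deadline propagation passes how much time is left downstream so every hop can respect it. A sketch, where the header name is a made-up convention rather than a standard:

import time
import requests

def call_downstream(url, deadline):
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("deadline already passed; don't bother calling")
    return requests.get(
        url,
        timeout=remaining,  # local timeout: never wait past our own deadline
        # Propagate the remaining budget so the downstream service can honor it too.
        headers={"X-Request-Deadline-Ms": str(int(remaining * 1000))},
    )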
Day 2: Idempotency in Practice
Morning Concept (10 min)
- Idempotency key strategies: Client-generated vs server-generated
- Deduplication windows: How long to remember processed requests
- The "network timeout" problem: Request succeeded but client doesn't know
Design Challenge (35 min)
Design payment retry logic for the system from Day 1.
Challenge scenario:
- User clicks "Pay $99"
- Request sent to your server
- Your server calls bank API
- Network timeout—did the bank charge or not?
- User clicks "Pay" again
Design the system so the user is never charged twice.
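One common shape for this (a sketch, not the only answer): record the attempt in a PENDING state before calling the bank, so an ambiguous timeout can be reconciled instead of blindly retried. The in-memory dict stands in for a durable table, and charge_bank / check_bank_status are hypothetical calls to the payment provider:

attempts = {}   # idempotency_key -> {"status": ..., "response": ...}; really a durable table

def pay(idempotency_key, amount, charge_bank, check_bank_status):
    record = attempts.get(idempotency_key)
    if record and record["status"] == "SUCCEEDED":
        return record["response"]                 # replay: return the original result
    if record and record["status"] == "PENDING":
        # An earlier call timed out with an unknown outcome: ask the bank, don't re-charge.
        return check_bank_status(idempotency_key)
    attempts[idempotency_key] = {"status": "PENDING", "response": None}
    try:
        response = charge_bank(amount, idempotency_key)   # pass the key downstream too
        attempts[idempotency_key] = {"status": "SUCCEEDED", "response": response}
        return response
    except TimeoutError:
        # Leave the record PENDING so a reconciliation job can resolve it later.
        raise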
Discussion (15 min)
- Where do you store idempotency keys? (Redis? Database?)
- What's the right TTL for idempotency records?
- How do you handle idempotency key collisions?
Day 3: Circuit Breakers
Morning Concept (10 min)
- Circuit breaker states: Closed, Open, Half-Open
- Failure detection: Count-based vs time-based
- When circuit breakers cause harm
Design Challenge (35 min)
Add circuit breakers to your payment system.
Challenge scenario: It's Black Friday. Bank API is degraded (50% of requests failing, P99 = 10s).
- Circuit breaker opens after 10 failures
- 1000 customers are trying to pay right now
Questions:
- What's the customer experience when the circuit opens?
- How do you communicate "try again later" vs "payment failed"?
- Should you have different circuit breaker settings for different times?
Discussion (15 min)
- Circuit breaker vs retry with backoff vs bulkhead: When each?
- How do you test circuit breakers in production?
- What metrics do you monitor for circuit breaker health?
Day 4: Webhook Delivery
Morning Concept (10 min)
- Delivery guarantees: At-most-once, at-least-once, exactly-once
- Why receivers must be idempotent
- Webhook security: Signatures, replay attacks
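For the signature piece, the usual pattern is an HMAC over the raw payload. This is a generic sketch, not any particular provider's scheme:

import hmac
import hashlib

def sign(payload: bytes, secret: bytes) -> str:
    # Sender attaches this hex digest as a header; receiver recomputes and compares.
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, secret: bytes, received_signature: str) -> bool:
    expected = sign(payload, secret)
    return hmac.compare_digest(expected, received_signature)   # constant-time comparison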
Design Challenge (35 min)
Design a webhook system delivering 1M webhooks/hour.
Challenge scenario: Customer's endpoint is down for 2 hours.
- How many retries?
- What backoff strategy?
- When do you give up and put in dead letter queue?
- How does the customer know they missed webhooks?
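One possible shape for the retry and dead-letter questions above. The schedule, the synchronous sleep, and the dlq object are all illustrative; a real system would schedule retries asynchronously per event:

import time
import requests

RETRY_SCHEDULE = [1, 5, 30, 120, 600, 3600, 7200]   # seconds between attempts (~3h total)

def deliver(event, url, dlq):
    for delay in [0] + RETRY_SCHEDULE:
        time.sleep(delay)
        try:
            resp = requests.post(url, json=event, timeout=5)
            if resp.status_code < 300:
                return True                      # delivered (at least once)
        except requests.RequestException:
            pass                                 # network errors count as failed attempts
    dlq.append(event)                            # retries exhausted: park it for manual replay
    return False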
Discussion (15 min)
- Design the dead letter queue and manual retry interface
- How do you handle slow receivers (10s response time)?
- What's your strategy for a receiver that returns 200 but doesn't process?
Day 5: Distributed Cron
Morning Concept (10 min)
- Leader election basics: Why it's needed, how it works
- Fencing tokens: Preventing "zombie leaders" from causing duplicate runs
- Why ZooKeeper/etcd exist
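The fencing-token idea from the second bullet, as a sketch: every new leader gets a strictly larger token when it acquires the lease (for example an etcd revision or a Redis INCR), and the shared job store rejects writes carrying anything older. Names here are illustrative:

class JobStore:
    # Stands in for whatever shared storage records job results.
    def __init__(self):
        self.highest_token = 0
        self.results = {}

    def write(self, fencing_token: int, job_id: str, result):
        if fencing_token < self.highest_token:
            raise PermissionError("stale leader (zombie): write rejected")
        self.highest_token = fencing_token
        self.results[job_id] = result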
Design Challenge (35 min)
Design a job scheduler where jobs run exactly once, even during deploys.
Challenge scenario: Leader dies mid-job.
- Does the job restart from the beginning?
- Does it resume where it left off?
- How do you prevent it from running twice on two different nodes?
Discussion (15 min)
- Compare to Celery Beat, Kubernetes CronJobs, and Temporal
- How do cloud providers (AWS, GCP) solve this?
- What's the simplest solution that actually works?
Skills You'll Have by Friday
Technical Skills
| Skill | You Can Now... |
|---|---|
| Timeout design | Set timeouts for multi-service calls without cascade failures |
| Idempotency | Design payment systems that never double-charge |
| Circuit breakers | Protect systems from downstream failures |
| Webhook delivery | Build reliable at-least-once delivery systems |
| Leader election | Ensure scheduled jobs run exactly once |
Interview Skills
By the end of Week 2, you'll be able to answer:
- "How do you handle a downstream service that's slow?"
- "How do you prevent duplicate payments?"
- "Design a system to reliably deliver notifications to external systems"
- "How do you ensure a scheduled job runs exactly once?"
- "What happens when [any component] fails?"
Mental Model
The biggest shift: You'll start every design by asking "What happens when this fails?"
This is the difference between junior and senior system design:
- Junior: "Here's how it works"
- Senior: "Here's how it works, here's how it fails, and here's how we handle that"
Prerequisites Check
Before starting Week 2, make sure you're comfortable with:
From Week 1
- Partitioning: You know how data is distributed across nodes
- Replication: You know how data is copied for availability
- Consistency models: You understand eventual vs strong consistency
General Knowledge
- HTTP basics: Status codes, headers, request/response flow
- Basic queuing: Producer/consumer pattern, message acknowledgment
- Database transactions: ACID properties, commit/rollback
Helpful but Not Required
- Experience with Redis or similar
- Familiarity with payment systems
- Knowledge of cron syntax
What Makes Week 2 Different
Week 1 vs Week 2
| Aspect | Week 1 | Week 2 |
|---|---|---|
| Focus | Data storage and retrieval | Operations that can fail |
| Key question | "Where does data live?" | "What happens when this fails?" |
| Systems | Session store (CRUD-focused) | Payment pipeline (transaction-focused) |
| Main challenge | Scale and distribution | Reliability and correctness |
| Failure handling | "Add replicas" | "Add retries, idempotency, circuit breakers" |
The Payment System Thread
Days 1-3 build on each other with the same payment system:
Day 1: Build it with proper timeouts
↓
Day 2: Make it idempotent (handle double-clicks)
↓
Day 3: Add circuit breakers (handle downstream failures)
↓
Result: Production-ready payment processing pattern
This is how real systems are built—layer by layer, addressing failure modes one at a time.
Common Pitfalls to Avoid
Pitfall 1: "We'll Just Retry"
❌ Bad: Retry immediately, 3 times, on any error
→ Amplifies load on struggling services
→ May cause duplicate operations
✅ Good: Exponential backoff with jitter, only on retryable errors
→ Gives services time to recover
→ Idempotency handles duplicates
Pitfall 2: "Long Timeout = Safe"
❌ Bad: Set 30s timeout "just to be safe"
→ User stares at spinner for 30 seconds
→ Thread pool exhausted waiting
→ Cascade failure when requests pile up
✅ Good: Timeout based on SLA + small buffer
→ Fail fast, show user meaningful error
→ Free resources for other requests
Pitfall 3: "Circuit Breaker Opens = System Down"
❌ Bad: Circuit opens, return 500 to all users
→ Same experience as if you had no circuit breaker
✅ Good: Circuit opens, use fallback or degraded mode
→ "Payment processing delayed, we'll email confirmation"
→ Partial functionality is better than nothing
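In code, the degraded path might look like this sketch. The breaker object is assumed to behave like the quick-reference CircuitBreaker later in this document, and queue_for_later is a hypothetical helper that defers the charge:

class CircuitOpenError(Exception):
    pass

def pay(amount, breaker, charge_bank, queue_for_later):
    try:
        result = breaker.call(lambda: charge_bank(amount))
        return {"status": "paid", "detail": result}
    except CircuitOpenError:
        # Degraded mode: accept the order now, settle the charge asynchronously.
        queue_for_later(amount)
        return {"status": "pending",
                "detail": "Payment processing delayed, we'll email confirmation"}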
Pitfall 4: "Exactly-Once is Simple"
❌ Bad: Assume exactly-once delivery is a solved problem
→ Build system that breaks with duplicates
→ Production incident on first retry
✅ Good: Build for at-least-once with idempotent receivers
→ Every operation can safely be replayed
→ "Exactly-once" is the emergent behavior
Quick Reference: Week 2 Patterns
Timeout Budget
total_budget = 5000 # ms
service_a_timeout = 500 # Allow 10% for critical fast service
service_b_timeout = 3000 # Allow 60% for slow but important service
buffer = 1500 # Reserve 30% for processing + unknowns
Exponential Backoff with Jitter
import random

def get_retry_delay(attempt: int) -> float:
    base_delay = 0.1   # 100ms
    max_delay = 30.0   # 30 seconds
    exponential = min(base_delay * (2 ** attempt), max_delay)
    jitter = random.uniform(0, exponential * 0.1)   # randomness keeps clients from retrying in lockstep
    return exponential + jitter
Idempotency Key Pattern
import json

# Assumes `redis` is a connected client (e.g. redis.Redis()) and `bank_api`
# is your payment provider's client.
def process_payment(idempotency_key: str, amount: float):
    # Check if already processed
    existing = redis.get(f"idem:{idempotency_key}")
    if existing:
        return json.loads(existing)  # Return cached response
    # Process payment
    result = bank_api.charge(amount)
    # Cache response for future retries (24 hour TTL)
    redis.setex(f"idem:{idempotency_key}", 86400, json.dumps(result))
    return result
Circuit Breaker States
import time

class CircuitBreaker:
    # State-machine sketch only: assumes the OPEN / HALF_OPEN / CLOSED constants,
    # next_attempt_time, CircuitOpenError, and the record_success() /
    # record_failure() transition methods are defined elsewhere on the class.
    def call(self, func):
        if self.state == OPEN:
            if time.time() > self.next_attempt_time:
                self.state = HALF_OPEN    # allow one trial request through
            else:
                raise CircuitOpenError()  # fail fast, don't call downstream
        try:
            result = func()
            self.record_success()    # success (re)closes the circuit
            return result
        except Exception:
            self.record_failure()    # enough failures open the circuit
            raise
Let's Begin
Week 2 is about embracing failure as a first-class concern. Every system fails—the question is whether your system fails gracefully or catastrophically.
By Friday, you'll have designed:
- A payment pipeline that handles timeouts, retries, and circuit breakers
- A webhook system that guarantees delivery
- A job scheduler that runs jobs exactly once
You'll think differently about system design. You'll ask "what if this fails?" before "how do I make this work?"
Let's start with Day 1: Timeout Hell.
"Everything fails, all the time." — Werner Vogels, CTO of Amazon