Week 1 — Day 2: Replication Trade-offs
System Design Mastery Series
Preface
Yesterday, you learned how to split data across machines. Today, you learn how to copy it.
Replication seems simple: keep copies of data on multiple machines. But the moment you have copies, you face the hardest questions in distributed systems:
- Which copy is the "truth"?
- How long can copies be out of sync?
- What happens when the machine holding the truth dies?
By the end of this session, you won't just know what replication is—you'll understand exactly what you're trading with every replication decision, and you'll be able to design systems that handle these trade-offs explicitly.
Let's begin.
Part I: Foundations
Chapter 1: Why Replication Matters
1.1 The Three Reasons to Replicate
Every replicated system exists for one or more of these reasons:
Reason 1: High Availability
If your database server dies, your application dies with it. With replicas, you can fail over to another copy and keep serving requests.
The math: A single server with 99.9% uptime has about 8.7 hours of downtime per year. Two servers, either of which can serve requests, have combined uptime of 99.9999%—about 31 seconds of downtime per year (assuming independent failures).
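The arithmetic is worth checking yourself. A quick sketch, assuming the 99.9% figure above and fully independent failures:
# Back-of-envelope availability math (assumes failures are independent).
HOURS_PER_YEAR = 365 * 24          # 8,760

def downtime_hours(uptime: float) -> float:
    return HOURS_PER_YEAR * (1 - uptime)

single = 0.999
print(f"{downtime_hours(single):.1f} hours/year")         # ~8.8 hours of downtime

# Two interchangeable servers are down only when both are down at once.
pair = 1 - (1 - single) ** 2                               # 0.999999
print(f"{downtime_hours(pair) * 3600:.1f} seconds/year")   # ~31.5 seconds of downtime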
Reason 2: Read Scaling
A single database might handle 10,000 reads per second. With 5 read replicas, you can handle 50,000 reads per second. Writes still go to one place, but reads spread across all copies.
When this works: Read-heavy workloads (10:1 or higher read-to-write ratio). Most web applications qualify.
When this doesn't work: Write-heavy workloads. Replication doesn't help with write throughput—every write must go to every replica.
Reason 3: Geographic Distribution
Users in Tokyo shouldn't wait for a round trip to a server in Virginia. With replicas in multiple regions, users read from nearby copies.
The physics: Light travels ~200km per millisecond in fiber. Tokyo to Virginia is ~11,000km = ~55ms minimum latency each way. A local replica reduces this to <10ms.
1.2 The Fundamental Trade-off
Here's the truth about replication that every engineer must internalize:
You cannot have instantaneous, consistent replication without sacrificing availability or performance.
This isn't a limitation of current technology. It's physics and mathematics. The CAP theorem formalizes this, but the intuition is simple:
- Keeping replicas perfectly in sync requires coordination
- Coordination takes time (network round trips)
- During that time, you're either waiting (slower) or serving potentially stale data (inconsistent)
Every replication strategy is a different point on this trade-off spectrum. Your job is to choose the right point for your system.
Chapter 2: Replication Architectures
There are three fundamental architectures for replication. Each makes different trade-offs.
2.1 Leader-Follower (Primary-Replica)
How It Works
One node is the leader (also called primary or master). All writes go to the leader. The leader then sends changes to followers (also called replicas or secondaries). Reads can go to the leader or any follower.
Writes
│
▼
┌──────────┐
│ Leader │
└────┬─────┘
│ Replication stream
┌─────┼─────┐
▼ ▼ ▼
┌────────┐┌────────┐┌────────┐
│Follower││Follower││Follower│
└────────┘└────────┘└────────┘
▲ ▲ ▲
│ │ │
Reads (distributed)
Synchronous vs Asynchronous
Synchronous replication: Leader waits for follower(s) to confirm the write before acknowledging to the client.
Client ──Write──▶ Leader ──Replicate──▶ Follower
│ │
│◀──────Confirm────────┤
│
Client ◀───ACK─────┤
Pros:
- Followers are always up-to-date
- No data loss on leader failure
Cons:
- Write latency includes network round-trip to follower
- If follower is slow or dead, writes stall
Asynchronous replication: Leader acknowledges to client immediately, replicates in the background.
Client ──Write──▶ Leader ──ACK──▶ Client
│
│ (background)
▼
Follower
Pros:
- Fast writes (no waiting for followers)
- Followers being slow doesn't affect clients
Cons:
- Followers may lag behind leader
- Data loss possible on leader failure (unreplicated writes)
Semi-Synchronous: The Practical Middle Ground
Most production systems use semi-synchronous replication: wait for at least one follower to confirm, then acknowledge to client. Other followers replicate asynchronously.
Client ──Write──▶ Leader ──Replicate──▶ Follower 1 (sync)
│ │
│◀──────Confirm────────┤
│
Client ◀───ACK─────┤
│
▼ (async)
Follower 2, 3, ...
This gives you:
- Durability: At least one replica has every committed write
- Performance: Only wait for the fastest follower
- Availability: Can tolerate slow followers
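To make the acknowledgment rule concrete, here is a minimal sketch of a semi-synchronous write path. The Follower stub and its replicate() method are hypothetical stand-ins, not a real database API:
import concurrent.futures
from typing import List

class Follower:
    """Hypothetical follower stub; a real one ships the record over the network and applies it."""
    def replicate(self, record: bytes) -> None:
        pass

class SemiSyncLeader:
    def __init__(self, followers: List[Follower]):
        self.followers = followers
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(followers)))

    def apply_locally(self, record: bytes) -> None:
        pass  # placeholder for appending to the leader's own WAL and applying the change

    def write(self, record: bytes) -> None:
        self.apply_locally(record)
        futures = [self.pool.submit(f.replicate, record) for f in self.followers]
        # Semi-sync: block only until the FIRST follower confirms, then acknowledge the client.
        done, _ = concurrent.futures.wait(
            futures, timeout=1.0, return_when=concurrent.futures.FIRST_COMPLETED
        )
        if not done:
            raise TimeoutError("no follower confirmed in time; do not acknowledge the write")
        # The remaining followers keep replicating in the background (asynchronously).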
When Leader-Follower Breaks
Failure mode 1: Leader dies
With async replication, some writes may not have reached any follower. Those writes are lost.
With sync replication, a follower can be promoted to leader without data loss. But during failover (typically 10-60 seconds), writes fail.
Failure mode 2: Split brain
Network partition isolates the leader. Followers think it's dead and elect a new leader. Now you have two leaders accepting writes. When the network heals, you have conflicting data.
Failure mode 3: Replication lag
Under heavy load, followers fall behind. A read from a follower returns stale data. In extreme cases, followers can be minutes or hours behind.
2.2 Multi-Leader (Multi-Master)
How It Works
Multiple nodes accept writes. Each leader replicates to the others.
Writes Writes
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Leader 1 │◀──────▶│ Leader 2 │
└────┬─────┘ └────┬─────┘
│ Replication │
│ ┌────────────────┐│
│ │ ││
▼ ▼ ▼▼
┌──────────┐ ┌──────────┐
│ Follower │ │ Follower │
└──────────┘ └──────────┘
Why Use Multi-Leader?
Geographic distribution: Users in Europe write to the Europe leader; users in US write to the US leader. Each region has low-latency writes.
Datacenter failover: If one datacenter goes down, the other continues accepting writes. No failover delay.
The Conflict Problem
When two leaders accept conflicting writes, what happens?
Scenario: User updates their email on their phone (hits EU leader) and their laptop (hits US leader) at the same time.
EU Leader: UPDATE users SET email='new@eu.com' WHERE id=123
US Leader: UPDATE users SET email='new@us.com' WHERE id=123
Both succeed locally. When they replicate to each other, which one wins?
Conflict resolution strategies:
- Last-write-wins (LWW): Use timestamps; the latest write wins. Simple, but concurrent writes are silently discarded, and clock skew makes the losses unpredictable.
- Merge: For some data types, values can be merged automatically. CRDTs (Conflict-free Replicated Data Types) are designed for this (see the sketch after this list).
- Application-level resolution: Notify the application of the conflict; let it decide.
- Avoid conflicts: Design the system so conflicts are rare. For example, each user's data is primarily written in one region.
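A minimal sketch contrasting the first two strategies. LWWRegister and GCounter are illustrative toys, not taken from any particular library:
from dataclasses import dataclass
from typing import Dict

@dataclass
class LWWRegister:
    """Last-write-wins: the copy with the highest timestamp survives; the other write is lost."""
    value: str
    ts: float

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        return self if self.ts >= other.ts else other

class GCounter:
    """Grow-only counter CRDT: each leader increments its own slot; merging takes per-slot maxima."""
    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def increment(self, node_id: str, n: int = 1) -> None:
        self.counts[node_id] = self.counts.get(node_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
Applied to the email example above, LWW keeps whichever UPDATE carries the later timestamp and silently drops the other. The counter illustrates why CRDTs only solve the problem for data types with a natural merge; a single email field has none, which is why strategies 3 and 4 exist.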
When Multi-Leader Breaks
Conflict resolution is hard: Most teams underestimate the complexity. LWW causes data loss. Custom resolution requires significant engineering.
Operational complexity: Debugging issues across multiple leaders is harder. "Which leader was this written to?" becomes a real question.
Consistency is weak: By design, you accept that replicas diverge temporarily. Some applications can't tolerate this.
2.3 Leaderless (Dynamo-style)
How It Works
No designated leader. Clients write to multiple nodes directly. Reads also go to multiple nodes, and the client resolves differences.
Write
│
┌──────┼──────┐
▼ ▼ ▼
┌─────┐┌─────┐┌─────┐
│Node1││Node2││Node3│
└─────┘└─────┘└─────┘
│ │ │
└──────┼──────┘
│
Read (quorum)
Quorums: The Math of Consistency
With N replicas, you configure:
- W: Number of nodes that must confirm a write
- R: Number of nodes that must respond to a read
If W + R > N, every read is guaranteed to see the latest write (at least one node overlaps).
Example with N=3:
- W=2, R=2: Read and write quorums overlap → strong consistency
- W=1, R=1: No overlap → eventual consistency, but faster
Nodes: [A] [B] [C]
Write (W=2): Write to A ✓, B ✓, C pending
Read (R=2): Read from B ✓, C ✓
B was in both → B has the latest write → read sees it
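A sketch of the quorum rule in code. KVNode is an in-memory stand-in for a storage node; a real client would issue requests in parallel and also perform read repair:
from typing import Dict, List, Optional, Tuple

class KVNode:
    """In-memory stand-in for one replica; stores (version, value) per key."""
    def __init__(self) -> None:
        self.data: Dict[str, Tuple[int, str]] = {}

    def put(self, key: str, version: int, value: str) -> bool:
        self.data[key] = (version, value)
        return True

    def get(self, key: str) -> Optional[Tuple[int, str]]:
        return self.data.get(key)

class QuorumClient:
    def __init__(self, nodes: List[KVNode], w: int, r: int):
        # If w + r > len(nodes), the write set and read set must overlap in at least one node.
        self.nodes, self.w, self.r = nodes, w, r

    def put(self, key: str, version: int, value: str) -> None:
        acks = sum(1 for node in self.nodes if node.put(key, version, value))
        if acks < self.w:
            raise RuntimeError(f"write quorum not met: {acks} < W={self.w}")

    def get(self, key: str) -> Optional[str]:
        replies = [node.get(key) for node in self.nodes[: self.r]]  # first R responders
        versioned = [rep for rep in replies if rep is not None]
        if not versioned:
            return None
        return max(versioned, key=lambda rep: rep[0])[1]  # newest version wins
With N=3, W=2, R=2 (as in the example above), at least one node is in both the write set and the read set, so the newest version is always among the replies. With W=1, R=1 there is no such guarantee, which is the eventual-consistency configuration.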
Sloppy Quorums and Hinted Handoff
What if nodes are unavailable? Strict quorums would fail the request. Sloppy quorums allow writing to any available nodes, with hinted handoff—the temporary node holds the write and forwards it when the proper node recovers.
This improves availability but weakens consistency guarantees.
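A toy sketch of hinted handoff, purely illustrative (the cluster dictionary and node names are assumptions, not any real system's API):
from typing import Dict, List, Optional, Tuple

class HintedHandoffNode:
    """Stand-in node that can hold writes on behalf of an unavailable peer."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.hints: List[Tuple[str, str, str]] = []   # (home_node, key, value)
        self.up = True

    def put(self, key: str, value: str, hint_for: Optional[str] = None) -> None:
        self.data[key] = value
        if hint_for:
            self.hints.append((hint_for, key, value))  # remember which node this really belongs to

    def forward_hints(self, cluster: Dict[str, "HintedHandoffNode"]) -> None:
        remaining = []
        for home, key, value in self.hints:
            if cluster[home].up:
                cluster[home].put(key, value)          # hand the write back to its home node
            else:
                remaining.append((home, key, value))
        self.hints = remaining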
When Leaderless Breaks
Clock skew: Many leaderless systems use timestamps for conflict resolution. Clock skew can cause newer writes to be overwritten by older ones.
Complexity: Clients must handle quorum logic, conflict resolution, and read repair. The complexity shifts to the client or a thick client library.
Tail latency: Waiting on a quorum of nodes means your latency is set by the slowest responder in the quorum, not the average node.
Chapter 3: Replication Lag and Its Consequences
3.1 What Is Replication Lag?
In asynchronous replication, followers don't have the latest data immediately. The delay between a write on the leader and that write appearing on a follower is replication lag.
Normal conditions: Milliseconds to low seconds.
Under load: Can grow to minutes or longer.
Failure conditions: If a follower is disconnected, lag is infinite until it reconnects.
3.2 Anomalies Caused by Replication Lag
Anomaly 1: Read Your Own Writes
User submits a form. The write goes to the leader. User's next request reads from a follower that hasn't received the write yet. User sees the old data and thinks their submission failed.
Time 0: User writes "new email" → Leader
Time 1: User reads profile → Follower (lag: hasn't received write yet)
Time 2: User sees old email, confused
Time 3: Follower receives write
Time 4: User refreshes, sees new email, more confused
Solutions:
- Read from leader for recently-written data
- Track write timestamps; only read from followers caught up to that point
- Sticky sessions: route user to the same follower consistently
Anomaly 2: Monotonic Reads
User makes two read requests. First hits follower A (caught up). Second hits follower B (lagging). User sees data go "backwards in time."
Read 1 → Follower A: Returns version 5
Read 2 → Follower B: Returns version 3
User: "Where did my data go?"
Solution: Sticky sessions or reading from the same replica within a session.
Anomaly 3: Consistent Prefix Reads
Two causally related writes (A, then B) may appear in a different order on different followers, typically because they flowed through different partitions or leaders.
Leader: Write A (9:00), Write B (9:01)
Follower 1: Receives B (9:02), Receives A (9:03)
Read from Follower 1 at 9:02:30:
- Sees B but not A
- B depends on A (e.g., "Reply to message A")
- User sees reply to non-existent message
Solution: Ensure causally related writes go through the same partition/leader, or use logical clocks.
3.3 Measuring and Monitoring Lag
Every replicated system should expose:
- Replication lag in seconds/bytes: How far behind is each follower?
- Lag histogram: Distribution of lag across followers
- Lag alerts: Page when lag exceeds acceptable threshold
-- PostgreSQL: Check replication lag
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
Chapter 4: Consistency Models
Different applications need different consistency guarantees. Choosing the wrong one means either poor performance (too strong) or confusing bugs (too weak).
4.1 Strong Consistency
Every read returns the most recent write. All clients see the same data at the same time.
How to achieve it: Synchronous replication, or read from leader only.
Cost: Higher latency (coordination required), lower availability (leader failure blocks reads).
When you need it:
- Financial transactions
- Inventory counts (avoid overselling)
- Any case where stale reads cause real harm
4.2 Eventual Consistency
If no new writes occur, all replicas will eventually converge to the same value. No guarantee about when.
How to achieve it: Asynchronous replication with no special handling.
Cost: Clients may see stale data, out-of-order updates, or temporary inconsistencies.
When it's acceptable:
- Social media feeds (seeing a post 30 seconds late is fine)
- Analytics dashboards (approximate numbers are acceptable)
- Caches (by definition, allowed to be stale)
4.3 Read-Your-Writes Consistency
A user always sees their own writes immediately. Other users may see stale data.
How to achieve it:
- Route user's reads to the leader for recently-written data
- Track user's last write timestamp; only read from caught-up replicas
When you need it:
- User profiles (user updates settings, expects to see them)
- Document editing (author's changes should appear immediately)
- Any self-referential read after write
4.4 Monotonic Reads
Once a user sees a version of data, they never see an older version on subsequent reads.
How to achieve it: Sticky sessions (same replica per user session).
When you need it:
- Conversation threads (messages shouldn't disappear and reappear)
- Activity timelines (events shouldn't reorder)
4.5 Causal Consistency
If event A causes event B, everyone sees A before B. Concurrent events may be seen in different orders.
How to achieve it: Logical clocks (vector clocks, Lamport timestamps) to track causality.
When you need it:
- Messaging (reply should appear after original message)
- Collaborative editing (operations that depend on each other must apply in order)
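A minimal vector-clock sketch for the messaging case above; it is illustrative only, not a production implementation:
from typing import Dict, Optional

class VectorClock:
    """One counter per node. A happened-before B iff every counter in A is <= B's and A != B."""
    def __init__(self, clock: Optional[Dict[str, int]] = None) -> None:
        self.clock: Dict[str, int] = dict(clock or {})

    def tick(self, node_id: str) -> "VectorClock":
        updated = dict(self.clock)
        updated[node_id] = updated.get(node_id, 0) + 1
        return VectorClock(updated)

    def merge(self, other: "VectorClock") -> "VectorClock":
        keys = set(self.clock) | set(other.clock)
        return VectorClock({k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in keys})

    def happened_before(self, other: "VectorClock") -> bool:
        keys = set(self.clock) | set(other.clock)
        return (all(self.clock.get(k, 0) <= other.clock.get(k, 0) for k in keys)
                and self.clock != other.clock)

# The reply's clock is built by merging the original message's clock, then ticking its own node:
original = VectorClock().tick("node_a")               # message A written on node_a
reply = original.merge(VectorClock()).tick("node_b")  # reply B written on node_b
assert original.happened_before(reply)
# A replica that receives the reply first can hold it back until it has applied every write
# covered by the reply's clock, so no one sees a reply to a message that does not exist yet.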
4.6 Choosing Your Consistency Level
| Use Case | Recommended Level | Why |
|---|---|---|
| Bank balance | Strong | Incorrect balance = real money problems |
| E-commerce inventory | Strong | Overselling costs money and trust |
| Social media feed | Eventual | Late posts don't cause harm |
| User profile (to user) | Read-your-writes | User expects to see their changes |
| User profile (to others) | Eventual | Others can wait a bit |
| Chat messages | Causal | Replies must follow messages |
| View counter | Eventual | Approximate is fine |
| URL shortener reads | Eventual | 1-second stale is fine |
| URL shortener writes | Strong | URL must exist before sharing |
Part II: The Design Challenge
Chapter 5: Extending the URL Shortener with Replication
Yesterday, you designed a partitioned URL shortener. Today, we add replication for availability and read scaling.
5.1 Current State (From Day 1)
┌─────────────────────────────────────┐
│ App Servers │
└───────────────┬─────────────────────┘
│
┌──────▼──────┐
│ Cache │
│ (Redis) │
└──────┬──────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│Partition 0│ │Partition 1│ │Partition 2│
└───────────┘ └───────────┘ └───────────┘
Each partition is a single PostgreSQL instance. If Partition 1 dies, all URLs mapped to it are unavailable.
5.2 Adding Read Replicas
For each partition, we add read replicas:
┌─────────────┐
│ Cache │
└──────┬──────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Partition 0 │ │ Partition 1 │ │ Partition 2 │
│ Primary │ │ Primary │ │ Primary │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
▼ ▼ ▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐┌─────────┐ ┌─────────┐
│Replica 0│ │Replica 1│ │Replica 0│ │Replica 1││Replica 0│ │Replica 1│
└─────────┘ └─────────┘ └─────────┘ └─────────┘└─────────┘ └─────────┘
5.3 Write Path
Writes always go to the primary:
def create_short_url(short_code: str, original_url: str) -> URLMapping:
partition = get_partition(short_code)
primary = get_primary_connection(partition)
result = primary.execute(
"INSERT INTO urls (short_code, original_url, created_at) "
"VALUES (%s, %s, NOW()) RETURNING created_at",
(short_code, original_url)
)
return URLMapping(
short_code=short_code,
original_url=original_url,
created_at=result.created_at
)
5.4 Read Path
Reads can go to any replica:
def redirect(short_code: str) -> Optional[str]:
# Try cache first
cached = cache.get(f"url:{short_code}")
if cached:
return cached
# Cache miss - read from replica
partition = get_partition(short_code)
replica = get_random_replica(partition) # Load balance across replicas
result = replica.execute(
"SELECT original_url FROM urls WHERE short_code = %s",
(short_code,)
)
if result:
cache.setex(f"url:{short_code}", 3600, result.original_url)
return result.original_url
return None
5.5 The Replication Lag Problem
Scenario: User creates a short URL and immediately shares it.
Time 0.000s: User creates short URL "abc123" → Written to Primary 1
Time 0.001s: Primary 1 acknowledges write to user
Time 0.002s: User shares link with friend
Time 0.003s: Friend clicks link → Request routed to Replica 1
Time 0.004s: Replica 1 queries for "abc123" → NOT FOUND (replica lag ~50ms)
Time 0.005s: Friend sees 404 error
Time 0.050s: Replica 1 receives replication of "abc123"
Time 0.100s: Friend retries → Works
The friend saw a 404 for a URL that exists. This is a read-your-writes violation—but for someone else's write.
5.6 Solutions for the Stale Read Problem
Solution 1: Synchronous Replication
Configure PostgreSQL for synchronous replication. Writes don't acknowledge until at least one replica confirms.
-- postgresql.conf on primary
synchronous_commit = on
synchronous_standby_names = 'replica1'
Trade-off: Every write is slower by one network round-trip to the replica. For a URL shortener doing 1,000 creates/second, adding 1ms per write might be acceptable.
Solution 2: Read from Primary for New URLs
After creating a URL, read from primary for a short window:
def create_short_url(short_code: str, original_url: str) -> URLMapping:
result = create_in_primary(short_code, original_url)
# Mark this URL as "just created" for 5 seconds
cache.setex(f"new:{short_code}", 5, "1")
# Also cache the URL itself
cache.setex(f"url:{short_code}", 3600, original_url)
return result
def redirect(short_code: str) -> Optional[str]:
cached = cache.get(f"url:{short_code}")
if cached:
return cached
partition = get_partition(short_code)
# Check if this is a newly created URL
if cache.exists(f"new:{short_code}"):
# Read from primary to ensure we see it
db = get_primary_connection(partition)
else:
# Safe to read from replica
db = get_random_replica(partition)
result = db.execute(...)
...
Trade-off: Slightly more complex logic. Primary gets more reads for new URLs. But new URLs are also likely to be hot (just shared), so they'd probably end up cached quickly anyway.
Solution 3: Write-Through Cache
Cache the URL at creation time, not at first read:
def create_short_url(short_code: str, original_url: str) -> URLMapping:
result = create_in_primary(short_code, original_url)
# Immediately cache - redirects will hit cache, not database
cache.setex(f"url:{short_code}", 3600, original_url)
return result
Trade-off: The cache is now in the write path. If Redis is slow or down, URL creation is affected. But for a URL shortener, this is probably acceptable since Redis is already critical infrastructure.
5.7 Failover Design
What happens when a primary dies?
Automatic Failover
Use a tool like Patroni (for PostgreSQL) or orchestration built into your database:
- Primary stops responding to heartbeats
- Followers detect primary is gone (timeout: 10-30 seconds)
- Followers elect a new primary (consensus among remaining nodes)
- Connection poolers (PgBouncer) are updated to point to new primary
- Old primary, if it recovers, joins as a follower
Before failover:
Primary ──▶ Replica1, Replica2
Primary dies:
After failover:
Replica1 (promoted) ──▶ Replica2
Old Primary (if recovered, joins as replica)
Handling Writes During Failover
During the failover window (10-60 seconds), writes fail. Options:
1. Return errors: Tell users "please retry." Simple but bad UX.
2. Queue writes: Accept writes into a queue, apply when new primary is elected. Complex, risk of inconsistency.
3. Fast failover: Optimize detection and promotion to minimize window. Best practical approach.
For a URL shortener, option 1 is probably fine. Users can retry creating a URL. It's not like a payment that might double-charge.
Handling Reads During Failover
If the primary dies but replicas are healthy, reads continue to work (assuming you don't require reading from primary).
This is a major benefit of read replicas: read availability during primary failures.
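A sketch of the application-side routing that makes this benefit real: redirects keep being served from replicas while creates fail fast during the failover window. The is_healthy() method on connections is an assumption here, not a standard driver API:
import random
from typing import List

class PartitionRouter:
    """Keeps reads on healthy replicas when the primary is gone; surfaces write failures quickly."""
    def __init__(self, primary, replicas: List) -> None:
        self.primary = primary
        self.replicas = replicas

    def read_connection(self):
        healthy = [r for r in self.replicas if r.is_healthy()]
        if healthy:
            return random.choice(healthy)      # redirects continue, possibly slightly stale
        if self.primary.is_healthy():
            return self.primary                # fall back to the primary if all replicas are down
        raise RuntimeError("no healthy node available for reads")

    def write_connection(self):
        if not self.primary.is_healthy():
            # During the 10-60 second failover window, fail fast with a retryable error
            # instead of letting URL-creation requests hang.
            raise RuntimeError("primary unavailable; retry after failover completes")
        return self.primary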
Chapter 6: Replication in Multi-Region Deployments
6.1 The Latency Problem
Your URL shortener is successful. Users are global. But all infrastructure is in us-east-1.
User in Tokyo → us-east-1 → 150ms round trip
User in London → us-east-1 → 80ms round trip
User in São Paulo → us-east-1 → 120ms round trip
For redirects (the hot path), this latency is painful. Users expect near-instant redirects.
6.2 Read Replicas in Multiple Regions
Deploy read replicas in each major region:
┌─────────────────────┐
│ us-east-1 │
│ Primary Cluster │
└──────────┬──────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ eu-west-1 │ │ ap-northeast-1 │ │ sa-east-1 │
│ Read Replicas │ │ Read Replicas │ │ Read Replicas │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Europe Tokyo São Paulo
Reads: Served from the local region (~10ms)
Writes: Go to us-east-1 (~80-150ms)
6.3 Dealing with Cross-Region Replication Lag
Cross-region replication has higher lag than in-region (50-200ms typically).
The problem intensifies: User in Tokyo creates URL, shares with friend in Tokyo. Friend clicks. Local Tokyo replica doesn't have the URL yet.
Solutions:
- Write-through to local cache: When a URL is created, synchronously populate the Tokyo cache (sketched below).
- Route the creator's reads to the primary region: For a few seconds after creation, route that user's reads to us-east-1.
- Accept the edge case: For a URL shortener, the friend can retry in 500ms. It's not ideal but might be acceptable.
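A sketch of the first option. The regional_caches mapping (region name to Redis client) and the table layout are assumptions carried over from Chapter 5:
from typing import Dict

def create_short_url_global(short_code: str, original_url: str,
                            primary_db, regional_caches: Dict[str, "redis.Redis"]) -> None:
    # 1. Write to the primary region's database (e.g., us-east-1).
    primary_db.execute(
        "INSERT INTO urls (short_code, original_url, created_at) VALUES (%s, %s, NOW())",
        (short_code, original_url),
    )
    # 2. Fan the mapping out to every regional cache so local redirects
    #    don't depend on cross-region replication lag (50-200ms).
    for region, cache in regional_caches.items():
        try:
            cache.setex(f"url:{short_code}", 3600, original_url)
        except Exception:
            # A down cache in one region shouldn't fail URL creation; that region
            # simply falls back to its (possibly lagging) local replica.
            pass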
6.4 Multi-Region Active-Active
For write-heavy workloads or true global availability, you might want active-active:
┌─────────────────┐ ┌─────────────────┐
│ us-east-1 │◀──────────▶│ eu-west-1 │
│ Primary │ Bi-dir │ Primary │
│ (Americas) │ Repl. │ (Europe) │
└─────────────────┘ └─────────────────┘
Users in Americas write to us-east-1. Users in Europe write to eu-west-1. Changes replicate both directions.
The conflict problem: Same URL created in both regions simultaneously? With a URL shortener, this is unlikely (short codes are random), but you'd need conflict resolution if it happens.
Simpler alternative for URL shorteners: Keep single-primary, accept the write latency. URL creation is infrequent compared to redirects. A 150ms create is acceptable if redirects are 10ms.
Part III: Advanced Topics
Chapter 7: Replication Internals
Understanding how replication works under the hood helps you debug issues and make informed decisions.
7.1 Write-Ahead Log (WAL) Replication
Most databases (PostgreSQL, MySQL, MongoDB) use log-based replication:
- Every write is first recorded in a sequential log (WAL)
- The log is shipped to replicas
- Replicas apply the log entries to their data
Primary:
Write ─▶ WAL ─▶ Apply to data
│
└──▶ Send to replicas
Replica:
Receive WAL ─▶ Apply to data
Why WAL?
- Sequential writes are fast (disk optimization)
- Log captures exact changes (no interpretation needed)
- Replicas can catch up by replaying log from any point
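A toy, in-memory illustration of the idea. This is nothing like a real WAL format; it only shows the ship-and-replay shape:
from typing import Dict, List, Tuple

class ToyWAL:
    """Append-only log; each entry gets a monotonically increasing sequence number (an 'LSN')."""
    def __init__(self) -> None:
        self.entries: List[Tuple[int, Dict[str, str]]] = []

    def append(self, change: Dict[str, str]) -> int:
        lsn = len(self.entries) + 1
        self.entries.append((lsn, change))
        return lsn

    def since(self, lsn: int) -> List[Tuple[int, Dict[str, str]]]:
        return [entry for entry in self.entries if entry[0] > lsn]

class ToyReplica:
    """Applies log entries in order; catching up is just replaying from the last applied LSN."""
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.applied_lsn = 0

    def apply_from(self, wal: ToyWAL) -> None:
        for lsn, change in wal.since(self.applied_lsn):
            self.data[change["key"]] = change["value"]
            self.applied_lsn = lsn

# Replication lag, measured in log entries, is simply the primary's latest LSN minus the
# replica's applied_lsn: the same quantity pg_wal_lsn_diff reports in bytes.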
7.2 Logical vs Physical Replication
Physical replication: Ship the raw WAL bytes. Replica is byte-for-byte identical to primary. Fast, but tied to same database version.
Logical replication: Ship logical changes (INSERT/UPDATE/DELETE with values). More flexible, can replicate between different versions or even different databases. Slightly slower.
Physical: "Write these bytes to page 1234 at offset 567"
Logical: "INSERT INTO users (id, name) VALUES (1, 'Alice')"
7.3 Replication Slots and Catchup
What if a replica goes offline for an hour?
Without replication slots: Primary might delete old WAL segments (to reclaim disk). Replica comes back and can't catch up—needs full resync.
With replication slots: Primary retains WAL needed by each replica. Replica can always catch up, but disk usage grows if replica is down too long.
-- Monitor replication slot disk usage
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
FROM pg_replication_slots;
Chapter 8: Handling Failover Correctly
Failover sounds simple: promote a replica when the primary dies. In practice, it's full of edge cases.
8.1 The Split-Brain Problem
Network partition isolates the primary:
┌─────────────────────────────────────────┐
│ Network Partition │
│ │
┌──────┴──────┐ ┌────────┴────────┐
│ Primary │ X X X │ Replicas │
│ (isolated) │ │ + Application │
└─────────────┘ └─────────────────┘
│ │
Clients A Clients B
- Clients A continue writing to the old primary
- Replicas can't reach primary, promote a new one
- Clients B write to the new primary
- Network heals: two primaries with conflicting data
8.2 Fencing: The Solution to Split-Brain
Fencing ensures the old primary can't accept writes after failover:
- STONITH (Shoot The Other Node In The Head): Literally power off the old primary via IPMI/BMC.
- Storage fencing: Revoke the old primary's access to shared storage.
- Network fencing: Reconfigure network rules to block the old primary.
- Lease-based: Primary must regularly renew a lease (in ZooKeeper/etcd). If it can't reach the lease store, it stops accepting writes.
# Lease-based fencing pseudocode
class Primary:
def __init__(self):
self.lease = self.acquire_lease()
def accept_write(self, write):
if not self.lease.is_valid():
raise Exception("Lost leadership, rejecting write")
self.execute(write)
def background_loop(self):
while True:
try:
self.lease.renew()
except LeaseRenewalFailed:
self.demote_to_read_only()
sleep(lease_ttl / 3)
8.3 Failover Runbook
A real failover runbook for our URL shortener:
FAILOVER PROCEDURE: Partition X Primary Failure
1. DETECT
- Alert: "Primary X not responding to health checks for 30 seconds"
- Verify: Check if network issue or actual server failure
2. DECIDE
- If network issue: Do NOT failover, wait for network recovery
- If server failure: Proceed with failover
3. FENCE THE OLD PRIMARY
- Remove from load balancer
- If possible, power off or isolate network
4. SELECT NEW PRIMARY
- Choose replica with least replication lag
- Verify replica is healthy
5. PROMOTE
- Run: pg_ctl promote -D /data/postgres
- Verify: SELECT pg_is_in_recovery(); -- should return false
6. RECONFIGURE
- Update connection strings / PgBouncer
- Remaining replicas point to new primary
7. VERIFY
- Test write: Create test URL
- Test read: Query for test URL
- Monitor replication from new primary to replicas
8. CLEANUP
- When old primary recovers: reinitialize as replica
- Post-incident review within 24 hours
ESTIMATED DOWNTIME: 30-60 seconds (writes only; reads continue)
Part IV: Discussion and Trade-offs
Chapter 9: The Hard Questions
These are the questions you should be asking (and answering) in your design sessions.
9.1 "User creates URL, immediately shares it, friend clicks—what happens with async replication?"
This is the core replication lag problem. Your answer should include:
- Acknowledge the problem: Yes, with async replication, the friend might get a 404.
- Quantify the risk: Replication lag is typically <100ms. The friend would have to click within 100ms of the share—unlikely but possible.
- Propose mitigations:
  - Write-through cache at creation time (friend hits cache, not replica)
  - Semi-sync replication (wait for one replica before acknowledging)
  - Accept the edge case (friend retries, it works)
- Make a recommendation: "For a URL shortener, I'd use write-through cache. It's simple, and we already have Redis in the architecture. The URL is cached before the user even finishes sharing."
9.2 "When would you accept stale reads? When not?"
Accept stale reads:
- View counters, like counts (approximate is fine)
- News feeds (seeing a post 30 seconds late doesn't matter)
- Product listings (prices don't change every second)
- Analytics dashboards (real-time accuracy isn't required)
Reject stale reads:
- Account balances (stale balance → wrong financial decisions)
- Inventory counts (stale inventory → overselling)
- Permission checks (stale permissions → security hole)
- User authentication state (stale logout → security issue)
The URL shortener case:
- URL resolution: Can be eventually consistent. A redirect served from a replica that is <100ms behind is harmless.
- URL creation: Must be strongly consistent. User needs confirmation that the URL exists.
- URL deletion: Tricky. If we delete but replicas still serve it, is that a problem? For most URL shorteners, yes—the user deleted it for a reason.
9.3 "What replication factor would you use?"
Replication factor = number of copies of each piece of data
Common choices:
- RF=1: No replication. Don't do this in production.
- RF=2: Can survive 1 node failure. Common for cost-sensitive workloads.
- RF=3: Can survive 1 failure with no degradation (quorum still possible). Standard choice.
- RF=5: Can survive 2 failures. For highly critical data.
For the URL shortener: RF=3 is appropriate. It's standard for production workloads, allows maintenance without risk, and the storage overhead is acceptable.
9.4 "How do you handle a replica that's fallen hours behind?"
Detection: Monitor replication lag. Alert if lag exceeds threshold (e.g., 1 minute).
Diagnosis: Why is it behind?
- Network issue: Fix network, let it catch up.
- Slow disk: Replica can't apply changes fast enough.
- Long-running query: Blocking replication on the replica.
Recovery options:
- Let it catch up: If the primary is retaining WAL, the replica will eventually catch up.
- Remove from rotation: Stop sending reads to this replica until it catches up.
- Re-sync from scratch: If too far behind or WAL is gone, rebuild the replica from a backup.
For URL shortener: Remove from read rotation immediately. A replica hours behind is serving stale data. If it's more than 1 hour behind, consider rebuilding—faster than catching up.
Chapter 10: Session Summary
What You Should Know Now
After this session, you should be able to:
- Explain leader-follower, multi-leader, and leaderless architectures with trade-offs
- Design a replication strategy based on consistency and availability requirements
- Identify replication lag anomalies and their solutions
- Handle failover scenarios with explicit fencing strategies
- Extend a partitioned system with replication (the URL shortener)
Key Trade-offs to Remember
| Decision | Trade-off |
|---|---|
| Sync vs Async replication | Consistency vs Write latency |
| More replicas | Better read scaling vs Higher write replication cost |
| Leader-follower vs Multi-leader | Simplicity vs Multi-region write latency |
| Strong vs Eventual consistency | Correctness vs Performance/Availability |
| Higher replication factor | Better durability vs Storage and write cost |
Questions to Ask in Every Design
- What consistency level does this data actually need?
- What's the acceptable replication lag?
- What happens during failover? How long is the outage?
- Can readers tolerate stale data? For how long?
- What's the strategy for a replica that falls far behind?
Part V: Interview Questions and Answers
Chapter 11: Real-World Interview Scenarios
11.1 Conceptual Questions
Question 1: "Explain the difference between synchronous and asynchronous replication."
Interviewer's Intent: Testing fundamental understanding.
Strong Answer:
"In synchronous replication, the primary waits for at least one replica to confirm the write before acknowledging to the client. This guarantees that if the primary dies, no acknowledged writes are lost. The cost is latency—every write incurs a network round-trip to the replica.
In asynchronous replication, the primary acknowledges immediately and replicates in the background. Writes are faster, but if the primary dies before replicating, those writes are lost.
Most production systems use semi-synchronous: wait for one replica, replicate to others asynchronously. This balances durability with performance. PostgreSQL with synchronous_commit = on and one sync standby is a common configuration."
Question 2: "What is replication lag and why does it matter?"
Interviewer's Intent: Testing understanding of practical implications.
Strong Answer:
"Replication lag is the delay between a write on the primary and that write appearing on a replica. In normal conditions, it's milliseconds. Under load or with network issues, it can grow to seconds or minutes.
It matters because reads from lagging replicas return stale data. This causes anomalies: a user updates their profile, refreshes, and sees the old version because their read hit a lagging replica.
There are specific consistency violations to watch for:
- Read-your-writes: User doesn't see their own recent write
- Monotonic reads: User sees data go backwards (fresher replica, then lagging one)
- Consistent prefix: Causally related writes appear out of order
Mitigations include sticky sessions, reading from primary for recent writes, and write-through caching."
Question 3: "What is split-brain and how do you prevent it?"
Interviewer's Intent: Testing understanding of distributed systems failure modes.
Strong Answer:
"Split-brain occurs when a network partition causes two nodes to both believe they're the primary. Each accepts writes, and when the partition heals, you have conflicting data.
Prevention requires fencing—ensuring the old primary can't accept writes after a new primary is elected. Common approaches:
- STONITH—literally power off the old primary via out-of-band management
- Lease-based leadership—the primary must regularly renew a lease in a consensus system like ZooKeeper. If it can't reach the lease store, it stops accepting writes
- Storage fencing—revoke the old primary's access to shared storage
I'd use lease-based fencing for most applications. It's less invasive than STONITH and works well with cloud infrastructure where IPMI access isn't always available."
Question 4: "Explain the CAP theorem in practical terms."
Interviewer's Intent: Testing ability to translate theory to practice.
Strong Answer:
"CAP says you can't have Consistency, Availability, and Partition tolerance simultaneously. Since network partitions will happen, you're really choosing between consistency and availability during a partition.
In practice, it's not binary—it's a spectrum. For example:
- During normal operation, you can have both C and A
- During a partition, you choose: reject writes (consistent but unavailable) or accept writes (available but potentially inconsistent)
Modern systems let you tune this per operation. DynamoDB lets you choose strongly consistent or eventually consistent reads. Cassandra lets you set quorum levels.
For the URL shortener example: I'd choose availability for reads (serve from any replica, even if stale) but consistency for URL creation (write must be durable before acknowledging). Different operations, different CAP positions."
11.2 Design Questions
Question 5: "Design the replication strategy for a global user profile service."
Interviewer's Intent: Testing end-to-end replication design.
Strong Answer:
"First, let me understand the access patterns. User profiles are read-heavy—users view profiles constantly, update rarely. Reads should be low-latency globally.
Architecture: Leader-follower with regional read replicas.
Primary (us-east-1)
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
EU Replicas Asia Replicas SA Replicas
Writes go to the primary in us-east-1. Users accept the latency for updates since they're infrequent.
Reads go to the nearest regional replica. Replication is asynchronous—users can tolerate seeing their profile update a few hundred milliseconds late.
For read-your-writes consistency: when a user updates their profile, I'd write-through to a global cache (something like Redis Global Datastore or a CDN cache). The next read hits the cache, not the lagging replica.
For cross-region failover: if us-east-1 goes down, I'd fail over to eu-west-1. This requires promotion of a replica to primary and reconfiguring the other replicas. Failover time would be 30-60 seconds with automation.
Replication factor: 2 replicas per region (total of 7 nodes including primary). This gives fault tolerance within each region and allows maintenance without risk."
Question 6: "How would you handle replication for a financial ledger?"
Interviewer's Intent: Testing understanding of strong consistency requirements.
Strong Answer:
"Financial ledgers are the poster child for strong consistency. An incorrect balance or double-spend is unacceptable. I'd design very differently than the profile service.
Replication strategy: Synchronous replication to at least one replica before acknowledging any write. I'd use a consensus protocol (Raft, Paxos) rather than simple leader-follower to guarantee no data loss during failover.
Write ──▶ Primary ──▶ Sync Replica 1 ──▶ ACK
│
└──▶ Async Replica 2, 3 (for read scaling)
Consistency: All reads go to the primary or sync replicas only. I'd never serve a balance from an async replica.
Availability trade-off: During a partition, I'd reject writes rather than risk inconsistency. Financial correctness > availability. Users see an error message, but their money is safe.
Failover: Use consensus-based election. The new leader is guaranteed to have all committed transactions. Fencing is critical—I'd use lease-based fencing with a short TTL (5-10 seconds).
For audit compliance: Keep the replication log (WAL) immutable and archived. This provides a complete history of all transactions for regulatory review.
The key insight: this is exactly the opposite of the profile service. Strong consistency, sacrifice availability, synchronous replication. The requirements drive fundamentally different designs."
11.3 Scenario-Based Questions
Question 7: "Your replica is 2 hours behind. Walk me through how you'd handle this."
Interviewer's Intent: Testing operational skills.
Strong Answer:
"First, I'd remove it from the read rotation immediately. A replica 2 hours behind is serving data that's 2 hours stale—that's unacceptable for most applications.
Then diagnosis:
- Check network: Is there a network issue between primary and replica? I'd check packet loss, latency, bandwidth utilization.
- Check replica health: Is the replica CPU/IO bound? Is something blocking replication apply (long-running queries, locks)?
- Check primary: Is the primary WAL generation unusually high? Maybe there's a bulk load or migration running.
For recovery, it depends on the cause:
- If it's a temporary issue (network blip), and the primary is retaining WAL, I'd let it catch up. Monitor the lag—it should decrease steadily.
- If the replica can't keep up (undersized hardware), I'd either upgrade it or rebuild on better hardware.
- If the primary has already deleted the required WAL (no replication slots), I'd need to rebuild the replica from a fresh backup.
The decision point: if catching up will take longer than rebuilding (say, lag > 24 hours), rebuilding is often faster with modern tooling like pg_basebackup.
Throughout, I'd communicate status to the team and update our incident channel. A degraded replica affects our read capacity and fault tolerance."
Question 8: "Your primary database just died. Walk me through the failover process."
Interviewer's Intent: Testing incident response skills.
Strong Answer:
"Assuming we have automated failover configured (and we should), here's what happens and what I'd verify:
Automated steps:
- Health checks detect primary is unresponsive (typically 10-30 second timeout)
- Failover coordinator (Patroni, RDS, etc.) initiates failover
- Sync replica is promoted to primary
- Connection poolers/load balancers are updated
- Other replicas are reconfigured to follow the new primary
My immediate actions:
- Acknowledge the alert, join incident channel
- Verify failover completed: Can we write to the new primary?
- Check application health: Are services reconnecting successfully?
- Monitor for any data issues: Did we lose any writes?
If automated failover didn't trigger:
- Manually verify primary is actually down (not just monitoring failure)
- Fence the old primary (remove from network/load balancer)
- Identify the most up-to-date replica
- Promote it manually: pg_ctl promote or equivalent
- Update connection configuration
- Verify writes work
Post-incident:
- Investigate why primary died (hardware? OOM? disk full?)
- When old primary is recovered, reinitialize it as a replica
- Write up incident report with timeline and learnings
- If automated failover failed, fix the automation
Expected timeline: 30-60 seconds for automated failover, 5-15 minutes for manual if automation fails."
Question 9: "How would you test your replication and failover setup?"
Interviewer's Intent: Testing proactive reliability thinking.
Strong Answer:
"Testing replication should happen regularly, not just after setup.
Continuous monitoring:
- Replication lag metrics on all replicas (alert if > threshold)
- Replica health checks (is the replica applying WAL?)
- Primary WAL generation rate (unusual spike might indicate issues)
Regular testing (monthly):
- Planned failover drill: Deliberately promote a replica, verify applications handle it gracefully, then fail back.
- Chaos engineering: Kill the primary unexpectedly (in staging), verify automated failover works.
- Lag simulation: Introduce network latency to a replica, verify monitoring detects it and the replica is removed from rotation.
Pre-production validation:
- Before any replication config change, test in staging
- Verify both automated and manual failover work
- Test the scenario where automated failover fails and you need manual intervention
Application-level testing:
- Verify applications reconnect after failover (connection string changes, timeouts)
- Test that applications handle brief write unavailability gracefully
- Verify no data corruption after failover
Documentation:
- Runbook for manual failover should be tested annually by someone who wasn't involved in writing it
- If they can't follow it, the runbook is broken
The goal is confidence: when a real failover happens at 3 AM, you know it will work because you've tested it repeatedly."
11.4 Deep-Dive Questions
Question 10: "Compare replication in PostgreSQL, MySQL, and MongoDB."
Interviewer's Intent: Testing breadth of knowledge.
Strong Answer:
"PostgreSQL uses streaming replication by default. WAL segments are shipped to replicas over a persistent connection. It supports synchronous replication (wait for N replicas), replication slots (ensure WAL is retained for slow replicas), and logical replication (for cross-version or selective replication).
Strengths: Very robust, battle-tested. Synchronous replication is straightforward. Logical replication allows flexible topologies. Weaknesses: Failover isn't built-in—you need external tools like Patroni or Stolon.
MySQL uses binary log replication. Similar concept to WAL, but MySQL has more replication modes: statement-based (replicates SQL), row-based (replicates actual data changes), or mixed. MySQL Group Replication adds consensus-based replication for automatic failover.
Strengths: Flexible replication formats. Group Replication provides built-in automatic failover. Weaknesses: Historically more complex to configure correctly. GTIDs (Global Transaction IDs) make it better, but legacy setups can be tricky.
MongoDB uses replica sets: a primary and multiple secondaries with automatic failover via Raft consensus. Writes go to primary; reads can go to secondaries with various read preferences.
Strengths: Built-in automatic failover, no external tools needed. Read preference is configurable per query. Weaknesses: Write concern configuration can be confusing, and older default settings (acknowledgment from the primary alone) prioritized availability over durability.
For our URL shortener, I'd likely use PostgreSQL with Patroni for automatic failover. It's what I know best, and the tooling is mature. But MongoDB's built-in replica sets would also work well—simpler to set up, automatic failover out of the box."
Question 11: "What is quorum-based replication and when would you use it?"
Interviewer's Intent: Testing understanding of consistency mechanisms.
Strong Answer:
"Quorum replication requires a minimum number of nodes to agree before considering an operation complete. With N nodes, you configure W (write quorum) and R (read quorum). If W + R > N, reads always see the latest write because at least one node participates in both.
Example with N=5:
- W=3, R=3: Majority for both reads and writes. Strong consistency.
- W=1, R=5: Fast writes, expensive reads. All nodes must respond to read.
- W=5, R=1: Expensive writes (all nodes), fast reads (any single node).
Cassandra and Dynamo-style stores use this model (DynamoDB itself exposes a simpler choice between strongly consistent and eventually consistent reads). Common Cassandra settings:
- QUORUM (W=majority, R=majority): Balanced consistency/performance
- ONE (W=1, R=1): Fastest, but eventual consistency
- ALL (W=all, R=all): Strongest, but any node failure blocks operations
When to use it:
- When you need tunable consistency per operation
- When you want availability during partial failures (quorum can succeed without all nodes)
- For distributed databases without a single leader
Trade-offs:
- Higher latency (waiting for multiple nodes)
- Tail latency is the slowest node in the quorum
- More complex client logic
For the URL shortener, I probably wouldn't use quorum-based replication. Leader-follower is simpler and sufficient. But if I were building a globally distributed key-value store where each region needs to accept writes, quorum-based approaches like Dynamo-style replication would be appropriate."
Chapter 12: Interview Preparation Checklist
Before your interview, make sure you can:
Concepts
- Explain sync vs async replication and their trade-offs
- Describe leader-follower, multi-leader, and leaderless architectures
- Define replication lag and its anomalies (read-your-writes, monotonic reads)
- Explain split-brain and fencing mechanisms
Design Skills
- Add replication to a partitioned system design
- Choose consistency levels based on requirements
- Design for multi-region deployment
- Handle failover scenarios explicitly
Operational Knowledge
- Walk through a failover runbook
- Diagnose and resolve a lagging replica
- Test replication and failover procedures
- Monitor replication health
Real Systems
- Know replication in PostgreSQL, MySQL, or MongoDB (at least one well)
- Understand quorum-based systems (Cassandra, DynamoDB)
- Cite real-world examples (how does X company handle this?)
Exercises
Exercise 1: Consistency Level Selection
For each scenario, choose the appropriate consistency level and justify:
- Twitter-like feed: User posts a tweet, expects to see it immediately in their own feed
- Inventory count: E-commerce site showing "Only 3 left in stock!"
- Like counter: Showing "1.2M likes" on a viral post
- Bank transfer: Moving money between accounts
- Password change: User resets password, tries to log in immediately
Exercise 2: Failover Design
Design the failover procedure for a three-node PostgreSQL cluster (1 primary, 2 replicas) running a payment processing system. Consider:
- Detection time
- Fencing mechanism
- Promotion criteria
- Application impact
- Data loss risk
Exercise 3: Replication Lag Mitigation
Your application has a flow: user submits a form, you save to the database, then redirect to a "success" page that reads the data. Users are occasionally seeing "not found" errors. Design three different solutions with trade-offs for each.
Further Reading
- "Designing Data-Intensive Applications" Chapter 5: Replication (the definitive reference)
- Jepsen.io: Kyle Kingsbury's analyses of database consistency under failure
- GitHub's MySQL High Availability: How they handle replication and failover
- Stripe's Online Migrations: Maintaining consistency during schema changes
- Patroni Documentation: Automatic failover for PostgreSQL
Appendix: Code Reference
A.1 Replication Lag Monitoring
from dataclasses import dataclass
from typing import List, Optional
import time
@dataclass
class ReplicaStatus:
hostname: str
lag_bytes: int
lag_seconds: float
state: str # e.g. 'streaming' or 'catchup'; a disconnected replica drops out of pg_stat_replication entirely
class ReplicationMonitor:
def __init__(self, primary_conn, alert_threshold_seconds: float = 30):
self.primary_conn = primary_conn
self.alert_threshold = alert_threshold_seconds
def get_replica_status(self) -> List[ReplicaStatus]:
"""Query primary for replication status."""
query = """
SELECT
client_addr as hostname,
pg_wal_lsn_diff(sent_lsn, replay_lsn) as lag_bytes,
EXTRACT(EPOCH FROM COALESCE(replay_lag, interval '0')) as lag_seconds,  -- replay_lag requires PostgreSQL 10+
state
FROM pg_stat_replication
"""
results = self.primary_conn.execute(query)
return [
ReplicaStatus(
hostname=r['hostname'],
lag_bytes=r['lag_bytes'],
lag_seconds=r['lag_seconds'],
state=r['state']
)
for r in results
]
def check_health(self) -> List[str]:
"""Return list of alerts for unhealthy replicas."""
alerts = []
replicas = self.get_replica_status()
for replica in replicas:
if replica.state != 'streaming':
alerts.append(f"{replica.hostname}: State is {replica.state}, not streaming")
elif replica.lag_seconds > self.alert_threshold:
alerts.append(f"{replica.hostname}: Lag is {replica.lag_seconds:.1f}s (threshold: {self.alert_threshold}s)")
return alerts
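For symmetry with A.2 and A.3, a usage sketch (the primary connection object and the alerting mechanism are assumed to exist elsewhere):
# Usage example
monitor = ReplicationMonitor(primary_conn=primary, alert_threshold_seconds=30)
for alert in monitor.check_health():
    # In production this would page the on-call or post to the incident channel.
    print(f"REPLICATION ALERT: {alert}")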
A.2 Read-Your-Writes Consistency Helper
import time
from typing import Optional
import redis
class ReadYourWritesRouter:
"""
Routes reads to primary for recently-written keys,
otherwise to replicas.
"""
def __init__(
self,
primary_conn,
replica_pool,
cache: redis.Redis,
stickiness_seconds: float = 5.0
):
self.primary = primary_conn
self.replicas = replica_pool
self.cache = cache
self.stickiness_seconds = stickiness_seconds
def record_write(self, key: str):
"""Call this after every write to track the key."""
# Store timestamp of write
self.cache.setex(
f"written:{key}",
int(self.stickiness_seconds),
str(time.time())
)
def get_connection_for_read(self, key: str):
"""
Returns primary connection if key was recently written,
otherwise returns a replica connection.
"""
written_at = self.cache.get(f"written:{key}")
if written_at:
# Recently written - use primary
return self.primary
else:
# Safe to use replica
return self.replicas.get_random()
def read(self, key: str, query: str, params: tuple) -> Optional[dict]:
"""Execute a read with appropriate routing."""
conn = self.get_connection_for_read(key)
return conn.execute(query, params)
def write(self, key: str, query: str, params: tuple) -> dict:
"""Execute a write and track it."""
result = self.primary.execute(query, params)
self.record_write(key)
return result
# Usage example
router = ReadYourWritesRouter(primary, replica_pool, redis_client)
# Write - goes to primary, tracked for 5 seconds
router.write(
key="user:123",
query="UPDATE users SET email = %s WHERE id = %s",
params=("new@example.com", 123)
)
# Immediate read - routed to primary because of recent write
result = router.read(
key="user:123",
query="SELECT * FROM users WHERE id = %s",
params=(123,)
)
# Read after 5 seconds - safe to go to replica
time.sleep(5)
result = router.read(...) # Goes to replica
A.3 Automatic Failover Detection
import time
import threading
from enum import Enum
from typing import Callable, Optional
class NodeState(Enum):
PRIMARY = "primary"
REPLICA = "replica"
FAILED = "failed"
UNKNOWN = "unknown"
class FailoverDetector:
"""
Monitors primary health and triggers failover callback
when primary is determined to be down.
"""
def __init__(
self,
primary_conn,
health_check_interval: float = 1.0,
failure_threshold: int = 3,
on_failover: Optional[Callable] = None
):
self.primary = primary_conn
self.interval = health_check_interval
self.threshold = failure_threshold
self.on_failover = on_failover
self.consecutive_failures = 0
self.state = NodeState.UNKNOWN
self._running = False
self._thread = None
def health_check(self) -> bool:
"""Returns True if primary is healthy."""
try:
result = self.primary.execute("SELECT 1", timeout=1.0)
return result is not None
except Exception:
return False
def _monitor_loop(self):
while self._running:
healthy = self.health_check()
if healthy:
self.consecutive_failures = 0
self.state = NodeState.PRIMARY
else:
self.consecutive_failures += 1
if self.consecutive_failures >= self.threshold:
self.state = NodeState.FAILED
if self.on_failover:
self.on_failover()
# Reset after triggering
self.consecutive_failures = 0
time.sleep(self.interval)
def start(self):
self._running = True
self._thread = threading.Thread(target=self._monitor_loop, daemon=True)
self._thread.start()
def stop(self):
self._running = False
if self._thread:
self._thread.join()
# Usage
def handle_failover():
print("Primary failed! Initiating failover...")
# 1. Fence old primary
# 2. Promote replica
# 3. Update connection strings
# 4. Alert on-call
detector = FailoverDetector(
primary_conn=primary,
health_check_interval=1.0,
failure_threshold=3,
on_failover=handle_failover
)
detector.start()
End of Day 2: Replication Trade-offs
Tomorrow: Day 3 — Rate Limiting at Scale. We'll design a distributed rate limiter that handles 10K requests/second across 50 servers, and confront the question: when a node fails, do you over-allow or under-allow?