Week 1 — Day 2: Replication Trade-offs
System Design Mastery Series
Preface
Yesterday, you learned how to split data across machines. Today, you learn how to copy it.
Replication seems simple: keep copies of data on multiple machines. But the moment you have copies, you face the hardest questions in distributed systems:
- Which copy is the "truth"?
- How long can copies be out of sync?
- What happens when the machine holding the truth dies?
By the end of this session, you won't just know what replication is—you'll understand exactly what you're trading with every replication decision, and you'll be able to design systems that handle these trade-offs explicitly.
Let's begin.
Part I: Foundations
Chapter 1: Why Replication Matters
1.1 The Three Reasons to Replicate
Every replicated system exists for one or more of these reasons:
Reason 1: High Availability
If your database server dies, your application dies with it. With replicas, you can fail over to another copy and keep serving requests.
The math: A single server with 99.9% uptime has about 8.7 hours of downtime per year. Two servers, either of which can serve requests, have combined uptime of 99.9999%—about 31 seconds of downtime per year (assuming independent failures).
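The arithmetic is worth checking yourself. A quick sketch, assuming the 99.9% figure above and fully independent failures:
# Back-of-envelope availability math (assumes failures are independent).
HOURS_PER_YEAR = 365 * 24          # 8,760

def downtime_hours(uptime: float) -> float:
    return HOURS_PER_YEAR * (1 - uptime)

single = 0.999
print(f"{downtime_hours(single):.1f} hours/year")         # ~8.8 hours of downtime

# Two interchangeable servers are down only when both are down at once.
pair = 1 - (1 - single) ** 2                               # 0.999999
print(f"{downtime_hours(pair) * 3600:.1f} seconds/year")   # ~31.5 seconds of downtime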
Reason 2: Read Scaling
A single database might handle 10,000 reads per second. With 5 read replicas, you can handle 50,000 reads per second. Writes still go to one place, but reads spread across all copies.
When this works: Read-heavy workloads (10:1 or higher read-to-write ratio). Most web applications qualify.
When this doesn't work: Write-heavy workloads. Replication doesn't help with write throughput—every write must go to every replica.
Reason 3: Geographic Distribution
Users in Tokyo shouldn't wait for a round trip to a server in Virginia. With replicas in multiple regions, users read from nearby copies.
The physics: Light travels ~200km per millisecond in fiber. Tokyo to Virginia is ~11,000km = ~55ms minimum latency each way. A local replica reduces this to <10ms.
1.2 The Fundamental Trade-off
Here's the truth about replication that every engineer must internalize:
You cannot have instantaneous, consistent replication without sacrificing availability or performance.
This isn't a limitation of current technology. It's physics and mathematics. The CAP theorem formalizes this, but the intuition is simple:
- Keeping replicas perfectly in sync requires coordination
- Coordination takes time (network round trips)
- During that time, you're either waiting (slower) or serving potentially stale data (inconsistent)
Every replication strategy is a different point on this trade-off spectrum. Your job is to choose the right point for your system.
Chapter 2: Replication Architectures
There are three fundamental architectures for replication. Each makes different trade-offs.
2.1 Leader-Follower (Primary-Replica)
How It Works
One node is the leader (also called primary or master). All writes go to the leader. The leader then sends changes to followers (also called replicas or secondaries). Reads can go to the leader or any follower.
Writes
│
▼
┌──────────┐
│ Leader │
└────┬─────┘
│ Replication stream
┌─────┼─────┐
▼ ▼ ▼
┌────────┐┌────────┐┌────────┐
│Follower││Follower││Follower│
└────────┘└────────┘└────────┘
▲ ▲ ▲
│ │ │
Reads (distributed)
Synchronous vs Asynchronous
Synchronous replication: Leader waits for follower(s) to confirm the write before acknowledging to the client.
Client ──Write──▶ Leader ──Replicate──▶ Follower
│ │
│◀──────Confirm────────┤
│
Client ◀───ACK─────┤
Pros:
- Followers are always up-to-date
- No data loss on leader failure
Cons:
- Write latency includes network round-trip to follower
- If follower is slow or dead, writes stall
Asynchronous replication: Leader acknowledges to client immediately, replicates in the background.
Client ──Write──▶ Leader ──ACK──▶ Client
│
│ (background)
▼
Follower
Pros:
- Fast writes (no waiting for followers)
- Followers being slow doesn't affect clients
Cons:
- Followers may lag behind leader
- Data loss possible on leader failure (unreplicated writes)
Semi-Synchronous: The Practical Middle Ground
Most production systems use semi-synchronous replication: wait for at least one follower to confirm, then acknowledge to client. Other followers replicate asynchronously.
Client ──Write──▶ Leader ──Replicate──▶ Follower 1 (sync)
│ │
│◀──────Confirm────────┤
│
Client ◀───ACK─────┤
│
▼ (async)
Follower 2, 3, ...
This gives you:
- Durability: At least one replica has every committed write
- Performance: Only wait for the fastest follower
- Availability: Can tolerate slow followers
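To make the acknowledgment rule concrete, here is a minimal sketch of a semi-synchronous write path. The Follower stub and its replicate() method are hypothetical stand-ins, not a real database API:
import concurrent.futures
from typing import List

class Follower:
    """Hypothetical follower stub; a real one ships the record over the network and applies it."""
    def replicate(self, record: bytes) -> None:
        pass

class SemiSyncLeader:
    def __init__(self, followers: List[Follower]):
        self.followers = followers
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(followers)))

    def apply_locally(self, record: bytes) -> None:
        pass  # placeholder for appending to the leader's own WAL and applying the change

    def write(self, record: bytes) -> None:
        self.apply_locally(record)
        futures = [self.pool.submit(f.replicate, record) for f in self.followers]
        # Semi-sync: block only until the FIRST follower confirms, then acknowledge the client.
        done, _ = concurrent.futures.wait(
            futures, timeout=1.0, return_when=concurrent.futures.FIRST_COMPLETED
        )
        if not done:
            raise TimeoutError("no follower confirmed in time; do not acknowledge the write")
        # The remaining followers keep replicating in the background (asynchronously).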
When Leader-Follower Breaks
Failure mode 1: Leader dies
With async replication, some writes may not have reached any follower. Those writes are lost.
With sync replication, a follower can be promoted to leader without data loss. But during failover (typically 10-60 seconds), writes fail.
Failure mode 2: Split brain
Network partition isolates the leader. Followers think it's dead and elect a new leader. Now you have two leaders accepting writes. When the network heals, you have conflicting data.
Failure mode 3: Replication lag
Under heavy load, followers fall behind. A read from a follower returns stale data. In extreme cases, followers can be minutes or hours behind.
2.2 Multi-Leader (Multi-Master)
How It Works
Multiple nodes accept writes. Each leader replicates to the others.
Writes Writes
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Leader 1 │◀──────▶│ Leader 2 │
└────┬─────┘ └────┬─────┘
│ Replication │
│ ┌────────────────┐│
│ │ ││
▼ ▼ ▼▼
┌──────────┐ ┌──────────┐
│ Follower │ │ Follower │
└──────────┘ └──────────┘
Why Use Multi-Leader?
Geographic distribution: Users in Europe write to the Europe leader; users in US write to the US leader. Each region has low-latency writes.
Datacenter failover: If one datacenter goes down, the other continues accepting writes. No failover delay.
The Conflict Problem
When two leaders accept conflicting writes, what happens?
Scenario: User updates their email on their phone (hits EU leader) and their laptop (hits US leader) at the same time.
EU Leader: UPDATE users SET email='new@eu.com' WHERE id=123
US Leader: UPDATE users SET email='new@us.com' WHERE id=123
Both succeed locally. When they replicate to each other, which one wins?
Conflict resolution strategies:
- Last-write-wins (LWW): Use timestamps; the latest write wins. Simple, but concurrent writes are silently discarded, and clock skew makes the losses unpredictable.
- Merge: For some data types, values can be merged automatically. CRDTs (Conflict-free Replicated Data Types) are designed for this (see the sketch after this list).
- Application-level resolution: Notify the application of the conflict; let it decide.
- Avoid conflicts: Design the system so conflicts are rare. For example, each user's data is primarily written in one region.
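A minimal sketch contrasting the first two strategies. LWWRegister and GCounter are illustrative toys, not taken from any particular library:
from dataclasses import dataclass
from typing import Dict

@dataclass
class LWWRegister:
    """Last-write-wins: the copy with the highest timestamp survives; the other write is lost."""
    value: str
    ts: float

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        return self if self.ts >= other.ts else other

class GCounter:
    """Grow-only counter CRDT: each leader increments its own slot; merging takes per-slot maxima."""
    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def increment(self, node_id: str, n: int = 1) -> None:
        self.counts[node_id] = self.counts.get(node_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
Applied to the email example above, LWW keeps whichever UPDATE carries the later timestamp and silently drops the other. The counter illustrates why CRDTs only solve the problem for data types with a natural merge; a single email field has none, which is why strategies 3 and 4 exist.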
When Multi-Leader Breaks
Conflict resolution is hard: Most teams underestimate the complexity. LWW causes data loss. Custom resolution requires significant engineering.
Operational complexity: Debugging issues across multiple leaders is harder. "Which leader was this written to?" becomes a real question.
Consistency is weak: By design, you accept that replicas diverge temporarily. Some applications can't tolerate this.
2.3 Leaderless (Dynamo-style)
How It Works
No designated leader. Clients write to multiple nodes directly. Reads also go to multiple nodes, and the client resolves differences.
Write
│
┌──────┼──────┐
▼ ▼ ▼
┌─────┐┌─────┐┌─────┐
│Node1││Node2││Node3│
└─────┘└─────┘└─────┘
│ │ │
└──────┼──────┘
│
Read (quorum)
Quorums: The Math of Consistency
With N replicas, you configure:
- W: Number of nodes that must confirm a write
- R: Number of nodes that must respond to a read
If W + R > N, every read is guaranteed to see the latest write (at least one node overlaps).
Example with N=3:
- W=2, R=2: Read and write quorums overlap → strong consistency
- W=1, R=1: No overlap → eventual consistency, but faster
Nodes: [A] [B] [C]
Write (W=2): Write to A ✓, B ✓, C pending
Read (R=2): Read from B ✓, C ✓
B was in both → B has the latest write → read sees it
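A sketch of the quorum rule in code. KVNode is an in-memory stand-in for a storage node; a real client would issue requests in parallel and also perform read repair:
from typing import Dict, List, Optional, Tuple

class KVNode:
    """In-memory stand-in for one replica; stores (version, value) per key."""
    def __init__(self) -> None:
        self.data: Dict[str, Tuple[int, str]] = {}

    def put(self, key: str, version: int, value: str) -> bool:
        self.data[key] = (version, value)
        return True

    def get(self, key: str) -> Optional[Tuple[int, str]]:
        return self.data.get(key)

class QuorumClient:
    def __init__(self, nodes: List[KVNode], w: int, r: int):
        # If w + r > len(nodes), the write set and read set must overlap in at least one node.
        self.nodes, self.w, self.r = nodes, w, r

    def put(self, key: str, version: int, value: str) -> None:
        acks = sum(1 for node in self.nodes if node.put(key, version, value))
        if acks < self.w:
            raise RuntimeError(f"write quorum not met: {acks} < W={self.w}")

    def get(self, key: str) -> Optional[str]:
        replies = [node.get(key) for node in self.nodes[: self.r]]  # first R responders
        versioned = [rep for rep in replies if rep is not None]
        if not versioned:
            return None
        return max(versioned, key=lambda rep: rep[0])[1]  # newest version wins
With N=3, W=2, R=2 (as in the example above), at least one node is in both the write set and the read set, so the newest version is always among the replies. With W=1, R=1 there is no such guarantee, which is the eventual-consistency configuration.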
Sloppy Quorums and Hinted Handoff
What if nodes are unavailable? Strict quorums would fail the request. Sloppy quorums allow writing to any available nodes, with hinted handoff—the temporary node holds the write and forwards it when the proper node recovers.
This improves availability but weakens consistency guarantees.
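A toy sketch of hinted handoff, purely illustrative (the cluster dictionary and node names are assumptions, not any real system's API):
from typing import Dict, List, Optional, Tuple

class HintedHandoffNode:
    """Stand-in node that can hold writes on behalf of an unavailable peer."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.hints: List[Tuple[str, str, str]] = []   # (home_node, key, value)
        self.up = True

    def put(self, key: str, value: str, hint_for: Optional[str] = None) -> None:
        self.data[key] = value
        if hint_for:
            self.hints.append((hint_for, key, value))  # remember which node this really belongs to

    def forward_hints(self, cluster: Dict[str, "HintedHandoffNode"]) -> None:
        remaining = []
        for home, key, value in self.hints:
            if cluster[home].up:
                cluster[home].put(key, value)          # hand the write back to its home node
            else:
                remaining.append((home, key, value))
        self.hints = remaining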
When Leaderless Breaks
Clock skew: Many leaderless systems use timestamps for conflict resolution. Clock skew can cause newer writes to be overwritten by older ones.
Complexity: Clients must handle quorum logic, conflict resolution, and read repair. The complexity shifts to the client or a thick client library.
Tail latency: Waiting on a quorum of nodes means your latency is set by the slowest responder in the quorum, not the average node.
Chapter 3: Replication Lag and Its Consequences
3.1 What Is Replication Lag?
In asynchronous replication, followers don't have the latest data immediately. The delay between a write on the leader and that write appearing on a follower is replication lag.
Normal conditions: Milliseconds to low seconds.
Under load: Can grow to minutes or longer.
Failure conditions: If a follower is disconnected, lag is infinite until it reconnects.
3.2 Anomalies Caused by Replication Lag
Anomaly 1: Read Your Own Writes
User submits a form. The write goes to the leader. User's next request reads from a follower that hasn't received the write yet. User sees the old data and thinks their submission failed.
Time 0: User writes "new email" → Leader
Time 1: User reads profile → Follower (lag: hasn't received write yet)
Time 2: User sees old email, confused
Time 3: Follower receives write
Time 4: User refreshes, sees new email, more confused
Solutions:
- Read from leader for recently-written data
- Track write timestamps; only read from followers caught up to that point
- Sticky sessions: route user to the same follower consistently
Anomaly 2: Monotonic Reads
User makes two read requests. First hits follower A (caught up). Second hits follower B (lagging). User sees data go "backwards in time."
Read 1 → Follower A: Returns version 5
Read 2 → Follower B: Returns version 3
User: "Where did my data go?"
Solution: Sticky sessions or reading from the same replica within a session.
Anomaly 3: Consistent Prefix Reads
Two causally related writes (A, then B) may appear in a different order on different followers, typically because they flowed through different partitions or leaders.
Leader: Write A (9:00), Write B (9:01)
Follower 1: Receives B (9:02), Receives A (9:03)
Read from Follower 1 at 9:02:30:
- Sees B but not A
- B depends on A (e.g., "Reply to message A")
- User sees reply to non-existent message
Solution: Ensure causally related writes go through the same partition/leader, or use logical clocks.
3.3 Measuring and Monitoring Lag
Every replicated system should expose:
- Replication lag in seconds/bytes: How far behind is each follower?
- Lag histogram: Distribution of lag across followers
- Lag alerts: Page when lag exceeds acceptable threshold
-- PostgreSQL: Check replication lag
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
Chapter 4: Consistency Models
Different applications need different consistency guarantees. Choosing the wrong one means either poor performance (too strong) or confusing bugs (too weak).
4.1 Strong Consistency
Every read returns the most recent write. All clients see the same data at the same time.
How to achieve it: Synchronous replication, or read from leader only.
Cost: Higher latency (coordination required), lower availability (leader failure blocks reads).
When you need it:
- Financial transactions
- Inventory counts (avoid overselling)
- Any case where stale reads cause real harm
4.2 Eventual Consistency
If no new writes occur, all replicas will eventually converge to the same value. No guarantee about when.
How to achieve it: Asynchronous replication with no special handling.
Cost: Clients may see stale data, out-of-order updates, or temporary inconsistencies.
When it's acceptable:
- Social media feeds (seeing a post 30 seconds late is fine)
- Analytics dashboards (approximate numbers are acceptable)
- Caches (by definition, allowed to be stale)
4.3 Read-Your-Writes Consistency
A user always sees their own writes immediately. Other users may see stale data.
How to achieve it:
- Route user's reads to the leader for recently-written data
- Track user's last write timestamp; only read from caught-up replicas
When you need it:
- User profiles (user updates settings, expects to see them)
- Document editing (author's changes should appear immediately)
- Any self-referential read after write
4.4 Monotonic Reads
Once a user sees a version of data, they never see an older version on subsequent reads.
How to achieve it: Sticky sessions (same replica per user session).
When you need it:
- Conversation threads (messages shouldn't disappear and reappear)
- Activity timelines (events shouldn't reorder)
4.5 Causal Consistency
If event A causes event B, everyone sees A before B. Concurrent events may be seen in different orders.
How to achieve it: Logical clocks (vector clocks, Lamport timestamps) to track causality.
When you need it:
- Messaging (reply should appear after original message)
- Collaborative editing (operations that depend on each other must apply in order)
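A minimal vector-clock sketch for the messaging case above; it is illustrative only, not a production implementation:
from typing import Dict, Optional

class VectorClock:
    """One counter per node. A happened-before B iff every counter in A is <= B's and A != B."""
    def __init__(self, clock: Optional[Dict[str, int]] = None) -> None:
        self.clock: Dict[str, int] = dict(clock or {})

    def tick(self, node_id: str) -> "VectorClock":
        updated = dict(self.clock)
        updated[node_id] = updated.get(node_id, 0) + 1
        return VectorClock(updated)

    def merge(self, other: "VectorClock") -> "VectorClock":
        keys = set(self.clock) | set(other.clock)
        return VectorClock({k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in keys})

    def happened_before(self, other: "VectorClock") -> bool:
        keys = set(self.clock) | set(other.clock)
        return (all(self.clock.get(k, 0) <= other.clock.get(k, 0) for k in keys)
                and self.clock != other.clock)

# The reply's clock is built by merging the original message's clock, then ticking its own node:
original = VectorClock().tick("node_a")               # message A written on node_a
reply = original.merge(VectorClock()).tick("node_b")  # reply B written on node_b
assert original.happened_before(reply)
# A replica that receives the reply first can hold it back until it has applied every write
# covered by the reply's clock, so no one sees a reply to a message that does not exist yet.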
4.6 Choosing Your Consistency Level
| Use Case | Recommended Level | Why |
|---|---|---|
| Bank balance | Strong | Incorrect balance = real money problems |
| E-commerce inventory | Strong | Overselling costs money and trust |
| Social media feed | Eventual | Late posts don't cause harm |
| User profile (to user) | Read-your-writes | User expects to see their changes |
| User profile (to others) | Eventual | Others can wait a bit |
| Chat messages | Causal | Replies must follow messages |
| View counter | Eventual | Approximate is fine |
| URL shortener reads | Eventual | 1-second stale is fine |
| URL shortener writes | Strong | URL must exist before sharing |
Part II: The Design Challenge
Chapter 5: Extending the URL Shortener with Replication
Yesterday, you designed a partitioned URL shortener. Today, we add replication for availability and read scaling.
5.1 Current State (From Day 1)
┌─────────────────────────────────────┐
│ App Servers │
└───────────────┬─────────────────────┘
│
┌──────▼──────┐
│ Cache │
│ (Redis) │
└──────┬──────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│Partition 0│ │Partition 1│ │Partition 2│
└───────────┘ └───────────┘ └───────────┘
Each partition is a single PostgreSQL instance. If Partition 1 dies, all URLs mapped to it are unavailable.
5.2 Adding Read Replicas
For each partition, we add read replicas:
┌─────────────┐
│ Cache │
└──────┬──────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Partition 0 │ │ Partition 1 │ │ Partition 2 │
│ Primary │ │ Primary │ │ Primary │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
▼ ▼ ▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐┌─────────┐ ┌─────────┐
│Replica 0│ │Replica 1│ │Replica 0│ │Replica 1││Replica 0│ │Replica 1│
└─────────┘ └─────────┘ └─────────┘ └─────────┘└─────────┘ └─────────┘
5.3 Write Path
Writes always go to the primary:
def create_short_url(short_code: str, original_url: str) -> URLMapping:
partition = get_partition(short_code)
primary = get_primary_connection(partition)
result = primary.execute(
"INSERT INTO urls (short_code, original_url, created_at) "
"VALUES (%s, %s, NOW()) RETURNING created_at",
(short_code, original_url)
)
return URLMapping(
short_code=short_code,
original_url=original_url,
created_at=result.created_at
)
5.4 Read Path
Reads can go to any replica:
def redirect(short_code: str) -> Optional[str]:
# Try cache first
cached = cache.get(f"url:{short_code}")
if cached:
return cached
# Cache miss - read from replica
partition = get_partition(short_code)
replica = get_random_replica(partition) # Load balance across replicas
result = replica.execute(
"SELECT original_url FROM urls WHERE short_code = %s",
(short_code,)
)
if result:
cache.setex(f"url:{short_code}", 3600, result.original_url)
return result.original_url
return None
5.5 The Replication Lag Problem
Scenario: User creates a short URL and immediately shares it.
Time 0.000s: User creates short URL "abc123" → Written to Primary 1
Time 0.001s: Primary 1 acknowledges write to user
Time 0.002s: User shares link with friend
Time 0.003s: Friend clicks link → Request routed to Replica 1
Time 0.004s: Replica 1 queries for "abc123" → NOT FOUND (replica lag ~50ms)
Time 0.005s: Friend sees 404 error
Time 0.050s: Replica 1 receives replication of "abc123"
Time 0.100s: Friend retries → Works
The friend saw a 404 for a URL that exists. This is a read-your-writes violation—but for someone else's write.
5.6 Solutions for the Stale Read Problem
Solution 1: Synchronous Replication
Configure PostgreSQL for synchronous replication. Writes don't acknowledge until at least one replica confirms.
-- postgresql.conf on primary
synchronous_commit = on
synchronous_standby_names = 'replica1'
Trade-off: Every write is slower by one network round-trip to the replica. For a URL shortener doing 1,000 creates/second, adding 1ms per write might be acceptable.
Solution 2: Read from Primary for New URLs
After creating a URL, read from primary for a short window:
def create_short_url(short_code: str, original_url: str) -> URLMapping:
result = create_in_primary(short_code, original_url)
# Mark this URL as "just created" for 5 seconds
cache.setex(f"new:{short_code}", 5, "1")
# Also cache the URL itself
cache.setex(f"url:{short_code}", 3600, original_url)
return result
def redirect(short_code: str) -> Optional[str]:
cached = cache.get(f"url:{short_code}")
if cached:
return cached
partition = get_partition(short_code)
# Check if this is a newly created URL
if cache.exists(f"new:{short_code}"):
# Read from primary to ensure we see it
db = get_primary_connection(partition)
else:
# Safe to read from replica
db = get_random_replica(partition)
result = db.execute(...)
...
Trade-off: Slightly more complex logic. Primary gets more reads for new URLs. But new URLs are also likely to be hot (just shared), so they'd probably end up cached quickly anyway.
Solution 3: Write-Through Cache
Cache the URL at creation time, not at first read:
def create_short_url(short_code: str, original_url: str) -> URLMapping:
result = create_in_primary(short_code, original_url)
# Immediately cache - redirects will hit cache, not database
cache.setex(f"url:{short_code}", 3600, original_url)
return result
Trade-off: The cache is now in the write path. If Redis is slow or down, URL creation is affected. But for a URL shortener, this is probably acceptable since Redis is already critical infrastructure.
5.7 Failover Design
What happens when a primary dies?
Automatic Failover
Use a tool like Patroni (for PostgreSQL) or orchestration built into your database:
- Primary stops responding to heartbeats
- Followers detect primary is gone (timeout: 10-30 seconds)
- Followers elect a new primary (consensus among remaining nodes)
- Connection poolers (PgBouncer) are updated to point to new primary
- Old primary, if it recovers, joins as a follower
Before failover:
Primary ──▶ Replica1, Replica2
Primary dies:
After failover:
Replica1 (promoted) ──▶ Replica2
Old Primary (if recovered, joins as replica)
Handling Writes During Failover
During the failover window (10-60 seconds), writes fail. Options:
1. Return errors: Tell users "please retry." Simple but bad UX.
2. Queue writes: Accept writes into a queue, apply when new primary is elected. Complex, risk of inconsistency.
3. Fast failover: Optimize detection and promotion to minimize window. Best practical approach.
For a URL shortener, option 1 is probably fine. Users can retry creating a URL. It's not like a payment that might double-charge.
Handling Reads During Failover
If the primary dies but replicas are healthy, reads continue to work (assuming you don't require reading from primary).
This is a major benefit of read replicas: read availability during primary failures.
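A sketch of the application-side routing that makes this benefit real: redirects keep being served from replicas while creates fail fast during the failover window. The is_healthy() method on connections is an assumption here, not a standard driver API:
import random
from typing import List

class PartitionRouter:
    """Keeps reads on healthy replicas when the primary is gone; surfaces write failures quickly."""
    def __init__(self, primary, replicas: List) -> None:
        self.primary = primary
        self.replicas = replicas

    def read_connection(self):
        healthy = [r for r in self.replicas if r.is_healthy()]
        if healthy:
            return random.choice(healthy)      # redirects continue, possibly slightly stale
        if self.primary.is_healthy():
            return self.primary                # fall back to the primary if all replicas are down
        raise RuntimeError("no healthy node available for reads")

    def write_connection(self):
        if not self.primary.is_healthy():
            # During the 10-60 second failover window, fail fast with a retryable error
            # instead of letting URL-creation requests hang.
            raise RuntimeError("primary unavailable; retry after failover completes")
        return self.primary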
Chapter 6: Replication in Multi-Region Deployments
6.1 The Latency Problem
Your URL shortener is successful. Users are global. But all infrastructure is in us-east-1.
User in Tokyo → us-east-1 → 150ms round trip
User in London → us-east-1 → 80ms round trip
User in São Paulo → us-east-1 → 120ms round trip
For redirects (the hot path), this latency is painful. Users expect near-instant redirects.
6.2 Read Replicas in Multiple Regions
Deploy read replicas in each major region:
┌─────────────────────┐
│ us-east-1 │
│ Primary Cluster │
└──────────┬──────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ eu-west-1 │ │ ap-northeast-1 │ │ sa-east-1 │
│ Read Replicas │ │ Read Replicas │ │ Read Replicas │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Europe Tokyo São Paulo
Reads: Served from the local region (~10ms)
Writes: Go to us-east-1 (~80-150ms)
6.3 Dealing with Cross-Region Replication Lag
Cross-region replication has higher lag than in-region (50-200ms typically).
The problem intensifies: User in Tokyo creates URL, shares with friend in Tokyo. Friend clicks. Local Tokyo replica doesn't have the URL yet.
Solutions:
- Write-through to local cache: When a URL is created, synchronously populate the Tokyo cache (sketched below).
- Route the creator's reads to the primary region: For a few seconds after creation, route that user's reads to us-east-1.
- Accept the edge case: For a URL shortener, the friend can retry in 500ms. It's not ideal but might be acceptable.
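A sketch of the first option. The regional_caches mapping (region name to Redis client) and the table layout are assumptions carried over from Chapter 5:
from typing import Dict

def create_short_url_global(short_code: str, original_url: str,
                            primary_db, regional_caches: Dict[str, "redis.Redis"]) -> None:
    # 1. Write to the primary region's database (e.g., us-east-1).
    primary_db.execute(
        "INSERT INTO urls (short_code, original_url, created_at) VALUES (%s, %s, NOW())",
        (short_code, original_url),
    )
    # 2. Fan the mapping out to every regional cache so local redirects
    #    don't depend on cross-region replication lag (50-200ms).
    for region, cache in regional_caches.items():
        try:
            cache.setex(f"url:{short_code}", 3600, original_url)
        except Exception:
            # A down cache in one region shouldn't fail URL creation; that region
            # simply falls back to its (possibly lagging) local replica.
            pass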
6.4 Multi-Region Active-Active
For write-heavy workloads or true global availability, you might want active-active:
┌─────────────────┐ ┌─────────────────┐
│ us-east-1 │◀──────────▶│ eu-west-1 │
│ Primary │ Bi-dir │ Primary │
│ (Americas) │ Repl. │ (Europe) │
└─────────────────┘ └─────────────────┘
Users in Americas write to us-east-1. Users in Europe write to eu-west-1. Changes replicate both directions.
The conflict problem: Same URL created in both regions simultaneously? With a URL shortener, this is unlikely (short codes are random), but you'd need conflict resolution if it happens.
Simpler alternative for URL shorteners: Keep single-primary, accept the write latency. URL creation is infrequent compared to redirects. A 150ms create is acceptable if redirects are 10ms.
Part III: Advanced Topics
Chapter 7: Replication Internals
Understanding how replication works under the hood helps you debug issues and make informed decisions.
7.1 Write-Ahead Log (WAL) Replication
Most databases (PostgreSQL, MySQL, MongoDB) use log-based replication:
- Every write is first recorded in a sequential log (WAL)
- The log is shipped to replicas
- Replicas apply the log entries to their data
Primary:
Write ─▶ WAL ─▶ Apply to data
│
└──▶ Send to replicas
Replica:
Receive WAL ─▶ Apply to data
Why WAL?
- Sequential writes are fast (disk optimization)
- Log captures exact changes (no interpretation needed)
- Replicas can catch up by replaying log from any point
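A toy, in-memory illustration of the idea. This is nothing like a real WAL format; it only shows the ship-and-replay shape:
from typing import Dict, List, Tuple

class ToyWAL:
    """Append-only log; each entry gets a monotonically increasing sequence number (an 'LSN')."""
    def __init__(self) -> None:
        self.entries: List[Tuple[int, Dict[str, str]]] = []

    def append(self, change: Dict[str, str]) -> int:
        lsn = len(self.entries) + 1
        self.entries.append((lsn, change))
        return lsn

    def since(self, lsn: int) -> List[Tuple[int, Dict[str, str]]]:
        return [entry for entry in self.entries if entry[0] > lsn]

class ToyReplica:
    """Applies log entries in order; catching up is just replaying from the last applied LSN."""
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.applied_lsn = 0

    def apply_from(self, wal: ToyWAL) -> None:
        for lsn, change in wal.since(self.applied_lsn):
            self.data[change["key"]] = change["value"]
            self.applied_lsn = lsn

# Replication lag, measured in log entries, is simply the primary's latest LSN minus the
# replica's applied_lsn: the same quantity pg_wal_lsn_diff reports in bytes.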
7.2 Logical vs Physical Replication
Physical replication: Ship the raw WAL bytes. Replica is byte-for-byte identical to primary. Fast, but tied to same database version.
Logical replication: Ship logical changes (INSERT/UPDATE/DELETE with values). More flexible, can replicate between different versions or even different databases. Slightly slower.
Physical: "Write these bytes to page 1234 at offset 567"
Logical: "INSERT INTO users (id, name) VALUES (1, 'Alice')"
7.3 Replication Slots and Catchup
What if a replica goes offline for an hour?
Without replication slots: Primary might delete old WAL segments (to reclaim disk). Replica comes back and can't catch up—needs full resync.
With replication slots: Primary retains WAL needed by each replica. Replica can always catch up, but disk usage grows if replica is down too long.
-- Monitor replication slot disk usage
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
FROM pg_replication_slots;
Chapter 8: Handling Failover Correctly
Failover sounds simple: promote a replica when the primary dies. In practice, it's full of edge cases.
8.1 The Split-Brain Problem
Network partition isolates the primary:
┌─────────────────────────────────────────┐
│ Network Partition │
│ │
┌──────┴──────┐ ┌────────┴────────┐
│ Primary │ X X X │ Replicas │
│ (isolated) │ │ + Application │
└─────────────┘ └─────────────────┘
│ │
Clients A Clients B
- Clients A continue writing to the old primary
- Replicas can't reach primary, promote a new one
- Clients B write to the new primary
- Network heals: two primaries with conflicting data
8.2 Fencing: The Solution to Split-Brain
Fencing ensures the old primary can't accept writes after failover:
- STONITH (Shoot The Other Node In The Head): Literally power off the old primary via IPMI/BMC.
- Storage fencing: Revoke the old primary's access to shared storage.
- Network fencing: Reconfigure network rules to block the old primary.
- Lease-based: Primary must regularly renew a lease (in ZooKeeper/etcd). If it can't reach the lease store, it stops accepting writes.
# Lease-based fencing pseudocode
class Primary:
def __init__(self):
self.lease = self.acquire_lease()
def accept_write(self, write):
if not self.lease.is_valid():
raise Exception("Lost leadership, rejecting write")
self.execute(write)
def background_loop(self):
while True:
try:
self.lease.renew()
except LeaseRenewalFailed:
self.demote_to_read_only()
sleep(lease_ttl / 3)
8.3 Failover Runbook
A real failover runbook for our URL shortener:
FAILOVER PROCEDURE: Partition X Primary Failure
1. DETECT
- Alert: "Primary X not responding to health checks for 30 seconds"
- Verify: Check if network issue or actual server failure
2. DECIDE
- If network issue: Do NOT failover, wait for network recovery
- If server failure: Proceed with failover
3. FENCE THE OLD PRIMARY
- Remove from load balancer
- If possible, power off or isolate network
4. SELECT NEW PRIMARY
- Choose replica with least replication lag
- Verify replica is healthy
5. PROMOTE
- Run: pg_ctl promote -D /data/postgres
- Verify: SELECT pg_is_in_recovery(); -- should return false
6. RECONFIGURE
- Update connection strings / PgBouncer
- Remaining replicas point to new primary
7. VERIFY
- Test write: Create test URL
- Test read: Query for test URL
- Monitor replication from new primary to replicas
8. CLEANUP
- When old primary recovers: reinitialize as replica
- Post-incident review within 24 hours
ESTIMATED DOWNTIME: 30-60 seconds (writes only; reads continue)
Part IV: Discussion and Trade-offs
Chapter 9: The Hard Questions
These are the questions you should be asking (and answering) in your design sessions.
9.1 "User creates URL, immediately shares it, friend clicks—what happens with async replication?"
This is the core replication lag problem. Your answer should include:
- Acknowledge the problem: Yes, with async replication, the friend might get a 404.
- Quantify the risk: Replication lag is typically <100ms. The friend would have to click within 100ms of the share—unlikely but possible.
- Propose mitigations:
  - Write-through cache at creation time (friend hits cache, not replica)
  - Semi-sync replication (wait for one replica before acknowledging)
  - Accept the edge case (friend retries, it works)
- Make a recommendation: "For a URL shortener, I'd use write-through cache. It's simple, and we already have Redis in the architecture. The URL is cached before the user even finishes sharing."
9.2 "When would you accept stale reads? When not?"
Accept stale reads:
- View counters, like counts (approximate is fine)
- News feeds (seeing a post 30 seconds late doesn't matter)
- Product listings (prices don't change every second)
- Analytics dashboards (real-time accuracy isn't required)
Reject stale reads:
- Account balances (stale balance → wrong financial decisions)
- Inventory counts (stale inventory → overselling)
- Permission checks (stale permissions → security hole)
- User authentication state (stale logout → security issue)
The URL shortener case:
- URL resolution: Can be eventually consistent. A redirect served from a replica that is <100ms behind is harmless.
- URL creation: Must be strongly consistent. User needs confirmation that the URL exists.
- URL deletion: Tricky. If we delete but replicas still serve it, is that a problem? For most URL shorteners, yes—the user deleted it for a reason.
9.3 "What replication factor would you use?"
Replication factor = number of copies of each piece of data
Common choices:
- RF=1: No replication. Don't do this in production.
- RF=2: Can survive 1 node failure. Common for cost-sensitive workloads.
- RF=3: Can survive 1 failure with no degradation (quorum still possible). Standard choice.
- RF=5: Can survive 2 failures. For highly critical data.
For the URL shortener: RF=3 is appropriate. It's standard for production workloads, allows maintenance without risk, and the storage overhead is acceptable.
9.4 "How do you handle a replica that's fallen hours behind?"
Detection: Monitor replication lag. Alert if lag exceeds threshold (e.g., 1 minute).
Diagnosis: Why is it behind?
- Network issue: Fix network, let it catch up.
- Slow disk: Replica can't apply changes fast enough.
- Long-running query: Blocking replication on the replica.
Recovery options:
- Let it catch up: If the primary is retaining WAL, the replica will eventually catch up.
- Remove from rotation: Stop sending reads to this replica until it catches up.
- Re-sync from scratch: If too far behind or WAL is gone, rebuild the replica from a backup.
For URL shortener: Remove from read rotation immediately. A replica hours behind is serving stale data. If it's more than 1 hour behind, consider rebuilding—faster than catching up.
Chapter 10: Session Summary
What You Should Know Now
After this session, you should be able to:
- Explain leader-follower, multi-leader, and leaderless architectures with trade-offs
- Design a replication strategy based on consistency and availability requirements
- Identify replication lag anomalies and their solutions
- Handle failover scenarios with explicit fencing strategies
- Extend a partitioned system with replication (the URL shortener)
Key Trade-offs to Remember
| Decision | Trade-off |
|---|---|
| Sync vs Async replication | Consistency vs Write latency |
| More replicas | Better read scaling vs Higher write replication cost |
| Leader-follower vs Multi-leader | Simplicity vs Multi-region write latency |
| Strong vs Eventual consistency | Correctness vs Performance/Availability |
| Higher replication factor | Better durability vs Storage and write cost |
Questions to Ask in Every Design
- What consistency level does this data actually need?
- What's the acceptable replication lag?
- What happens during failover? How long is the outage?
- Can readers tolerate stale data? For how long?
- What's the strategy for a replica that falls far behind?
Part V: Interview Questions and Answers
Chapter 11: Real-World Interview Scenarios
11.1 Conceptual Questions
Question 1: "Explain the difference between synchronous and asynchronous replication."
Interviewer's Intent: Testing fundamental understanding.
Strong Answer:
"In synchronous replication, the primary waits for at least one replica to confirm the write before acknowledging to the client. This guarantees that if the primary dies, no acknowledged writes are lost. The cost is latency—every write incurs a network round-trip to the replica.
In asynchronous replication, the primary acknowledges immediately and replicates in the background. Writes are faster, but if the primary dies before replicating, those writes are lost.
Most production systems use semi-synchronous: wait for one replica, replicate to others asynchronously. This balances durability with performance. PostgreSQL with synchronous_commit = on and one sync standby is a common configuration."
Question 2: "What is replication lag and why does it matter?"
Interviewer's Intent: Testing understanding of practical implications.
Strong Answer:
"Replication lag is the delay between a write on the primary and that write appearing on a replica. In normal conditions, it's milliseconds. Under load or with network issues, it can grow to seconds or minutes.
It matters because reads from lagging replicas return stale data. This causes anomalies: a user updates their profile, refreshes, and sees the old version because their read hit a lagging replica.
There are specific consistency violations to watch for:
- Read-your-writes: User doesn't see their own recent write
- Monotonic reads: User sees data go backwards (fresher replica, then lagging one)
- Consistent prefix: Causally related writes appear out of order
Mitigations include sticky sessions, reading from primary for recent writes, and write-through caching."
Question 3: "What is split-brain and how do you prevent it?"
Interviewer's Intent: Testing understanding of distributed systems failure modes.
Strong Answer:
"Split-brain occurs when a network partition causes two nodes to both believe they're the primary. Each accepts writes, and when the partition heals, you have conflicting data.
Prevention requires fencing—ensuring the old primary can't accept writes after a new primary is elected. Common approaches:
- STONITH—literally power off the old primary via out-of-band management
- Lease-based leadership—the primary must regularly renew a lease in a consensus system like ZooKeeper. If it can't reach the lease store, it stops accepting writes
- Storage fencing—revoke the old primary's access to shared storage
I'd use lease-based fencing for most applications. It's less invasive than STONITH and works well with cloud infrastructure where IPMI access isn't always available."
Question 4: "Explain the CAP theorem in practical terms."
Interviewer's Intent: Testing ability to translate theory to practice.
Strong Answer:
"CAP says you can't have Consistency, Availability, and Partition tolerance simultaneously. Since network partitions will happen, you're really choosing between consistency and availability during a partition.
In practice, it's not binary—it's a spectrum. For example:
- During normal operation, you can have both C and A
- During a partition, you choose: reject writes (consistent but unavailable) or accept writes (available but potentially inconsistent)
Modern systems let you tune this per operation. DynamoDB lets you choose strongly consistent or eventually consistent reads. Cassandra lets you set quorum levels.
For the URL shortener example: I'd choose availability for reads (serve from any replica, even if stale) but consistency for URL creation (write must be durable before acknowledging). Different operations, different CAP positions."
11.2 Design Questions
Question 5: "Design the replication strategy for a global user profile service."
Interviewer's Intent: Testing end-to-end replication design.
Strong Answer:
"First, let me understand the access patterns. User profiles are read-heavy—users view profiles constantly, update rarely. Reads should be low-latency globally.
Architecture: Leader-follower with regional read replicas.
Primary (us-east-1)
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
EU Replicas Asia Replicas SA Replicas
Writes go to the primary in us-east-1. Users accept the latency for updates since they're infrequent.
Reads go to the nearest regional replica. Replication is asynchronous—users can tolerate seeing their profile update a few hundred milliseconds late.
For read-your-writes consistency: when a user updates their profile, I'd write-through to a global cache (something like Redis Global Datastore or a CDN cache). The next read hits the cache, not the lagging replica.
For cross-region failover: if us-east-1 goes down, I'd fail over to eu-west-1. This requires promotion of a replica to primary and reconfiguring the other replicas. Failover time would be 30-60 seconds with automation.
Replication factor: 2 replicas per region (total of 7 nodes including primary). This gives fault tolerance within each region and allows maintenance without risk."
Question 6: "How would you handle replication for a financial ledger?"
Interviewer's Intent: Testing understanding of strong consistency requirements.
Strong Answer:
"Financial ledgers are the poster child for strong consistency. An incorrect balance or double-spend is unacceptable. I'd design very differently than the profile service.
Replication strategy: Synchronous replication to at least one replica before acknowledging any write. I'd use a consensus protocol (Raft, Paxos) rather than simple leader-follower to guarantee no data loss during failover.
Write ──▶ Primary ──▶ Sync Replica 1 ──▶ ACK
│
└──▶ Async Replica 2, 3 (for read scaling)
Consistency: All reads go to the primary or sync replicas only. I'd never serve a balance from an async replica.
Availability trade-off: During a partition, I'd reject writes rather than risk inconsistency. Financial correctness > availability. Users see an error message, but their money is safe.
Failover: Use consensus-based election. The new leader is guaranteed to have all committed transactions. Fencing is critical—I'd use lease-based fencing with a short TTL (5-10 seconds).
For audit compliance: Keep the replication log (WAL) immutable and archived. This provides a complete history of all transactions for regulatory review.
The key insight: this is exactly the opposite of the profile service. Strong consistency, sacrifice availability, synchronous replication. The requirements drive fundamentally different designs."
11.3 Scenario-Based Questions
Question 7: "Your replica is 2 hours behind. Walk me through how you'd handle this."
Interviewer's Intent: Testing operational skills.
Strong Answer:
"First, I'd remove it from the read rotation immediately. A replica 2 hours behind is serving data that's 2 hours stale—that's unacceptable for most applications.
Then diagnosis:
- Check network: Is there a network issue between primary and replica? I'd check packet loss, latency, bandwidth utilization.
- Check replica health: Is the replica CPU/IO bound? Is something blocking replication apply (long-running queries, locks)?
- Check primary: Is the primary WAL generation unusually high? Maybe there's a bulk load or migration running.
For recovery, it depends on the cause:
- If it's a temporary issue (network blip), and the primary is retaining WAL, I'd let it catch up. Monitor the lag—it should decrease steadily.
- If the replica can't keep up (undersized hardware), I'd either upgrade it or rebuild on better hardware.
- If the primary has already deleted the required WAL (no replication slots), I'd need to rebuild the replica from a fresh backup.
The decision point: if catching up will take longer than rebuilding (say, lag > 24 hours), rebuilding is often faster with modern tooling like pg_basebackup.
Throughout, I'd communicate status to the team and update our incident channel. A degraded replica affects our read capacity and fault tolerance."
Question 8: "Your primary database just died. Walk me through the failover process."
Interviewer's Intent: Testing incident response skills.
Strong Answer:
"Assuming we have automated failover configured (and we should), here's what happens and what I'd verify:
Automated steps:
- Health checks detect primary is unresponsive (typically 10-30 second timeout)
- Failover coordinator (Patroni, RDS, etc.) initiates failover
- Sync replica is promoted to primary
- Connection poolers/load balancers are updated
- Other replicas are reconfigured to follow the new primary
My immediate actions:
- Acknowledge the alert, join incident channel
- Verify failover completed: Can we write to the new primary?
- Check application health: Are services reconnecting successfully?
- Monitor for any data issues: Did we lose any writes?
If automated failover didn't trigger:
- Manually verify primary is actually down (not just monitoring failure)
- Fence the old primary (remove from network/load balancer)
- Identify the most up-to-date replica
- Promote it manually: pg_ctl promote or equivalent
- Update connection configuration
- Verify writes work
Post-incident:
- Investigate why primary died (hardware? OOM? disk full?)
- When old primary is recovered, reinitialize it as a replica
- Write up incident report with timeline and learnings
- If automated failover failed, fix the automation
Expected timeline: 30-60 seconds for automated failover, 5-15 minutes for manual if automation fails."
Question 9: "How would you test your replication and failover setup?"
Interviewer's Intent: Testing proactive reliability thinking.
Strong Answer:
"Testing replication should happen regularly, not just after setup.
Continuous monitoring:
- Replication lag metrics on all replicas (alert if > threshold)
- Replica health checks (is the replica applying WAL?)
- Primary WAL generation rate (unusual spike might indicate issues)
Regular testing (monthly):
- Planned failover drill: Deliberately promote a replica, verify applications handle it gracefully, then fail back.
- Chaos engineering: Kill the primary unexpectedly (in staging), verify automated failover works.
- Lag simulation: Introduce network latency to a replica, verify monitoring detects it and the replica is removed from rotation.
Pre-production validation:
- Before any replication config change, test in staging
- Verify both automated and manual failover work
- Test the scenario where automated failover fails and you need manual intervention
Application-level testing:
- Verify applications reconnect after failover (connection string changes, timeouts)
- Test that applications handle brief write unavailability gracefully
- Verify no data corruption after failover
Documentation:
- Runbook for manual failover should be tested annually by someone who wasn't involved in writing it
- If they can't follow it, the runbook is broken
The goal is confidence: when a real failover happens at 3 AM, you know it will work because you've tested it repeatedly."
11.4 Deep-Dive Questions
Question 10: "Compare replication in PostgreSQL, MySQL, and MongoDB."
Interviewer's Intent: Testing breadth of knowledge.
Strong Answer:
"PostgreSQL uses streaming replication by default. WAL segments are shipped to replicas over a persistent connection. It supports synchronous replication (wait for N replicas), replication slots (ensure WAL is retained for slow replicas), and logical replication (for cross-version or selective replication).
Strengths: Very robust, battle-tested. Synchronous replication is straightforward. Logical replication allows flexible topologies. Weaknesses: Failover isn't built-in—you need external tools like Patroni or Stolon.
MySQL uses binary log replication. Similar concept to WAL, but MySQL has more replication modes: statement-based (replicates SQL), row-based (replicates actual data changes), or mixed. MySQL Group Replication adds consensus-based replication for automatic failover.
Strengths: Flexible replication formats. Group Replication provides built-in automatic failover. Weaknesses: Historically more complex to configure correctly. GTIDs (Global Transaction IDs) make it better, but legacy setups can be tricky.
MongoDB uses replica sets: a primary and multiple secondaries with automatic failover via Raft consensus. Writes go to primary; reads can go to secondaries with various read preferences.
Strengths: Built-in automatic failover, no external tools needed. Read preference is configurable per query. Weaknesses: Write concern configuration can be confusing, and older default settings (acknowledgment from the primary alone) prioritized availability over durability.
For our URL shortener, I'd likely use PostgreSQL with Patroni for automatic failover. It's what I know best, and the tooling is mature. But MongoDB's built-in replica sets would also work well—simpler to set up, automatic failover out of the box."
Question 11: "What is quorum-based replication and when would you use it?"
Interviewer's Intent: Testing understanding of consistency mechanisms.
Strong Answer:
"Quorum replication requires a minimum number of nodes to agree before considering an operation complete. With N nodes, you configure W (write quorum) and R (read quorum). If W + R > N, reads always see the latest write because at least one node participates in both.
Example with N=5:
- W=3, R=3: Majority for both reads and writes. Strong consistency.
- W=1, R=5: Fast writes, expensive reads. All nodes must respond to read.
- W=5, R=1: Expensive writes (all nodes), fast reads (any single node).
Cassandra and Dynamo-style stores use this model (DynamoDB itself exposes a simpler choice between strongly consistent and eventually consistent reads). Common Cassandra settings:
- QUORUM (W=majority, R=majority): Balanced consistency/performance
- ONE (W=1, R=1): Fastest, but eventual consistency
- ALL (W=all, R=all): Strongest, but any node failure blocks operations
When to use it:
- When you need tunable consistency per operation
- When you want availability during partial failures (quorum can succeed without all nodes)
- For distributed databases without a single leader
Trade-offs:
- Higher latency (waiting for multiple nodes)
- Tail latency is the slowest node in the quorum
- More complex client logic
For the URL shortener, I probably wouldn't use quorum-based replication. Leader-follower is simpler and sufficient. But if I were building a globally distributed key-value store where each region needs to accept writes, quorum-based approaches like Dynamo-style replication would be appropriate."
Chapter 12: Interview Preparation Checklist
Before your interview, make sure you can:
Concepts
- Explain sync vs async replication and their trade-offs
- Describe leader-follower, multi-leader, and leaderless architectures
- Define replication lag and its anomalies (read-your-writes, monotonic reads)
- Explain split-brain and fencing mechanisms
Design Skills
- Add replication to a partitioned system design
- Choose consistency levels based on requirements
- Design for multi-region deployment
- Handle failover scenarios explicitly
Operational Knowledge
- Walk through a failover runbook
- Diagnose and resolve a lagging replica
- Test replication and failover procedures
- Monitor replication health
Real Systems
- Know replication in PostgreSQL, MySQL, or MongoDB (at least one well)
- Understand quorum-based systems (Cassandra, DynamoDB)
- Cite real-world examples (how does X company handle this?)
Exercises
Exercise 1: Consistency Level Selection
For each scenario, choose the appropriate consistency level and justify:
- Twitter-like feed: User posts a tweet, expects to see it immediately in their own feed
- Inventory count: E-commerce site showing "Only 3 left in stock!"
- Like counter: Showing "1.2M likes" on a viral post
- Bank transfer: Moving money between accounts
- Password change: User resets password, tries to log in immediately
Exercise 2: Failover Design
Design the failover procedure for a three-node PostgreSQL cluster (1 primary, 2 replicas) running a payment processing system. Consider:
- Detection time
- Fencing mechanism
- Promotion criteria
- Application impact
- Data loss risk
Exercise 3: Replication Lag Mitigation
Your application has a flow: user submits a form, you save to the database, then redirect to a "success" page that reads the data. Users are occasionally seeing "not found" errors. Design three different solutions with trade-offs for each.
Further Reading
- "Designing Data-Intensive Applications" Chapter 5: Replication (the definitive reference)
- Jepsen.io: Kyle Kingsbury's analyses of database consistency under failure
- GitHub's MySQL High Availability: How they handle replication and failover
- Stripe's Online Migrations: Maintaining consistency during schema changes
- Patroni Documentation: Automatic failover for PostgreSQL
Appendix: Code Reference
A.1 Replication Lag Monitoring
from dataclasses import dataclass
from typing import List, Optional
import time
@dataclass
class ReplicaStatus:
hostname: str
lag_bytes: int
lag_seconds: float
state: str # e.g. 'streaming' or 'catchup'; a disconnected replica drops out of pg_stat_replication entirely
class ReplicationMonitor:
def __init__(self, primary_conn, alert_threshold_seconds: float = 30):
self.primary_conn = primary_conn
self.alert_threshold = alert_threshold_seconds
def get_replica_status(self) -> List[ReplicaStatus]:
"""Query primary for replication status."""
query = """
SELECT
client_addr as hostname,
pg_wal_lsn_diff(sent_lsn, replay_lsn) as lag_bytes,
EXTRACT(EPOCH FROM COALESCE(replay_lag, interval '0')) as lag_seconds,  -- replay_lag requires PostgreSQL 10+
state
FROM pg_stat_replication
"""
results = self.primary_conn.execute(query)
return [
ReplicaStatus(
hostname=r['hostname'],
lag_bytes=r['lag_bytes'],
lag_seconds=r['lag_seconds'],
state=r['state']
)
for r in results
]
def check_health(self) -> List[str]:
"""Return list of alerts for unhealthy replicas."""
alerts = []
replicas = self.get_replica_status()
for replica in replicas:
if replica.state != 'streaming':
alerts.append(f"{replica.hostname}: State is {replica.state}, not streaming")
elif replica.lag_seconds > self.alert_threshold:
alerts.append(f"{replica.hostname}: Lag is {replica.lag_seconds:.1f}s (threshold: {self.alert_threshold}s)")
return alerts
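For symmetry with A.2 and A.3, a usage sketch (the primary connection object and the alerting mechanism are assumed to exist elsewhere):
# Usage example
monitor = ReplicationMonitor(primary_conn=primary, alert_threshold_seconds=30)
for alert in monitor.check_health():
    # In production this would page the on-call or post to the incident channel.
    print(f"REPLICATION ALERT: {alert}")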
A.2 Read-Your-Writes Consistency Helper
import time
from typing import Optional
import redis
class ReadYourWritesRouter:
"""
Routes reads to primary for recently-written keys,
otherwise to replicas.
"""
def __init__(
self,
primary_conn,
replica_pool,
cache: redis.Redis,
stickiness_seconds: float = 5.0
):
self.primary = primary_conn
self.replicas = replica_pool
self.cache = cache
self.stickiness_seconds = stickiness_seconds
def record_write(self, key: str):
"""Call this after every write to track the key."""
# Store timestamp of write
self.cache.setex(
f"written:{key}",
int(self.stickiness_seconds),
str(time.time())
)
def get_connection_for_read(self, key: str):
"""
Returns primary connection if key was recently written,
otherwise returns a replica connection.
"""
written_at = self.cache.get(f"written:{key}")
if written_at:
# Recently written - use primary
return self.primary
else:
# Safe to use replica
return self.replicas.get_random()
def read(self, key: str, query: str, params: tuple) -> Optional[dict]:
"""Execute a read with appropriate routing."""
conn = self.get_connection_for_read(key)
return conn.execute(query, params)
def write(self, key: str, query: str, params: tuple) -> dict:
"""Execute a write and track it."""
result = self.primary.execute(query, params)
self.record_write(key)
return result
# Usage example
router = ReadYourWritesRouter(primary, replica_pool, redis_client)
# Write - goes to primary, tracked for 5 seconds
router.write(
key="user:123",
query="UPDATE users SET email = %s WHERE id = %s",
params=("new@example.com", 123)
)
# Immediate read - routed to primary because of recent write
result = router.read(
key="user:123",
query="SELECT * FROM users WHERE id = %s",
params=(123,)
)
# Read after 5 seconds - safe to go to replica
time.sleep(5)
result = router.read(...) # Goes to replica
A.3 Automatic Failover Detection
import time
import threading
from enum import Enum
from typing import Callable, Optional
class NodeState(Enum):
PRIMARY = "primary"
REPLICA = "replica"
FAILED = "failed"
UNKNOWN = "unknown"
class FailoverDetector:
"""
Monitors primary health and triggers failover callback
when primary is determined to be down.
"""
def __init__(
self,
primary_conn,
health_check_interval: float = 1.0,
failure_threshold: int = 3,
on_failover: Optional[Callable] = None
):
self.primary = primary_conn
self.interval = health_check_interval
self.threshold = failure_threshold
self.on_failover = on_failover
self.consecutive_failures = 0
self.state = NodeState.UNKNOWN
self._running = False
self._thread = None
def health_check(self) -> bool:
"""Returns True if primary is healthy."""
try:
result = self.primary.execute("SELECT 1", timeout=1.0)
return result is not None
except Exception:
return False
def _monitor_loop(self):
while self._running:
healthy = self.health_check()
if healthy:
self.consecutive_failures = 0
self.state = NodeState.PRIMARY
else:
self.consecutive_failures += 1
if self.consecutive_failures >= self.threshold:
self.state = NodeState.FAILED
if self.on_failover:
self.on_failover()
# Reset after triggering
self.consecutive_failures = 0
time.sleep(self.interval)
def start(self):
self._running = True
self._thread = threading.Thread(target=self._monitor_loop, daemon=True)
self._thread.start()
def stop(self):
self._running = False
if self._thread:
self._thread.join()
# Usage
def handle_failover():
print("Primary failed! Initiating failover...")
# 1. Fence old primary
# 2. Promote replica
# 3. Update connection strings
# 4. Alert on-call
detector = FailoverDetector(
primary_conn=primary,
health_check_interval=1.0,
failure_threshold=3,
on_failover=handle_failover
)
detector.start()
End of Day 2: Replication Trade-offs
Tomorrow: Day 3 — Rate Limiting at Scale. We'll design a distributed rate limiter that handles 10K requests/second across 50 servers, and confront the question: when a node fails, do you over-allow or under-allow?