Week 5 Preview: Consistency and Coordination
System Design Mastery Series
Welcome to Week 5
Last week, you mastered caching — patterns, invalidation, thundering herd protection, and multi-tier architectures. You learned how to make systems fast by strategically storing data closer to users.
But caching introduced a fundamental challenge: stale data.
THE CACHING TRADE-OFF
You accepted eventual consistency for performance:
├── Price cached for 30 seconds (might be stale)
├── Inventory cached for 30 seconds (might be stale)
├── Feed computed from cached data (might be stale)
└── CDN serves content that's 60 seconds old
This works for DISPLAY purposes...
But what about CRITICAL operations?
User A sees: "1 item in stock"
User B sees: "1 item in stock"
Both click "Buy Now" at the same time
Who gets the item?
What happens to the other order?
This week, we confront the hardest problems in distributed systems: consistency and coordination.
The Week Ahead
Theme: "Understand the Cost of Consistency"
WEEK 5 STRUCTURE
Day 1: Consistency Models
├── Strong vs Eventual vs Causal
├── What does "consistent" actually mean?
├── Read-your-writes, monotonic reads
└── CAP theorem in practice (not theory)
Day 2: Distributed Transactions — Saga Pattern
├── Why 2PC is avoided in microservices
├── Choreography vs Orchestration
├── Compensation and rollback
└── Handling compensation failures
Day 3: Saga Orchestration Deep Dive
├── State machines for workflows
├── Temporal/Cadence workflow engines
├── Durable execution guarantees
└── Versioning and evolution
Day 4: Conflict Resolution
├── Last-write-wins problems
├── Vector clocks and version vectors
├── CRDTs for automatic resolution
└── Merge strategies for business data
Day 5: Leader Election and Coordination
├── Why leader election is hard
├── Fencing tokens and split-brain
├── ZooKeeper, etcd, Consul patterns
└── Distributed locks in practice
Why This Matters Now
The Inventory Problem (Revisited)
In Week 4, we cached inventory with a 30-second TTL. But what happens at checkout?
THE OVERSELL PROBLEM
Timeline:
T+0.0s: Inventory = 1 (truth in database)
T+0.0s: Cache shows 1 (cached value)
T+0.1s: User A clicks "Buy" → Sees 1 in stock
T+0.2s: User B clicks "Buy" → Sees 1 in stock
T+0.3s: User A's order processes → Inventory = 0
T+0.4s: User B's order processes → Inventory = -1 ???
RESULT: We sold 2 items when we only had 1!
Options:
1. STRONG CONSISTENCY: Lock inventory during checkout
└── Slower, but correct
2. OPTIMISTIC CONCURRENCY: Check-and-set at commit
└── Faster, occasional failures
3. COMPENSATION: Detect oversell, cancel one order
└── Complex, bad user experience
Which one? Depends on your consistency requirements.
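Option 2 (optimistic concurrency) can be made concrete with a minimal in-memory sketch. All names here (`Inventory`, `try_purchase`) are illustrative, not from any specific library; a real system would do the check-and-set inside the database, e.g. with a versioned UPDATE.

```python
class Inventory:
    """Toy single-item inventory with a version counter for check-and-set."""

    def __init__(self, quantity: int):
        self.quantity = quantity
        self.version = 0  # bumped on every successful write

    def read(self) -> tuple[int, int]:
        """Return (quantity, version); the version is the caller's CAS token."""
        return self.quantity, self.version

    def try_purchase(self, expected_version: int) -> bool:
        """Check-and-set: commit only if nobody wrote since we read."""
        if self.version != expected_version or self.quantity < 1:
            return False  # conflict or sold out — caller retries or gives up
        self.quantity -= 1
        self.version += 1
        return True


# Two users read the same state, but only one check-and-set succeeds:
item = Inventory(quantity=1)
_, version_a = item.read()
_, version_b = item.read()
assert item.try_purchase(version_a) is True   # User A wins
assert item.try_purchase(version_b) is False  # User B conflicts — no oversell
```

The key property: the losing request fails loudly at commit time instead of silently driving inventory negative.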
The Money Transfer Problem
DISTRIBUTED TRANSACTION CHALLENGE
Transfer $100 from Account A to Account B:
Step 1: Debit $100 from Account A ✓
Step 2: Credit $100 to Account B ✗ (Service B down!)
RESULT: $100 disappeared!
In a single database: ROLLBACK
In microservices: ???
This is why we need Sagas.
The Collaborative Editing Problem
CONFLICT RESOLUTION CHALLENGE
Google Docs scenario:
User A (offline): Types "Hello World" at position 0
User B (offline): Types "Goodbye" at position 0
Both sync at the same time.
Expected result: Both edits appear
Actual result with last-write-wins: One edit lost!
Who wins? How do you merge?
This is why we need CRDTs.
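Before CRDTs, it helps to see how a system even detects that two edits conflict. A hedged sketch of vector-clock comparison, with made-up clock values: if neither clock "happens before" the other, the edits are concurrent and last-write-wins would silently drop one of them.

```python
def happens_before(vc_x: dict, vc_y: dict) -> bool:
    """True if vector clock vc_x is strictly earlier than vc_y."""
    nodes = set(vc_x) | set(vc_y)
    return (all(vc_x.get(n, 0) <= vc_y.get(n, 0) for n in nodes)
            and vc_x != vc_y)

def concurrent(vc_x: dict, vc_y: dict) -> bool:
    """Neither clock precedes the other: a real conflict to merge."""
    return not happens_before(vc_x, vc_y) and not happens_before(vc_y, vc_x)


# Both users edited offline from the same base version {A: 1, B: 1}:
edit_a = {"A": 2, "B": 1}  # User A's offline edit
edit_b = {"A": 1, "B": 2}  # User B's offline edit
assert concurrent(edit_a, edit_b)  # conflict detected — must merge, not overwrite
```

Vector clocks tell you *that* a conflict exists; CRDTs (Day 4) decide *how* to resolve it automatically.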
The Leader Election Problem
THE SPLIT-BRAIN NIGHTMARE
Distributed job scheduler:
├── Node A thinks it's the leader
├── Node B thinks it's the leader
└── Both schedule the same job!
Result: Job runs twice, data corrupted
Or worse:
├── Network partition heals
├── Both leaders have made different decisions
└── System state is inconsistent
How do you ensure exactly one leader?
This is why leader election is hard.
What You'll Build This Week
Day 1: Consistency-Aware Inventory System
# Preview: Consistency levels for different operations
class InventoryService:
    async def check_availability(self, product_id: str) -> int:
        """
        Display inventory count.
        Eventual consistency is fine — cached value OK.
        """
        return await self.cache.get(f"inventory:{product_id}")

    async def reserve_inventory(self, product_id: str, quantity: int) -> bool:
        """
        Reserve inventory for checkout.
        STRONG consistency required — must read from primary.
        """
        async with self.db.transaction(isolation="SERIALIZABLE"):
            current = await self.db.fetch_one(
                "SELECT quantity FROM inventory WHERE product_id = $1 FOR UPDATE",
                product_id
            )
            if current['quantity'] >= quantity:
                await self.db.execute(
                    "UPDATE inventory SET quantity = quantity - $2 WHERE product_id = $1",
                    product_id, quantity
                )
                return True
            return False
Day 2: Order Processing Saga
# Preview: Saga pattern for order processing
class OrderSaga:
    """
    Order processing with compensation.

    Steps:
      1. Reserve inventory
      2. Charge payment
      3. Create shipment

    If any step fails, compensate previous steps.
    """
    async def execute(self, order: Order) -> SagaResult:
        # Step 1: Reserve inventory
        reservation = await self.inventory.reserve(order.items)
        if not reservation.success:
            return SagaResult.failed("Insufficient inventory")

        # Step 2: Charge payment
        payment = await self.payment.charge(order.total)
        if not payment.success:
            # COMPENSATE: Release inventory
            await self.inventory.release(reservation.id)
            return SagaResult.failed("Payment failed")

        # Step 3: Create shipment
        shipment = await self.shipping.create(order)
        if not shipment.success:
            # COMPENSATE: Refund payment, release inventory
            await self.payment.refund(payment.id)
            await self.inventory.release(reservation.id)
            return SagaResult.failed("Shipment failed")

        return SagaResult.success(order_id=order.id)
Day 3: Workflow Orchestration
# Preview: Temporal-style workflow
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class OrderWorkflow:
    """
    Durable workflow for order processing.
    Survives crashes, retries automatically,
    maintains exactly-once semantics.
    """
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        # Each activity is durable — survives crashes
        reservation = await workflow.execute_activity(
            reserve_inventory,
            order.items,
            start_to_close_timeout=timedelta(seconds=30)
        )
        try:
            payment = await workflow.execute_activity(
                charge_payment,
                order.total,
                start_to_close_timeout=timedelta(seconds=60)
            )
        except PaymentFailed:
            # Compensation activity (activities always need a timeout)
            await workflow.execute_activity(
                release_inventory,
                reservation.id,
                start_to_close_timeout=timedelta(seconds=30)
            )
            raise
        # Continue with shipment...
Day 4: Shopping Cart Merge
# Preview: CRDT-based shopping cart
class ShoppingCartCRDT:
    """
    Conflict-free shopping cart that works offline.
    Uses OR-Set (Observed-Remove Set) for items.
    Automatically merges without conflicts.
    """
    def __init__(self):
        self.items = ORSet()  # CRDT set

    def add_item(self, product_id: str, quantity: int):
        """Add item — always succeeds, never conflicts."""
        self.items.add((product_id, quantity, unique_tag()))

    def remove_item(self, product_id: str):
        """Remove item — only removes observed additions."""
        for item in self.items.lookup(product_id):
            self.items.remove(item)

    def merge(self, other: 'ShoppingCartCRDT'):
        """
        Merge two carts — automatic conflict resolution.

        If User A adds item X on phone (offline)
        and User B adds item Y on laptop (offline),
        both items appear after merge.
        """
        self.items.merge(other.items)
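The `ORSet` above is only named in the preview, so here is a hypothetical minimal implementation to make the merge behaviour concrete. Production CRDT libraries (Automerge, Yjs, Riak data types) handle garbage collection and causality far more carefully.

```python
import uuid

class ORSet:
    """Minimal Observed-Remove Set: adds win over concurrent removes."""

    def __init__(self):
        self.adds = set()     # (element, unique_tag) pairs seen as added
        self.removes = set()  # tags seen as removed

    def add(self, element):
        # Each add gets a fresh tag, so re-adding after a remove works
        self.adds.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Remove only the additions this replica has actually observed
        self.removes |= {tag for (e, tag) in self.adds if e == element}

    def merge(self, other: "ORSet"):
        # Union both sides — commutative, associative, idempotent
        self.adds |= other.adds
        self.removes |= other.removes

    def values(self):
        return {e for (e, tag) in self.adds if tag not in self.removes}


# Two offline replicas diverge, then merge — both edits survive:
phone, laptop = ORSet(), ORSet()
phone.add("item-X")
laptop.add("item-Y")
phone.merge(laptop)
laptop.merge(phone)
assert phone.values() == laptop.values() == {"item-X", "item-Y"}
```

Because merge is a set union, replicas can sync in any order, any number of times, and still converge to the same cart.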
Day 5: Distributed Lock with Fencing
# Preview: Safe distributed locking
class FencedLockService:
    """
    Distributed lock with fencing tokens.

    Prevents split-brain: if two nodes think they hold the lock,
    only the one with the higher fencing token succeeds.
    """
    async def acquire(self, resource: str, holder: str) -> Optional[Lock]:
        """Acquire lock and get fencing token."""
        acquired = await self.redis.set(
            f"lock:{resource}",
            holder,
            nx=True,
            ex=30
        )
        if acquired:
            # Get monotonically increasing fencing token
            token = await self.redis.incr(f"fence:{resource}")
            return Lock(resource=resource, holder=holder, fence_token=token)
        return None

    async def execute_with_lock(
        self,
        resource: str,
        operation: Callable,
        fence_token: int
    ):
        """
        Execute operation with fencing token.
        Storage layer validates token — rejects stale operations.
        """
        await self.storage.execute(
            operation,
            fence_token=fence_token  # Storage validates this
        )
Connecting to Previous Weeks
Building on Week 1-4 Foundations
CONCEPT EVOLUTION
Week 1 (Foundations):
├── Partitioning: Data spread across nodes
├── Replication: Copies for availability
└── Question raised: How do replicas stay in sync?
Week 2 (Failure-First):
├── Idempotency: Safe retries
├── Timeouts: Bounding uncertainty
└── Question raised: What if operation partially succeeds?
Week 3 (Messaging):
├── At-least-once: Messages may duplicate
├── Transactional outbox: Atomic publish
└── Question raised: How to maintain order across services?
Week 4 (Caching):
├── Eventual consistency: Stale data acceptable
├── Invalidation: Propagating changes
└── Question raised: When is eventual NOT enough?
Week 5 (This Week) answers all of these:
├── How replicas stay in sync → Consistency models
├── Partial success handling → Saga pattern
├── Order across services → Orchestration
└── When eventual isn't enough → Strong consistency
Key Concepts to Internalize
The Consistency Spectrum
CONSISTENCY MODELS SPECTRUM

STRONG                                            EVENTUAL
CONSISTENCY                                    CONSISTENCY
    │                                                  │
    ▼                                                  ▼
┌────────┬─────────────┬────────────┬───────────┬────────┐
│Lineari-│ Sequential  │ Causal     │ Read-your-│Eventual│
│zable   │ Consistency │ Consistency│ writes    │        │
└────────┴─────────────┴────────────┴───────────┴────────┘
    │          │              │            │          │
  "One       "All          "Cause       "See      "Eventually
  copy"     see same       before      your own    converges"
             order"        effect"     writes"

Cost:  HIGH ←──────────────────────────────────────→ LOW
Speed: SLOW ←──────────────────────────────────────→ FAST
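The middle of the spectrum is the least intuitive, so here is a hedged sketch of one common way to get read-your-writes: pin a user's reads to the primary for a short window after they write. `SessionRouter` and the plain-dict "stores" are made up for illustration.

```python
import time

class SessionRouter:
    """Route a user's reads to the primary shortly after their own write."""

    def __init__(self, primary: dict, replica: dict, pin_seconds: float = 5.0):
        self.primary, self.replica = primary, replica
        self.pin_seconds = pin_seconds
        self.last_write = {}  # user_id -> monotonic timestamp of last write

    def write(self, user_id: str, key: str, value: str):
        self.primary[key] = value
        self.last_write[user_id] = time.monotonic()

    def read(self, user_id: str, key: str):
        recently_wrote = (
            time.monotonic() - self.last_write.get(user_id, float("-inf"))
            < self.pin_seconds
        )
        # Writers see the primary; everyone else may read a lagging replica
        store = self.primary if recently_wrote else self.replica
        return store.get(key)


primary, replica = {}, {}  # pretend replication to `replica` is lagging
router = SessionRouter(primary, replica)
router.write("alice", "bio", "new")
assert router.read("alice", "bio") == "new"  # alice sees her own write
assert router.read("bob", "bio") is None     # bob may still read stale data
```

Note the guarantee is per-session, not global: bob reading stale data is still allowed, which is exactly what places read-your-writes toward the cheap end of the spectrum.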
The CAP Theorem (Practical View)
CAP IN PRACTICE
You don't "choose 2 of 3" at design time.
You choose per-operation at runtime.
Example: E-commerce platform
Operation: Browse products
├── Consistency needed: Low (eventual OK)
├── Availability needed: High (must work)
└── Decision: AP (cache reads, tolerate stale)
Operation: Place order
├── Consistency needed: High (no oversell)
├── Availability needed: Medium (brief unavailability OK)
└── Decision: CP (strong consistency, accept latency)
Operation: View order history
├── Consistency needed: Medium (read-your-writes)
├── Availability needed: High
└── Decision: Causal consistency
The Saga Pattern
SAGA vs 2PC
2PC (Two-Phase Commit):
├── Coordinator asks: "Ready to commit?"
├── All participants respond: "Yes"
├── Coordinator says: "Commit!"
├── All participants commit
│
├── Problem: Coordinator holding locks
├── Problem: Participant failure during commit
├── Problem: Blocking — everyone waits
└── Result: Rarely used in microservices
SAGA:
├── Execute T1 (has compensating C1)
├── Execute T2 (has compensating C2)
├── Execute T3 (has compensating C3)
│
├── If T3 fails:
│ ├── Execute C2 (undo T2)
│ └── Execute C1 (undo T1)
│
├── Advantage: No distributed locks
├── Advantage: Each step is independent
└── Result: Standard pattern in microservices
Interview Patterns This Week
Common Questions You'll Answer
CONSISTENCY QUESTIONS
Q: "How do you prevent overselling inventory?"
A: "Use optimistic concurrency with version checks, or
pessimistic locking with SELECT FOR UPDATE. The choice
depends on contention levels..."
Q: "How would you implement a money transfer across banks?"
A: "I'd use the Saga pattern with compensation. Debit first,
then credit. If credit fails, issue compensating refund..."
Q: "What happens if your distributed lock holder crashes?"
A: "The lock has a TTL and will expire. But the dangerous
case is if the holder is just slow — that's why we use
fencing tokens..."
Q: "How do you handle conflicts in offline-first apps?"
A: "Depends on the data. For counters, I'd use a CRDT like
G-Counter. For sets, an OR-Set. For arbitrary data,
we might need application-level merge logic..."
Q: "When would you use strong consistency over eventual?"
A: "For financial transactions, inventory reservations,
unique constraint enforcement. Anywhere that stale data
leads to business-critical errors..."
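The G-Counter mentioned in the offline-first answer is small enough to sketch in full. This is a standard construction (per-node counts merged by element-wise max); the class itself is written here for illustration, not taken from a library.

```python
class GCounter:
    """Grow-only counter CRDT: each node increments its own slot."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node_id -> count contributed by that node

    def increment(self, n: int = 1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter"):
        # Element-wise max — commutative, associative, idempotent
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # both replicas converge
```

Because each node only ever touches its own slot, concurrent increments can never conflict; the trade-off is that a G-Counter cannot decrement (that needs a PN-Counter, two G-Counters back to back).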
Preparing for the Week
Mindset Shift
FROM                      TO
────                      ──
"Make it fast"         →  "Make it correct, then fast"
"Cache everything"     →  "Cache what can be stale"
"Retry on failure"     →  "Compensate on failure"
"One database"         →  "Consistency across services"
"Lock and update"      →  "Optimistic concurrency"
Key Questions to Keep in Mind
As you go through each day, ask:
- What consistency level does this operation need?
- What happens if this operation partially succeeds?
- How do we detect and resolve conflicts?
- What's the compensation strategy if we need to roll back?
- Is there a single leader, or can multiple nodes act?
Quick Reference: The Week's Tools
| Concept | Tools/Technologies |
|---|---|
| Consistency | PostgreSQL isolation levels, CockroachDB, Spanner |
| Sagas | Temporal, Cadence, AWS Step Functions, custom |
| Coordination | ZooKeeper, etcd, Consul, Redis (limited) |
| CRDTs | Riak, Redis CRDT types, Automerge, Yjs |
| Distributed Locks | Redlock, ZooKeeper recipes, etcd leases |
Daily Time Investment
RECOMMENDED SCHEDULE
Day 1: Consistency Models (90 minutes)
├── Core reading: 45 min
├── Implementation review: 30 min
└── Practice problems: 15 min
Day 2: Saga Pattern (90 minutes)
├── Core reading: 45 min
├── Build simple saga: 30 min
└── Interview practice: 15 min
Day 3: Orchestration (90 minutes)
├── Core reading: 40 min
├── Temporal patterns: 35 min
└── Compare approaches: 15 min
Day 4: Conflict Resolution (90 minutes)
├── Core reading: 45 min
├── CRDT implementation: 30 min
└── Practice problems: 15 min
Day 5: Leader Election (90 minutes)
├── Core reading: 45 min
├── Distributed lock patterns: 30 min
└── Failure scenario analysis: 15 min
What Success Looks Like
By the end of Week 5, you should be able to:
WEEK 5 SUCCESS CRITERIA
□ Explain the trade-offs between consistency models
□ Design a Saga for any multi-service transaction
□ Implement compensation logic for rollback scenarios
□ Choose between choreography and orchestration
□ Understand when to use CRDTs vs manual merge
□ Implement a distributed lock with fencing tokens
□ Explain split-brain and how to prevent it
□ Design systems that degrade gracefully under partition
□ Answer any consistency-related interview question
Let's Begin
Week 5 tackles the hardest problems in distributed systems. These concepts separate junior engineers who "make it work" from senior engineers who "make it correct."
After this week, you'll understand:
- Why distributed systems are fundamentally different from single-machine systems
- How to reason about consistency requirements for any operation
- The patterns that make complex distributed transactions possible
- How to coordinate multiple nodes safely
Day 1 starts with the foundation: Consistency Models. We'll learn what "consistent" actually means in distributed systems, and why it's more nuanced than you might think.
Ready to master consistency? Let's begin with Day 1: Consistency Models — understanding what consistency really means in distributed systems.