Himanshu Kukreja

Week 5 Preview: Consistency and Coordination

System Design Mastery Series


Welcome to Week 5

Last week, you mastered caching — patterns, invalidation, thundering herd protection, and multi-tier architectures. You learned how to make systems fast by strategically storing data closer to users.

But caching introduced a fundamental challenge: stale data.

THE CACHING TRADE-OFF

You accepted eventual consistency for performance:
├── Price cached for 30 seconds (might be stale)
├── Inventory cached for 30 seconds (might be stale)
├── Feed computed from cached data (might be stale)
└── CDN serves content that's 60 seconds old

This works for DISPLAY purposes...

But what about CRITICAL operations?

User A sees: "1 item in stock"
User B sees: "1 item in stock"
Both click "Buy Now" at the same time

Who gets the item?
What happens to the other order?

This week, we confront the hardest problems in distributed systems: consistency and coordination.


The Week Ahead

Theme: "Understand the Cost of Consistency"

WEEK 5 STRUCTURE

Day 1: Consistency Models
       ├── Strong vs Eventual vs Causal
       ├── What does "consistent" actually mean?
       ├── Read-your-writes, monotonic reads
       └── CAP theorem in practice (not theory)

Day 2: Distributed Transactions — Saga Pattern
       ├── Why 2PC is avoided in microservices
       ├── Choreography vs Orchestration
       ├── Compensation and rollback
       └── Handling compensation failures

Day 3: Saga Orchestration Deep Dive
       ├── State machines for workflows
       ├── Temporal/Cadence workflow engines
       ├── Durable execution guarantees
       └── Versioning and evolution

Day 4: Conflict Resolution
       ├── Last-write-wins problems
       ├── Vector clocks and version vectors
       ├── CRDTs for automatic resolution
       └── Merge strategies for business data

Day 5: Leader Election and Coordination
       ├── Why leader election is hard
       ├── Fencing tokens and split-brain
       ├── ZooKeeper, etcd, Consul patterns
       └── Distributed locks in practice

Why This Matters Now

The Inventory Problem (Revisited)

In Week 4, we cached inventory with a 30-second TTL. But what happens at checkout?

THE OVERSELL PROBLEM

Timeline:
  T+0.0s: Inventory = 1 (truth in database)
  T+0.0s: Cache shows 1 (cached value)
  
  T+0.1s: User A clicks "Buy" → Sees 1 in stock
  T+0.2s: User B clicks "Buy" → Sees 1 in stock
  
  T+0.3s: User A's order processes → Inventory = 0
  T+0.4s: User B's order processes → Inventory = -1 ???

RESULT: We sold 2 items when we only had 1!

Options:
  1. STRONG CONSISTENCY: Lock inventory during checkout
     └── Slower, but correct
  
  2. OPTIMISTIC CONCURRENCY: Check-and-set at commit
     └── Faster, occasional failures
  
  3. COMPENSATION: Detect oversell, cancel one order
     └── Complex, bad user experience
  
Which one? Depends on your consistency requirements.
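Option 2 can be sketched as a single conditional UPDATE: the decrement only applies if enough stock is still there at commit time. This is a minimal illustration using an in-memory SQLite database; the table layout and the `reserve_optimistic` helper are assumptions for the example, not part of the course code.

```python
import sqlite3


def reserve_optimistic(conn: sqlite3.Connection, product_id: str, quantity: int) -> bool:
    """Check-and-set: decrement stock only if enough remains right now."""
    cur = conn.execute(
        "UPDATE inventory SET quantity = quantity - ? "
        "WHERE product_id = ? AND quantity >= ?",
        (quantity, product_id, quantity),
    )
    conn.commit()
    # rowcount is 1 when the condition held and the row changed, otherwise 0.
    return cur.rowcount == 1


# Hypothetical schema and data, mirroring the timeline above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id TEXT PRIMARY KEY, quantity INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('sku-1', 1)")
conn.commit()

user_a = reserve_optimistic(conn, "sku-1", 1)  # first buyer wins
user_b = reserve_optimistic(conn, "sku-1", 1)  # second buyer is rejected
remaining = conn.execute(
    "SELECT quantity FROM inventory WHERE product_id = 'sku-1'"
).fetchone()[0]
print(user_a, user_b, remaining)
```

The losing request fails fast instead of driving inventory negative, which is the "occasional failures" cost listed under option 2.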

The Money Transfer Problem

DISTRIBUTED TRANSACTION CHALLENGE

Transfer $100 from Account A to Account B:

Step 1: Debit $100 from Account A  ✓
Step 2: Credit $100 to Account B   ✗ (Service B down!)

RESULT: $100 left Account A but never reached Account B!

In a single database: ROLLBACK
In microservices: ???

This is why we need Sagas.

The Collaborative Editing Problem

CONFLICT RESOLUTION CHALLENGE

Google Docs scenario:

User A (offline): Types "Hello World" at position 0
User B (offline): Types "Goodbye" at position 0

Both sync at the same time.

Expected result: Both edits appear
Actual result with last-write-wins: One edit lost!

Who wins? How do you merge?
This is why we need CRDTs (or operational transformation, the approach Google Docs itself uses).
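Before two edits can be merged, the system has to notice that they were concurrent rather than one following the other. A vector clock comparison is one common way to do that; the sketch below is a simplified illustration, and the function names and replica ids ("A", "B") are assumptions for the example.

```python
from typing import Dict

VectorClock = Dict[str, int]  # replica id -> number of events seen from it


def dominates(a: VectorClock, b: VectorClock) -> bool:
    """True if a has seen everything b has seen (a >= b component-wise)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify the relationship between two versions."""
    a_ge_b = dominates(a, b)
    b_ge_a = dominates(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    return "concurrent"  # neither saw the other: a real conflict


edit_a = {"A": 1}             # User A typed while offline
edit_b = {"B": 1}             # User B typed while offline
merged = {"A": 1, "B": 1}     # a version that has seen both edits

print(compare(edit_a, edit_b))   # concurrent
print(compare(merged, edit_a))   # after
```

Last-write-wins would pick one of the two concurrent edits by timestamp and drop the other. Detecting "concurrent" only flags the conflict; resolving it still takes a CRDT or application-level merge logic.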

The Leader Election Problem

THE SPLIT-BRAIN NIGHTMARE

Distributed job scheduler:
├── Node A thinks it's the leader
├── Node B thinks it's the leader
└── Both schedule the same job!

Result: Job runs twice, data corrupted

Or worse:
├── Network partition heals
├── Both leaders have made different decisions
└── System state is inconsistent

How do you ensure exactly one leader?
This is why leader election is hard.

What You'll Build This Week

Day 1: Consistency-Aware Inventory System

# Preview: Consistency levels for different operations

class InventoryService:
    async def check_availability(self, product_id: str) -> int:
        """
        Display inventory count.
        Eventual consistency is fine — cached value OK.
        """
        return await self.cache.get(f"inventory:{product_id}")
    
    async def reserve_inventory(self, product_id: str, quantity: int) -> bool:
        """
        Reserve inventory for checkout.
        STRONG consistency required — must read from primary.
        """
        async with self.db.transaction(isolation="SERIALIZABLE"):
            current = await self.db.fetch_one(
                "SELECT quantity FROM inventory WHERE product_id = $1 FOR UPDATE",
                product_id
            )
            # fetch_one returns None when the product row does not exist
            if current is not None and current['quantity'] >= quantity:
                await self.db.execute(
                    "UPDATE inventory SET quantity = quantity - $2 WHERE product_id = $1",
                    product_id, quantity
                )
                return True
            return False

Day 2: Order Processing Saga

# Preview: Saga pattern for order processing

class OrderSaga:
    """
    Order processing with compensation.
    
    Steps:
    1. Reserve inventory
    2. Charge payment
    3. Create shipment
    
    If any step fails, compensate previous steps.
    """
    
    async def execute(self, order: Order) -> SagaResult:
        # Step 1: Reserve inventory
        reservation = await self.inventory.reserve(order.items)
        if not reservation.success:
            return SagaResult.failed("Insufficient inventory")
        
        # Step 2: Charge payment
        payment = await self.payment.charge(order.total)
        if not payment.success:
            # COMPENSATE: Release inventory
            await self.inventory.release(reservation.id)
            return SagaResult.failed("Payment failed")
        
        # Step 3: Create shipment
        shipment = await self.shipping.create(order)
        if not shipment.success:
            # COMPENSATE: Refund payment, release inventory
            await self.payment.refund(payment.id)
            await self.inventory.release(reservation.id)
            return SagaResult.failed("Shipment failed")
        
        return SagaResult.success(order_id=order.id)

Day 3: Workflow Orchestration

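# Preview: Temporal-style workflow
# (activities such as reserve_inventory are defined elsewhere)

from datetime import timedelta
from temporalio import workflow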

@workflow.defn
class OrderWorkflow:
    """
    Durable workflow for order processing.
    
    Survives crashes and retries activities automatically.
    Activities run at-least-once, so they must be idempotent.
    """
    
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        # Each activity is durable — survives crashes
        reservation = await workflow.execute_activity(
            reserve_inventory,
            order.items,
            start_to_close_timeout=timedelta(seconds=30)
        )
        
        try:
            payment = await workflow.execute_activity(
                charge_payment,
                order.total,
                start_to_close_timeout=timedelta(seconds=60)
            )
        except PaymentFailed:
            # Compensation activity
            await workflow.execute_activity(
                release_inventory,
                reservation.id,
                start_to_close_timeout=timedelta(seconds=30)
            )
            raise
        
        # Continue with shipment...
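Because a workflow engine may re-run an activity after a crash or timeout, handlers like charge_payment are usually written so that a retry has no additional effect. One common approach is an idempotency key. The sketch below uses a hypothetical in-memory `PaymentGateway` as a stand-in for a real payment provider; the class, its fields, and the key format are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PaymentGateway:
    """Hypothetical in-memory stand-in for a payment provider."""
    charges: Dict[str, int] = field(default_factory=dict)  # idempotency key -> cents
    calls: int = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        """Charge at most once per key; repeated calls return the same result."""
        self.calls += 1
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents
        return f"charge-{idempotency_key}"


gateway = PaymentGateway()
first = gateway.charge("order-42", 5000)
retry = gateway.charge("order-42", 5000)  # e.g. the engine retried after a timeout
total_charged = sum(gateway.charges.values())
print(first, retry, total_charged)
```

The activity was invoked twice, but the customer was charged once. A natural key is something stable across retries, such as the order id.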

Day 4: Shopping Cart Merge

# Preview: CRDT-based shopping cart

class ShoppingCartCRDT:
    """
    Conflict-free shopping cart that works offline.
    
    Uses OR-Set (Observed-Remove Set) for items.
    Automatically merges without conflicts.
    """
    
    def __init__(self):
        self.items = ORSet()  # CRDT set
    
    def add_item(self, product_id: str, quantity: int):
        """Add item — always succeeds, never conflicts."""
        self.items.add((product_id, quantity, unique_tag()))
    
    def remove_item(self, product_id: str):
        """Remove item — only removes observed additions."""
        # Copy to a list so removal does not mutate the set mid-iteration
        for item in list(self.items.lookup(product_id)):
            self.items.remove(item)
    
    def merge(self, other: 'ShoppingCartCRDT'):
        """
        Merge two carts — automatic conflict resolution.
        
        If User A adds item X on phone (offline)
        And User B adds item Y on laptop (offline)
        Both items appear after merge!
        """
        self.items.merge(other.items)
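The ORSet used above is not defined in this preview. A minimal sketch of how an Observed-Remove Set can work is shown below. It is deliberately simplified: elements are plain product ids, tags are generated inside `add`, and the method names (`contains`, `elements`) are assumptions that differ slightly from the cart's interface.

```python
import uuid
from typing import Any, Dict, Set


class ORSet:
    """Simplified Observed-Remove Set: an element is present if it has
    at least one add-tag that has not been removed."""

    def __init__(self) -> None:
        self.adds: Dict[Any, Set[str]] = {}     # element -> tags from add()
        self.removes: Dict[Any, Set[str]] = {}  # element -> tags that were removed

    def add(self, element: Any) -> None:
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element: Any) -> None:
        # Only tags observed locally are removed; unseen concurrent adds survive.
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element: Any) -> bool:
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def elements(self) -> Set[Any]:
        return {e for e in self.adds if self.contains(e)}

    def merge(self, other: "ORSet") -> None:
        # Union of both tag maps: commutative, associative, idempotent.
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        for element, tags in other.removes.items():
            self.removes.setdefault(element, set()).update(tags)


phone, laptop = ORSet(), ORSet()
phone.add("X")        # added offline on the phone
laptop.add("Y")       # added offline on the laptop
phone.merge(laptop)
laptop.merge(phone)
print(phone.elements() == laptop.elements())
```

A remove on one device only cancels the adds it has seen, so a concurrent re-add on another device wins after merge. For a shopping cart that bias is usually the safer one.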

Day 5: Distributed Lock with Fencing

# Preview: Safe distributed locking

class FencedLockService:
    """
    Distributed lock with fencing tokens.
    
    Prevents split-brain: If two nodes think they hold the lock,
    only the one with the higher fencing token succeeds.
    """
    
    async def acquire(self, resource: str, holder: str) -> Optional[Lock]:
        """Acquire lock and get fencing token."""
        lock = await self.redis.set(
            f"lock:{resource}",
            holder,
            nx=True,
            ex=30
        )
        if lock:
            # Get monotonically increasing fencing token
            token = await self.redis.incr(f"fence:{resource}")
            return Lock(resource=resource, holder=holder, fence_token=token)
        return None
    
    async def execute_with_lock(
        self,
        resource: str,
        operation: Callable,
        fence_token: int
    ):
        """
        Execute operation with fencing token.
        
        Storage layer validates token — rejects stale operations.
        """
        await self.storage.execute(
            operation,
            fence_token=fence_token  # Storage validates this
        )
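The storage-side check is what makes fencing work, so it is worth seeing in isolation. The sketch below is an in-memory illustration under stated assumptions: `FencedStorage`, `StaleTokenError`, and the token values are hypothetical, and a real store would have to perform the compare-and-write atomically.

```python
from typing import Dict


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStorage:
    def __init__(self) -> None:
        self.highest_token: Dict[str, int] = {}  # resource -> highest token seen
        self.data: Dict[str, str] = {}

    def write(self, resource: str, value: str, fence_token: int) -> None:
        seen = self.highest_token.get(resource, 0)
        if fence_token < seen:
            # A newer lock holder has already written; reject the stale one.
            raise StaleTokenError(f"token {fence_token} < {seen}")
        self.highest_token[resource] = fence_token
        self.data[resource] = value


storage = FencedStorage()

# Node A acquired the lock with token 33, then stalled (e.g. a long GC pause).
# Its lock expired, Node B acquired it with token 34 and wrote first.
storage.write("report", "from-B", fence_token=34)

try:
    storage.write("report", "from-A", fence_token=33)  # Node A wakes up late
    rejected = False
except StaleTokenError:
    rejected = True

print(rejected, storage.data["report"])
```

Node A still believes it holds the lock, but the storage layer does not have to trust that belief: the token ordering alone decides which write is accepted.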

Connecting to Previous Weeks

Building on Week 1-4 Foundations

CONCEPT EVOLUTION

Week 1 (Foundations):
├── Partitioning: Data spread across nodes
├── Replication: Copies for availability
└── Question raised: How do replicas stay in sync?

Week 2 (Failure-First):
├── Idempotency: Safe retries
├── Timeouts: Bounding uncertainty
└── Question raised: What if operation partially succeeds?

Week 3 (Messaging):
├── At-least-once: Messages may duplicate
├── Transactional outbox: Atomic publish
└── Question raised: How to maintain order across services?

Week 4 (Caching):
├── Eventual consistency: Stale data acceptable
├── Invalidation: Propagating changes
└── Question raised: When is eventual NOT enough?

Week 5 (This Week):
└── ANSWERS all these questions!
    ├── How replicas stay in sync → Consistency models
    ├── Partial success handling → Saga pattern
    ├── Order across services → Orchestration
    └── When eventual isn't enough → Strong consistency

Key Concepts to Internalize

The Consistency Spectrum

CONSISTENCY MODELS SPECTRUM

STRONG                                           EVENTUAL
CONSISTENCY                                      CONSISTENCY
    │                                                │
    ▼                                                ▼
┌────────┬─────────────┬────────────┬───────────┬────────┐
│Lineari-│ Sequential  │  Causal    │ Read-your-│Eventual│
│zable   │ Consistency │ Consistency│  writes   │        │
└────────┴─────────────┴────────────┴───────────┴────────┘
    │          │             │            │          │
    │          │             │            │          │
  "One         "All         "Cause      "See       "Eventually
   copy"       see same     before      your own   converges"
               order"       effect"     writes"

Cost:   HIGH ←────────────────────────────────→ LOW
Speed:  SLOW ←────────────────────────────────→ FAST
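To make one point on the spectrum concrete, read-your-writes can be approximated by having each session remember the version of its last write and only reading from a replica that has caught up to it. This is a minimal in-memory sketch; `Node`, `Session`, and the version counter are assumptions for illustration, not a real replication protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    data: Dict[str, str] = field(default_factory=dict)
    version: int = 0  # number of writes applied so far


@dataclass
class Session:
    last_written_version: int = 0


def write(primary: Node, session: Session, key: str, value: str) -> None:
    primary.data[key] = value
    primary.version += 1
    session.last_written_version = primary.version


def replicate(primary: Node, replica: Node) -> None:
    """Stand-in for asynchronous replication catching up."""
    replica.data = dict(primary.data)
    replica.version = primary.version


def read(primary: Node, replica: Node, session: Session, key: str) -> Optional[str]:
    # Use the replica only if it already contains this session's writes.
    caught_up = replica.version >= session.last_written_version
    source = replica if caught_up else primary
    return source.data.get(key)


primary, replica = Node(), Node()
alice, bob = Session(), Session()

write(primary, alice, "name", "new")
alice_read = read(primary, replica, alice, "name")  # falls back to primary
bob_read = read(primary, replica, bob, "name")      # stale replica is allowed

replicate(primary, replica)
bob_after = read(primary, replica, bob, "name")
print(alice_read, bob_read, bob_after)
```

Alice always sees her own write, while Bob gets only eventual consistency until replication catches up. The guarantee is per session, which is why it is cheaper than linearizability.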

The CAP Theorem (Practical View)

CAP IN PRACTICE

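You don't "choose 2 of 3" once at design time.
Partitions will happen, so the real choice is between consistency
and availability during a partition, and you make it per operation.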

Example: E-commerce platform

Operation: Browse products
├── Consistency needed: Low (eventual OK)
├── Availability needed: High (must work)
└── Decision: AP (cache reads, tolerate stale)

Operation: Place order
├── Consistency needed: High (no oversell)
├── Availability needed: Medium (brief unavailability OK)
└── Decision: CP (strong consistency, accept latency)

Operation: View order history
├── Consistency needed: Medium (read-your-writes)
├── Availability needed: High
└── Decision: Causal consistency
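One lightweight way to encode per-operation decisions like these is a policy table that the data-access layer consults. The sketch below is illustrative only: the operation names, the `Consistency` enum, and the read-source labels are assumptions, not an established API.

```python
from enum import Enum


class Consistency(Enum):
    EVENTUAL = "eventual"
    READ_YOUR_WRITES = "read-your-writes"
    STRONG = "strong"


# Hypothetical policy table mirroring the three operations above.
POLICY = {
    "browse_products": Consistency.EVENTUAL,
    "view_order_history": Consistency.READ_YOUR_WRITES,
    "place_order": Consistency.STRONG,
}

READ_SOURCE = {
    Consistency.EVENTUAL: "cache",
    Consistency.READ_YOUR_WRITES: "replica-if-caught-up",
    Consistency.STRONG: "primary",
}


def read_source(operation: str) -> str:
    """Pick where to read from; unknown operations get the safest option."""
    level = POLICY.get(operation, Consistency.STRONG)
    return READ_SOURCE[level]


print(read_source("browse_products"))
print(read_source("place_order"))
```

Defaulting unknown operations to the strongest level trades some latency for safety: a new endpoint is slow until someone decides it can tolerate stale reads, rather than wrong until someone notices.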

The Saga Pattern

SAGA vs 2PC

2PC (Two-Phase Commit):
├── Coordinator asks: "Ready to commit?"
├── All participants respond: "Yes"
├── Coordinator says: "Commit!"
├── All participants commit
│
├── Problem: Participants hold locks until the coordinator decides
├── Problem: Coordinator failure after "prepare" leaves participants in doubt
├── Problem: Blocking protocol, everyone waits for the slowest participant
└── Result: Rarely used in microservices

SAGA:
├── Execute T1 (has compensating C1)
├── Execute T2 (has compensating C2)
├── Execute T3 (has compensating C3)
│
├── If T3 fails:
│   ├── Execute C2 (undo T2)
│   └── Execute C1 (undo T1)
│
├── Advantage: No distributed locks
├── Advantage: Each step is independent
└── Result: Standard pattern in microservices
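The T1/T2/T3 flow above can be sketched as a small generic runner that records completed steps and, on failure, runs their compensations in reverse order. This is a simplified illustration: `run_saga` and the step tuple format are assumptions, and it does not handle a compensation that itself fails.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps newest-first."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            break
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return log


def noop() -> None:
    pass


def fail() -> None:
    raise RuntimeError("T3 failed")


log = run_saga([
    ("T1", noop, noop),
    ("T2", noop, noop),
    ("T3", fail, noop),
])
print(log)
```

Unlike a database rollback, the intermediate states (T1 and T2 applied) were briefly visible to other services. Compensations restore business-level correctness, not isolation.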

Interview Patterns This Week

Common Questions You'll Answer

CONSISTENCY QUESTIONS

Q: "How do you prevent overselling inventory?"
A: "Use optimistic concurrency with version checks, or
   pessimistic locking with SELECT FOR UPDATE. The choice
   depends on contention levels..."

Q: "How would you implement a money transfer across banks?"
A: "I'd use the Saga pattern with compensation. Debit first,
   then credit. If credit fails, issue compensating refund..."

Q: "What happens if your distributed lock holder crashes?"
A: "The lock has a TTL and will expire. But the dangerous
   case is if the holder is just slow — that's why we use
   fencing tokens..."

Q: "How do you handle conflicts in offline-first apps?"
A: "Depends on the data. For counters, I'd use a CRDT like
   G-Counter. For sets, an OR-Set. For arbitrary data,
   we might need application-level merge logic..."

Q: "When would you use strong consistency over eventual?"
A: "For financial transactions, inventory reservations,
   unique constraint enforcement. Anywhere that stale data
   leads to business-critical errors..."
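For the counter case mentioned in the offline-first answer, a G-Counter (grow-only counter) is small enough to sketch in full. Each replica increments only its own slot, and merge takes the per-replica maximum. The class layout and replica ids below are assumptions for illustration.

```python
from typing import Dict


class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge by element-wise max."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


phone = GCounter("phone")
laptop = GCounter("laptop")
phone.increment(3)    # three likes recorded offline on the phone
laptop.increment(2)   # two likes recorded offline on the laptop

phone.merge(laptop)
laptop.merge(phone)
laptop.merge(phone)   # merging again changes nothing (idempotent)
print(phone.value(), laptop.value())
```

Taking the maximum rather than the sum during merge is what makes repeated or out-of-order syncs safe, since each replica's slot only ever grows.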

Preparing for the Week

Mindset Shift

FROM                              TO
────                              ──
"Make it fast"           →        "Make it correct, then fast"
"Cache everything"       →        "Cache what can be stale"
"Retry on failure"       →        "Compensate on failure"
"One database"           →        "Consistency across services"
"Lock and update"        →        "Optimistic concurrency"

Key Questions to Keep in Mind

As you go through each day, ask:

  1. What consistency level does this operation need?
  2. What happens if this operation partially succeeds?
  3. How do we detect and resolve conflicts?
  4. What's the compensation strategy if we need to rollback?
  5. Is there a single leader, or can multiple nodes act?

Quick Reference: The Week's Tools

Concept              Tools/Technologies
───────              ──────────────────
Consistency          PostgreSQL isolation levels, CockroachDB, Spanner
Sagas                Temporal, Cadence, AWS Step Functions, custom
Coordination         ZooKeeper, etcd, Consul, Redis (limited)
CRDTs                Riak, Redis CRDT types, Automerge, Yjs
Distributed Locks    Redlock, ZooKeeper recipes, etcd leases

Daily Time Investment

RECOMMENDED SCHEDULE

Day 1: Consistency Models (90 minutes)
├── Core reading: 45 min
├── Implementation review: 30 min
└── Practice problems: 15 min

Day 2: Saga Pattern (90 minutes)
├── Core reading: 45 min
├── Build simple saga: 30 min
└── Interview practice: 15 min

Day 3: Orchestration (90 minutes)
├── Core reading: 40 min
├── Temporal patterns: 35 min
└── Compare approaches: 15 min

Day 4: Conflict Resolution (90 minutes)
├── Core reading: 45 min
├── CRDT implementation: 30 min
└── Practice problems: 15 min

Day 5: Leader Election (90 minutes)
├── Core reading: 45 min
├── Distributed lock patterns: 30 min
└── Failure scenario analysis: 15 min

What Success Looks Like

By the end of Week 5, you should be able to:

WEEK 5 SUCCESS CRITERIA

□ Explain the trade-offs between consistency models
□ Design a Saga for any multi-service transaction
□ Implement compensation logic for rollback scenarios
□ Choose between choreography and orchestration
□ Understand when to use CRDTs vs manual merge
□ Implement a distributed lock with fencing tokens
□ Explain split-brain and how to prevent it
□ Design systems that degrade gracefully under partition
□ Answer common consistency-related interview questions

Let's Begin

Week 5 tackles the hardest problems in distributed systems. These concepts separate junior engineers who "make it work" from senior engineers who "make it correct."

After this week, you'll understand:

  • Why distributed systems are fundamentally different from single-machine systems
  • How to reason about consistency requirements for any operation
  • The patterns that make complex distributed transactions possible
  • How to coordinate multiple nodes safely

Day 1 starts with the foundation: Consistency Models. We'll learn what "consistent" actually means in distributed systems, and why it's more nuanced than you might think.


Ready to master consistency? Let's begin with Day 1: Consistency Models — understanding what consistency really means in distributed systems.