Himanshu Kukreja

Week 0 — Part 6: System Design Terminology & Theory

System Design Mastery Series


Preface

System design interviews are filled with jargon. Interviewers throw around terms like "CAP theorem," "eventual consistency," "idempotent," and "horizontal scaling" expecting you to not just know what they mean, but to reason about their tradeoffs.

Interviewer: "How would you ensure consistency in this distributed system?"

Weak answer: "I'd use a database that's consistent."

Strong answer: "We need to decide what consistency level we need. For user 
balances, we need strong consistency — linearizable reads. For the activity 
feed, eventual consistency is fine — users can tolerate seeing posts a few 
seconds late. I'd use synchronous replication for the balance table and 
async replication for the feed, accepting the CAP tradeoff that during a 
partition, the balance service might become unavailable while the feed 
remains available with stale data."

This document is your glossary and theory guide — every term you'll encounter in system design, explained with examples and interview context.


Part I: Fundamental Theorems and Trade-offs

Chapter 1: CAP Theorem

1.1 The Three Properties

CAP Theorem (Brewer's Theorem) states that in a distributed system, you can only guarantee TWO of three properties:

C - CONSISTENCY Every read receives the most recent write or an error. All nodes see the same data at the same time.

A - AVAILABILITY Every request receives a response (not an error). The system is always operational.

P - PARTITION TOLERANCE The system continues to operate despite network partitions. Messages between nodes can be lost or delayed.

The Reality: You Don't "Choose 2 of 3"

In distributed systems, network partitions WILL happen. P is not optional — it's a given.

So the real choice is: During a partition, do you want:

  • CP: Consistency (refuse requests until partition heals)
  • AP: Availability (serve requests but may be stale/inconsistent)

1.2 CAP in Practice

Partition Scenario:

  Node A                              Node B
    |                                    |
    |   [Network Partition]              |
    |   xxxxxxxxxxxxxxxxxxxxxxxx         |
    |                                    |
  Client writes                    Client reads
  value = 100                      value = ???

CP System (Consistency + Partition Tolerance): Node B refuses to answer until it can confirm with Node A. Client gets an error or timeout. "Sorry, I can't guarantee this data is current."

Examples: ZooKeeper, etcd, HBase (when configured for consistency), Traditional RDBMS with synchronous replication

AP System (Availability + Partition Tolerance): Node B returns its last known value (might be stale). Client gets a response (possibly outdated). "Here's what I have, but it might not be the latest."

Examples: Cassandra (tunable, default AP), DynamoDB (tunable), CouchDB, DNS

CA System (Consistency + Availability): Only possible without partitions — meaning single node! Not relevant for distributed systems.

Examples: Single-node PostgreSQL, Single-node MySQL

1.3 Interview Explanation

INTERVIEW QUESTION: "Explain CAP theorem and how it applies to your design."

STRONG ANSWER:

"CAP theorem states that in a distributed system experiencing a network 
partition, you must choose between consistency and availability.

For our user profile service, I'd choose AP — availability over consistency.
Here's why:

1. User experience: Showing slightly stale profile data is better than 
   showing an error.

2. Nature of data: Profile updates are infrequent. If a user updates their 
   bio and another user sees the old bio for a few seconds, it's acceptable.

3. Conflict resolution: For concurrent updates, we can use last-write-wins 
   with timestamps, or version vectors for more complex merging.

However, for the payment service, I'd choose CP — we cannot allow 
inconsistent account balances. Users would rather see an error than 
have their money disappear or duplicate.

The key is: different parts of the system can make different CAP trade-offs 
based on their requirements."

Chapter 2: PACELC Theorem

2.1 Extending CAP

PACELC: What Happens When There's NO Partition?

CAP only addresses partition scenarios. PACELC extends it:

if (Partition) {
    choose: Availability OR Consistency    // Same as CAP
} else {
    choose: Latency OR Consistency         // New insight!
}

P A C E L C
| | | | | |
| | | | | Consistency (else)
| | | | Latency (else)
| | | Else (no partition)
| | Consistency (partition)
| Availability (partition)
Partition

The Key Insight: Even without partitions, there's a trade-off:

  • Strong consistency requires coordination (adds latency)
  • Low latency requires relaxed consistency

Example: Synchronous replication

  • Write to primary
  • Wait for ALL replicas to confirm
  • Then return success to client

Consistent? Yes. Fast? No — you wait for the slowest replica.

2.2 PACELC Classifications

System          Partition (PA/PC)   Else (EL/EC)
DynamoDB        PA (Available)      EL (Low latency)
Cassandra       PA (Available)      EL (Low latency)
MongoDB         PC (Consistent)     EC (Consistent)
HBase           PC (Consistent)     EC (Consistent)
PNUTS (Yahoo)   PC (Consistent)     EL (Low latency)

PA/EL Systems (DynamoDB, Cassandra):

  • Always try to be fast and available
  • Accept eventual consistency
  • Great for: user sessions, product catalogs, social feeds

PC/EC Systems (MongoDB, HBase):

  • Always try to be consistent
  • Accept higher latency and potential unavailability
  • Great for: financial data, inventory counts, user accounts

Chapter 3: ACID Properties

3.1 The Four Properties

ACID: Guarantees for Database Transactions

A - ATOMICITY All or nothing. Transaction either fully completes or fully rolls back. No partial updates.

Example: Transfer $100 from Account A to Account B

  • Debit A: $100
  • Credit B: $100

Both happen, or neither happens. Never just one.

C - CONSISTENCY Database moves from one valid state to another. All constraints, triggers, and rules are satisfied.

Example: Account balance cannot be negative

  • If transfer would make balance negative, transaction fails
  • Database never contains invalid data

I - ISOLATION Concurrent transactions don't interfere with each other. Each transaction sees a consistent snapshot.

Example: Two transfers happening simultaneously

  • Transfer 1: A → B
  • Transfer 2: A → C

Each sees correct balance, neither corrupts the other.

D - DURABILITY Once committed, data survives crashes. Written to non-volatile storage.

Example: After "COMMIT" returns success

  • Server can crash
  • Power can fail
  • Data is still there after restart
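
Atomicity is easy to demonstrate with a transfer that must roll back. The sketch below is a minimal illustration using Python's built-in sqlite3; the accounts table, ids, and amounts are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('A', 1000), ('B', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # one transaction: commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
    except sqlite3.IntegrityError:
        return False  # CHECK constraint hit (balance would go negative); nothing was applied
    return True

print(transfer(conn, "A", "B", 100))   # True: debit and credit applied together
print(transfer(conn, "A", "B", 5000))  # False: debit violates the constraint, whole transfer rolled back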

3.2 Isolation Levels

Isolation Levels (Weakest to Strongest)

1. READ UNCOMMITTED Can read data that hasn't been committed yet.

Problem: Dirty Reads

  • Transaction A updates row (not committed)
  • Transaction B reads updated value
  • Transaction A rolls back
  • Transaction B has garbage data!

Use case: Almost never. Maybe analytics where accuracy doesn't matter.

2. READ COMMITTED Can only read committed data. But same query can return different results within transaction.

Problem: Non-Repeatable Reads

  • Transaction A reads row (value = 100)
  • Transaction B updates row to 200, commits
  • Transaction A reads same row again (value = 200)
  • Different values in same transaction!

Use case: Default in PostgreSQL, Oracle. Good for most applications.

3. REPEATABLE READ Same query always returns same result within transaction. But new rows can appear.

Problem: Phantom Reads

  • Transaction A: SELECT * WHERE price < 100 (returns 10 rows)
  • Transaction B: INSERT row with price = 50, commits
  • Transaction A: SELECT * WHERE price < 100 (returns 11 rows!)
  • New row appeared!

Use case: Default in MySQL. Financial reports within a transaction.

4. SERIALIZABLE Transactions execute as if they were serial (one after another). No anomalies possible.

Implementation: Locking or optimistic concurrency

Use case: Critical financial operations. Highest consistency, lowest throughput.

Comparison Table:

Level              Dirty Read   Non-Repeatable Read   Phantom Read
Read Uncommitted   Possible     Possible              Possible
Read Committed     No           Possible              Possible
Repeatable Read    No           No                    Possible
Serializable       No           No                    No

3.3 ACID Implementation

How Databases Achieve ACID

ATOMICITY: Write-Ahead Logging (WAL)

  1. Write intention to log FIRST
  2. Then modify actual data
  3. On crash: replay log to recover

Transaction starts
  → Write "BEGIN TX-123" to log
  → Write "UPDATE accounts SET balance=900 WHERE id=A" to log
  → Write "UPDATE accounts SET balance=1100 WHERE id=B" to log
  → Write "COMMIT TX-123" to log
  → Apply changes to data files

Crash before COMMIT? Transaction never happened.
Crash after COMMIT? Replay log on recovery.

ISOLATION: Locking or MVCC

Locking (Pessimistic):

  • Acquire locks before reading/writing
  • Other transactions wait
  • Simple but can cause contention

MVCC - Multi-Version Concurrency Control (Optimistic):

  • Keep multiple versions of each row
  • Each transaction sees snapshot at start time
  • Writers don't block readers
  • Used by: PostgreSQL, MySQL InnoDB, Oracle

DURABILITY: fsync to disk

After COMMIT:

  • Flush WAL to disk
  • fsync() ensures data hits persistent storage
  • Only then return success to client

Trade-off: fsync is slow (~10ms on HDD)

  • Batch commits (group commit)
  • Use fast storage (NVMe SSD)

Chapter 4: BASE Properties

4.1 BASE vs ACID

BASE: The Alternative to ACID

For distributed systems where ACID is too restrictive:

BA - BASICALLY AVAILABLE System guarantees availability (in CAP sense). Might return stale data, but always responds.

S - SOFT STATE State can change over time, even without input. Due to eventual consistency propagation.

E - EVENTUAL CONSISTENCY If no new updates, all replicas will eventually converge. No guarantee of when, but eventually consistent.

ACID vs BASE Comparison:

ACID                        BASE
Strong consistency          Eventual consistency
Pessimistic (lock early)    Optimistic (resolve later)
Complex transactions        Simple operations
Vertical scaling            Horizontal scaling
Traditional RDBMS           NoSQL databases

When to Use Each:

ACID (Traditional databases):

  • Financial transactions
  • Inventory management
  • User account management
  • Any place where data integrity is critical

BASE (Distributed databases):

  • Social media feeds
  • Product catalogs
  • Session stores
  • Analytics data
  • Any place where eventual consistency is acceptable

4.2 Eventual Consistency in Practice

Example: Social Media Post

User posts status update in US data center:

t=0ms:   User clicks "Post"
t=10ms:  US data center saves post
t=10ms:  User sees their post (read-your-writes)
t=50ms:  Replication to EU data center starts
t=150ms: EU data center receives post

During t=10ms to t=150ms:
  - US users see the post
  - EU users don't see the post yet
  - System is inconsistent but available

After t=150ms:
  - All users see the post
  - System is eventually consistent

Consistency Guarantees (Strongest to Weakest):

  1. STRONG CONSISTENCY - After write completes, all reads see new value.

  2. READ-YOUR-WRITES - You always see your own writes. Others might not see them yet.

  3. MONOTONIC READS - Once you see a value, you never see older values. Time doesn't go backward for you.

  4. MONOTONIC WRITES - Your writes are applied in order. Write A before Write B → A applied before B.

  5. EVENTUAL CONSISTENCY - Eventually, all replicas converge. No timing guarantee.

Implementing Read-Your-Writes:

Option 1: Sticky sessions - Route user to same replica that handled their write.

Option 2: Read from primary - For user's own data, always read from primary.

Option 3: Version tracking - Client tracks version of last write. Only accept reads with version >= last write.
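
One way to implement Option 3 is for the client to remember the version (or log sequence number) returned by its last write and to accept replica reads only once they have caught up. A rough sketch, assuming hypothetical primary and replica clients that expose versions:

import time

class Session:
    """Read-your-writes via version tracking (primary/replica clients are hypothetical)."""
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.last_write_version = 0

    def write(self, key, value):
        # Assume the primary returns the version/LSN assigned to this write.
        self.last_write_version = self.primary.write(key, value)

    def read(self, key, timeout=1.0):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            value, version = self.replica.read_with_version(key)
            if version >= self.last_write_version:
                return value            # replica has caught up with our own writes
            time.sleep(0.01)            # replica is lagging; wait and retry
        return self.primary.read(key)   # fall back to the primary to preserve read-your-writes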


Chapter 5: Consistency Models

5.1 Linearizability

Linearizability (Atomic Consistency)

The strongest single-object consistency model. Operations appear to happen instantaneously at some point between invocation and response. (Strict serializability is the transactional analogue: serializability plus this real-time ordering.)

Every operation has a "linearization point" in real time.

Client A:  write(x=1) |-------|
Client B:              read(x) |---|  
Client C:                       write(x=2) |-------|
Client D:                                   read(x) |---|

Time →

Linearizable execution:
  write(x=1) completes
  read(x) returns 1 (must see the write)
  write(x=2) completes  
  read(x) returns 2

If read(x) started after write(x=1) completed,
it MUST return 1 or later value. Never 0.

Where It's Needed:

  • Distributed locks (must see latest lock state)
  • Leader election (must agree on leader)
  • Unique ID generation (no duplicates)

Cost:

  • Requires coordination
  • Higher latency
  • Lower throughput

5.2 Sequential Consistency

Sequential Consistency

Weaker than linearizability. All operations appear in SOME sequential order. Order is consistent with program order for each process. But doesn't have to match real-time order.

Client A: write(x=1), write(x=2)
Client B: read(x), read(x)

Valid sequential orderings:
  write(x=1) → write(x=2) → read(x)=2 → read(x)=2  ✓
  write(x=1) → read(x)=1 → write(x=2) → read(x)=2  ✓
  read(x)=0 → write(x=1) → write(x=2) → read(x)=2  ✓
  
Invalid:
  write(x=1) → write(x=2) → read(x)=2 → read(x)=1  ✗
  (Violates program order: read=2 then read=1?)
  
  write(x=2) → write(x=1) → ...  ✗
  (Violates program order: A wrote 1 before 2)

Difference from Linearizability:

  • Linearizable: real-time ordering matters
  • Sequential: only program order matters per process

5.3 Causal Consistency

Causal Consistency

Operations that are causally related must be seen in order. Concurrent operations can be seen in any order.

Causality: A "happens before" B if:

  1. A and B are in same process, A before B
  2. A is a write, B is a read that sees A
  3. Transitive: A→B and B→C means A→C

Example: Social Media Comments

Alice: posts "Hello!"           (A)
Bob:   sees Alice's post        (causally after A)
Bob:   comments "Hi Alice!"     (B, causally after A)
Carol: reads the thread

Carol MUST see:
  Alice's post before Bob's comment
  (Because B is causally dependent on A)

But two independent comments?
  Dave: comments "Nice!"
  Eve:  comments "Great!"
  
  Carol might see Dave before Eve, or Eve before Dave.
  These are concurrent, no causal relationship.

Implementation: Vector Clocks

Each node maintains vector of logical clocks: [Node1_time, Node2_time, Node3_time]

When sending message:

  • Increment own clock
  • Attach vector to message

When receiving:

  • Update vector: max(local[i], received[i]) for each i
  • Increment own clock

Compare vectors to determine causality:

  • V1 < V2 if all V1[i] <= V2[i] and at least one V1[i] < V2[i]
  • If neither V1 < V2 nor V2 < V1, they're concurrent
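
The bookkeeping and comparison rules above translate directly into code; a small sketch (node names and the example vectors are illustrative):

class VectorClock:
    def __init__(self, node, nodes):
        self.node = node
        self.clock = {n: 0 for n in nodes}

    def tick(self):                      # local event, or about to send: increment own entry
        self.clock[self.node] += 1
        return dict(self.clock)          # copy to attach to the outgoing message

    def receive(self, other):            # merge the received vector, then increment own entry
        for n, t in other.items():
            self.clock[n] = max(self.clock[n], t)
        self.clock[self.node] += 1

def happens_before(v1, v2):
    """True if v1 < v2: every component <= and at least one strictly <."""
    return all(v1[n] <= v2[n] for n in v1) and any(v1[n] < v2[n] for n in v1)

def concurrent(v1, v2):
    return not happens_before(v1, v2) and not happens_before(v2, v1)

# Two writes made without seeing each other are concurrent (a potential conflict):
a = {"A": 1, "B": 0}
b = {"A": 0, "B": 1}
print(concurrent(a, b))  # True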

Part II: Scaling Concepts

Chapter 6: Scaling Strategies

6.1 Vertical vs Horizontal Scaling

VERTICAL SCALING (Scale Up)

Add more power to existing machine:

  • More CPU cores
  • More RAM
  • Faster disks
  • Better network

Before: 4 cores, 16GB RAM, 500GB HDD
After:  64 cores, 512GB RAM, 4TB NVMe

Pros:

  • Simple — no code changes
  • No distributed system complexity
  • ACID transactions easy
  • Lower latency (no network hops)

Cons:

  • Hardware limits (can't buy infinite CPU)
  • Single point of failure
  • Expensive (high-end hardware costs exponentially more)
  • Downtime for upgrades

Use when:

  • Starting out (don't prematurely distribute)
  • Database with complex transactions
  • Workload fits on one machine

HORIZONTAL SCALING (Scale Out)

Add more machines:

Before:           After:
                  
  Server          Load Balancer
  handles all          |
  traffic         -----------
                  |    |    |
                 S1   S2   S3

Pros:

  • Near-infinite scaling
  • Fault tolerance (one fails, others continue)
  • Cost-effective (commodity hardware)
  • No downtime for scaling

Cons:

  • Distributed system complexity
  • Data consistency challenges
  • Network latency between nodes
  • More operational overhead

Use when:

  • Need high availability
  • Workload exceeds single machine
  • Traffic is unpredictable (auto-scale)

6.2 Stateless vs Stateful Services

STATELESS SERVICES

Server holds no client-specific state between requests. All state is external (database, cache, client).

Request 1 → Server A → Database
Request 2 → Server B → Database  (any server can handle)
Request 3 → Server C → Database

Characteristics:

  • Any server can handle any request
  • Easy to scale (add more servers)
  • Easy to recover (restart server, no data lost)
  • Load balancer can use round-robin

Examples: REST API servers, Web servers (serving static content), Serverless functions

STATEFUL SERVICES

Server holds state that's needed for subsequent requests.

Client A → Server 1 (has A's session)
Client B → Server 2 (has B's session)
Client A → must go to Server 1!

Characteristics:

  • Requests must route to specific server
  • Harder to scale (session affinity needed)
  • Failover is complex (state must be replicated)

Examples: WebSocket servers, Game servers, Database servers

Making Stateful Services Scalable:

Option 1: Externalize state - Store session in Redis/Memcached. Now servers are stateless.

Option 2: Sticky sessions - Load balancer routes same client to same server. Still have failover problems.

Option 3: State replication - Replicate state across servers. Complex but handles failover.

Option 4: Partition state - Each server owns subset of state. Route requests by partition key.


Chapter 7: Load Balancing

7.1 Load Balancing Algorithms

1. ROUND ROBIN
Requests distributed in order: 1→A, 2→B, 3→C, 4→A, 5→B...

Pros: Simple, even distribution
Cons: Ignores server capacity and current load
Use: Homogeneous servers, similar request costs

2. WEIGHTED ROUND ROBIN
Servers have weights: A(3), B(2), C(1)
Distribution: A,A,A,B,B,C,A,A,A,B,B,C...

Pros: Accounts for different server capacities
Cons: Static weights, doesn't adapt
Use: Mixed hardware environments

3. LEAST CONNECTIONS
Send to server with fewest active connections.

Pros: Adapts to actual load
Cons: Doesn't account for request complexity
Use: Long-lived connections, varying request times

4. WEIGHTED LEAST CONNECTIONS
Least connections, but weighted by server capacity.
Score = connections / weight. Send to lowest score.

Use: Mixed hardware with varying connection times

5. IP HASH
Hash client IP to determine server. Same client always goes to same server.

Pros: Session affinity without cookies
Cons: Uneven if clients have different request rates
Use: Stateful services, caching benefits

6. LEAST RESPONSE TIME
Send to server with fastest recent response times.

Pros: Accounts for actual performance
Cons: Needs continuous measurement
Use: When response time is critical

7. RANDOM
Randomly select a server.

Pros: Simple, stateless
Cons: Can be uneven short-term
Use: Large number of servers (law of large numbers)
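
To make the selection logic concrete, here is a small sketch of two of the policies above, round robin and least connections, over an in-memory server list; the server names and counters are illustrative.

import itertools

servers = ["s1", "s2", "s3"]

# Round robin: cycle through servers regardless of load
rr = itertools.cycle(servers)
def pick_round_robin():
    return next(rr)

# Least connections: track active connections per server
active = {"s1": 0, "s2": 0, "s3": 0}
def pick_least_connections():
    return min(active, key=active.get)

def handle_request():
    server = pick_least_connections()
    active[server] += 1      # connection opened
    try:
        pass                 # ... proxy the request to `server` ...
    finally:
        active[server] -= 1  # connection closed

print([pick_round_robin() for _ in range(5)])  # ['s1', 's2', 's3', 's1', 's2']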

7.2 Layer 4 vs Layer 7 Load Balancing

LAYER 4 (Transport Layer) LOAD BALANCING

Works at TCP/UDP level.
Sees: IP addresses, ports
Doesn't see: HTTP headers, URLs, cookies

How it works:

  1. Client connects to load balancer IP
  2. LB picks backend server
  3. Forwards TCP connection (NAT or DSR)
  4. All packets in connection go to same server

Pros:

  • Very fast (minimal processing)
  • Protocol agnostic (works with any TCP/UDP)
  • Low latency

Cons:

  • Can't route based on content
  • No SSL termination
  • Limited health checks

Examples: AWS NLB, HAProxy (TCP mode), LVS

LAYER 7 (Application Layer) LOAD BALANCING

Works at HTTP/HTTPS level. Sees: URLs, headers, cookies, body

How it works:

  1. Client connects to load balancer
  2. LB terminates HTTP connection
  3. LB inspects request (URL, headers)
  4. LB opens new connection to chosen backend
  5. Forwards request, returns response

Pros:

  • Content-based routing (URL, header)
  • SSL termination (offload from backends)
  • Request modification (add headers)
  • Better health checks (check HTTP response)
  • Caching, compression, rate limiting

Cons:

  • More CPU intensive
  • Higher latency
  • Must understand protocol

Examples: AWS ALB, nginx, HAProxy (HTTP mode), Envoy

Choosing Between Them:

Use L4 when:

  • Need raw performance
  • Non-HTTP protocols (databases, custom protocols)
  • SSL passthrough required

Use L7 when:

  • Need content-based routing
  • Want SSL termination
  • Need request inspection/modification
  • HTTP-specific features (websocket upgrade, gRPC)

Chapter 8: Sharding and Partitioning

8.1 Partitioning Strategies

HORIZONTAL PARTITIONING (Sharding)

Split rows across multiple databases. Each shard has same schema, different data.

Example: Users table with 100M rows

Shard 1: Users 1 - 25M
Shard 2: Users 25M - 50M
Shard 3: Users 50M - 75M
Shard 4: Users 75M - 100M

1. RANGE PARTITIONING

Partition by value ranges.

user_id 1-1M      → Shard A
user_id 1M-2M     → Shard B
user_id 2M-3M     → Shard C

Pros:

  • Range queries efficient (users 100-200 on same shard)
  • Easy to understand

Cons:

  • Hot spots (new users always hit latest shard)
  • Uneven distribution if ranges are unequal

2. HASH PARTITIONING

Partition by hash of key.

shard = hash(user_id) % num_shards

hash(1) % 4 = 1   → Shard B
hash(2) % 4 = 2   → Shard C
hash(3) % 4 = 3   → Shard D
hash(4) % 4 = 0   → Shard A

Pros:

  • Even distribution (if hash is good)
  • No hot spots

Cons:

  • Range queries hit all shards
  • Adding shards requires rehashing (use consistent hashing!)

3. DIRECTORY-BASED PARTITIONING

Lookup table maps keys to shards.

Lookup Service:
  user_1 → Shard A
  user_2 → Shard B
  user_3 → Shard A
  ...

Pros:

  • Flexible placement
  • Can move data easily

Cons:

  • Lookup service is bottleneck/SPOF
  • Extra hop for every query

4. GEOGRAPHIC PARTITIONING

Partition by location for data locality.

US users → US data center
EU users → EU data center
Asia users → Asia data center

Pros:

  • Low latency for users
  • Data sovereignty compliance

Cons:

  • Complex for global data
  • Cross-region queries expensive

8.2 Consistent Hashing

The Problem with Simple Hashing

Simple: shard = hash(key) % N

Add a server (N=4 → N=5):

hash(key) % 4 = 2  → Shard C
hash(key) % 5 = 3  → Shard D  (different!)

Almost ALL keys move to different shards! Massive data migration, cache invalidation.

Consistent Hashing

Imagine a ring (0 to 2^32):

          0
         /|\
        / | \
       /  |  \
    270   |   90
       \  |  /
        \ | /
         \|/
         180

Place servers on ring (by hashing server name):

  • hash("server-a") = 45 → position 45
  • hash("server-b") = 120 → position 120
  • hash("server-c") = 200 → position 200
  • hash("server-d") = 300 → position 300

Place keys on ring, assign to next server clockwise:

  • hash("user-123") = 100 → server-b (next after 100 is 120)
  • hash("user-456") = 250 → server-d (next after 250 is 300)

Adding a Server

Add server-e at position 150:

Before: Keys 120-200 → server-c
After:  Keys 120-150 → server-e
        Keys 150-200 → server-c

Only keys between server-b and server-e move! About 1/N of keys move instead of all keys.

Virtual Nodes

Problem: Servers might be unevenly distributed on ring.

Solution: Each physical server gets multiple positions.

server-a → positions 45, 160, 280 (3 virtual nodes)
server-b → positions 90, 200, 320 (3 virtual nodes)

More even distribution, and bigger servers can have more virtual nodes.
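
A minimal consistent-hash ring with virtual nodes can be built from a sorted list of hash positions and a binary search for the next position clockwise. The sketch below uses MD5 for placement and 100 virtual nodes per server; both are arbitrary choices for illustration, not a prescription.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = []                             # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                pos = self._hash(f"{node}#{i}")    # each virtual node gets its own position
                self.ring.append((pos, node))
        self.ring.sort()
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        pos = self._hash(key)
        idx = bisect.bisect(self.positions, pos)   # first virtual node clockwise from the key
        if idx == len(self.ring):                  # wrap around the ring
            idx = 0
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get_node("user-123"))   # the same key always maps to the same server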

Used By:

  • Amazon DynamoDB
  • Apache Cassandra
  • Memcached clients
  • Content Delivery Networks

Chapter 9: Replication

9.1 Replication Strategies

SINGLE-LEADER REPLICATION

One primary (leader), multiple secondaries (followers). Writes go to leader, replicate to followers. Reads can go to any node.

  Client
    |
    v
  Leader ----→ Follower 1
    |     |
    |     +--→ Follower 2
    |     |
    |     +--→ Follower 3
    |
  (writes)    (reads can go anywhere)

SYNCHRONOUS REPLICATION

Leader waits for followers to confirm before returning success.

Client → Leader → Followers (wait for ACK) → Success to Client

Pros:

  • Strong consistency
  • No data loss on leader failure

Cons:

  • High latency (wait for slowest follower)
  • Availability impact (follower down = writes blocked)

ASYNCHRONOUS REPLICATION

Leader returns success immediately, replicates in background.

Client → Leader → Success to Client
               ↓
         (async) Followers

Pros:

  • Low latency
  • High availability

Cons:

  • Potential data loss (leader fails before replication)
  • Followers may be behind (replication lag)

SEMI-SYNCHRONOUS REPLICATION

Wait for at least one follower to confirm. Leader confirms after 1+ followers ACK. Other followers replicate async.

Balance between consistency and performance. Used by: MySQL semi-sync replication

9.2 Multi-Leader Replication

MULTI-LEADER (Multi-Master) REPLICATION

Multiple nodes accept writes. Changes replicate between all leaders.

Leader A <-----> Leader B <-----> Leader C
   |                |                |
 Reads           Reads            Reads

Use cases:

  • Multi-datacenter (leader per datacenter)
  • Offline-capable applications (each device is a leader)
  • Collaborative editing

Conflict Resolution

What if two leaders modify same record simultaneously?

Leader A: UPDATE user SET name = 'Alice'
Leader B: UPDATE user SET name = 'Bob'

Conflict! Which one wins?

Strategies:

1. LAST-WRITE-WINS (LWW) Timestamp each write, highest timestamp wins.

Simple but can lose data. Time sync issues (clock skew).

2. VERSION VECTORS Track version per node. Detect conflicts, let application resolve.

User A: [A:1, B:0] name = 'Alice'
User B: [A:0, B:1] name = 'Bob'

Neither dominates → conflict!

3. MERGE Combine both values somehow.

For shopping cart: union of items
For counters: sum of increments
For text: operational transformation

4. CUSTOM RESOLUTION Application-specific logic.

"Most recent login location wins" "Merge address fields, flag for review"

9.3 Leaderless Replication

LEADERLESS REPLICATION

No single leader. Client writes to multiple nodes. Reads from multiple nodes, use quorum.

Client writes to N nodes
Client reads from N nodes

R + W > N ensures overlap (quorum)

N = 3 (total replicas)
W = 2 (write to 2 nodes)
R = 2 (read from 2 nodes)

W + R = 4 > 3 → guaranteed to read latest write

Example: Write x=5, then read

Write x=5 to nodes A and B (W=2)

Node A: x=5
Node B: x=5
Node C: x=1 (stale, missed write)

Read from nodes B and C (R=2)

Get x=5 from B, x=1 from C
Return x=5 (highest version)

Quorum Configuration

  • R=1, W=N: Fast reads, slow writes, read may be stale
  • R=N, W=1: Slow reads, fast writes, write may be lost
  • R=W=N/2+1: Balanced, typical configuration

N=3, R=2, W=2: Good balance
N=5, R=3, W=3: More fault tolerant
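
A toy quorum client makes the overlap argument concrete: with W=2 and R=2 out of N=3, the read set and write set must share at least one replica, so taking the highest version among the R responses returns the latest write. The in-memory Replica class below is just a stand-in for real storage nodes.

class Replica:
    def __init__(self):
        self.store = {}                       # key -> (version, value)

    def put(self, key, version, value):
        if key not in self.store or version > self.store[key][0]:
            self.store[key] = (version, value)

    def get(self, key):
        return self.store.get(key, (0, None))

N, W, R = 3, 2, 2
replicas = [Replica() for _ in range(N)]

def quorum_write(key, version, value):
    for r in replicas[:W]:                    # write lands on replicas 0 and 1
        r.put(key, version, value)

def quorum_read(key):
    responses = [r.get(key) for r in replicas[-R:]]   # read from replicas 1 and 2
    version, value = max(responses)                   # highest version wins
    return value

quorum_write("x", version=1, value=5)
print(quorum_read("x"))   # 5: replica 2 is stale, but replica 1 overlaps the write set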

Used By:

  • Amazon DynamoDB (configurable)
  • Apache Cassandra
  • Riak

Sloppy Quorum and Hinted Handoff

If designated node is down, write to another node temporarily. "Hint" stored that data belongs elsewhere. When original node recovers, data transferred back.

Improves availability but weakens consistency guarantee.


Part III: Data Concepts

Chapter 10: Data Storage Patterns

10.1 SQL vs NoSQL

SQL (Relational) DATABASES

Structured data with fixed schema. Tables, rows, columns. SQL query language. ACID transactions.

Examples: PostgreSQL, MySQL, Oracle, SQL Server

Best for:

  • Complex queries with JOINs
  • Transactions across multiple tables
  • Strong consistency requirements
  • Well-defined schema

NoSQL DATABASES

Various data models, flexible schema. Optimized for specific access patterns.

Types:

1. KEY-VALUE STORES
Simple: key → value
Examples: Redis, DynamoDB, Memcached
Use: Caching, session storage, simple lookups

2. DOCUMENT STORES
Key → JSON/BSON document
Examples: MongoDB, CouchDB, Firestore
Use: Content management, catalogs, user profiles

3. COLUMN-FAMILY (Wide Column)
Rows with dynamic columns, column families
Examples: Cassandra, HBase, ScyllaDB
Use: Time series, IoT, large-scale analytics

4. GRAPH DATABASES
Nodes and relationships
Examples: Neo4j, Amazon Neptune, JanusGraph
Use: Social networks, recommendation, fraud detection

Comparison:

Aspect           SQL                 NoSQL
Schema           Fixed, defined      Flexible, dynamic
Scaling          Vertical (mainly)   Horizontal (designed for)
Consistency      Strong (ACID)       Eventual (BASE), typically
Query Language   SQL                 Varies by database
JOINs            Native support      Usually application-side
Transactions     Multi-table         Usually single-document

10.2 Indexing

INDEX TYPES

1. B-TREE INDEX (Most Common)

Balanced tree structure. Good for: equality, range queries, sorting

Query: WHERE age > 25 AND age < 50

Without index: Scan all rows (O(n))
With B-tree:   Navigate tree (O(log n))

2. HASH INDEX

Hash table structure. Good for: exact equality only

Query: WHERE id = 12345

Hash id → find bucket → get value (O(1))

Cannot do: range queries, sorting

3. INVERTED INDEX (Full-Text Search)

Maps words to documents containing them.

"hello" → [doc1, doc3, doc7]
"world" → [doc1, doc5]

Query: "hello world"
Intersection: [doc1]

Used by: Elasticsearch, PostgreSQL full-text

4. BITMAP INDEX

Bit array for each value. Good for: low-cardinality columns

gender='M': 1,0,1,1,0,1,0,0...
gender='F': 0,1,0,0,1,0,1,1...

Fast for: COUNT, AND/OR operations

5. GEOSPATIAL INDEX (R-Tree, Quad-tree)

Indexes 2D/3D coordinates.

Query: Find all restaurants within 5km

Used by: PostGIS, MongoDB, Redis

COMPOSITE INDEX

Index on multiple columns. Order matters!

INDEX (country, city, street)

Efficient for:
  WHERE country = 'US'
  WHERE country = 'US' AND city = 'NYC'
  WHERE country = 'US' AND city = 'NYC' AND street = 'Broadway'

NOT efficient for:
  WHERE city = 'NYC'  (leftmost column not used)
  WHERE country = 'US' AND street = 'Broadway'  (gap in columns)

COVERING INDEX

Index contains all columns needed for query. No need to access table data.

INDEX (user_id, name, email)

Query: SELECT name, email WHERE user_id = 123

Index has all data → no table lookup → faster!

10.3 Write-Ahead Log and LSM Trees

WRITE-AHEAD LOG (WAL)

Before modifying data, write to append-only log. Used for durability and crash recovery.

Write path:

  1. Append operation to WAL (sequential write, fast)
  2. fsync WAL to disk
  3. Apply to in-memory data structures
  4. Periodically flush to data files (checkpoint)

Recovery:

  1. Load last checkpoint
  2. Replay WAL from checkpoint
  3. System is consistent

LSM TREE (Log-Structured Merge Tree)

Optimized for write-heavy workloads. Used by: LevelDB, RocksDB, Cassandra, HBase

Structure:

MemTable (in-memory, sorted)
    |
    | (flush when full)
    v
Level 0 SSTables (Sorted String Tables on disk)
    |
    | (compaction merges files)
    v
Level 1 SSTables
    |
    v
Level N SSTables

Write path:

  1. Write to MemTable (memory, fast)
  2. When MemTable full, flush to Level 0 SSTable
  3. Background compaction merges and sorts levels

Read path:

  1. Check MemTable
  2. Check each level (newest to oldest)
  3. Use bloom filters to skip files without key

Pros:

  • Excellent write performance (sequential I/O)
  • Compression friendly

Cons:

  • Read amplification (check multiple levels)
  • Write amplification (compaction rewrites data)
  • Space amplification (data exists in multiple levels)

Chapter 11: Caching Concepts

11.1 Cache Patterns

CACHING STRATEGIES

1. CACHE-ASIDE (Lazy Loading)

Application manages cache explicitly.

Read:

data = cache.get(key)
if data is None:
    data = db.query(key)
    cache.set(key, data)
return data

Write:

db.update(key, data)
cache.delete(key)  # Invalidate

Pros: Only requested data is cached
Cons: Cache miss penalty, potential stale data

2. READ-THROUGH

Cache sits in front of database. Cache handles misses automatically.

Read:

data = cache.get(key)  # Cache fetches from DB on miss

Pros: Simpler application code
Cons: Cache library must know how to query DB

3. WRITE-THROUGH

Writes go to cache, cache writes to database. Synchronous — returns after both succeed.

Write:

cache.set(key, data)  # Also writes to DB

Pros: Cache always consistent with DB
Cons: Higher write latency

4. WRITE-BEHIND (Write-Back)

Write to cache, asynchronously write to database.

Write:

cache.set(key, data)  # Returns immediately
# Background process writes to DB later

Pros: Very fast writes
Cons: Data loss risk if cache fails before flush

5. REFRESH-AHEAD

Proactively refresh cache before expiration.

If item accessed and TTL < threshold: Async refresh in background

Pros: Reduces cache miss latency
Cons: May refresh unused data

11.2 Cache Eviction Policies

EVICTION POLICIES

When cache is full, which item to remove?

1. LRU (Least Recently Used)

Remove item that hasn't been accessed longest.

Access order: A, B, C, D, A, B Evict: C (least recently used)

Pros: Good general-purpose policy
Cons: One-time scans pollute cache
(A minimal LRU implementation sketch appears at the end of this section.)

Used by: Most caches by default

2. LFU (Least Frequently Used)

Remove item with fewest accesses.

Access counts: A(10), B(2), C(5), D(1) Evict: D (least frequent)

Pros: Keeps popular items
Cons: Old popular items never evicted

Variant: LFU with aging (decay counts over time)

3. FIFO (First In, First Out)

Remove oldest item.

Insert order: A, B, C, D Evict: A (first inserted)

Pros: Simple to implement
Cons: Doesn't consider access patterns

4. TTL (Time To Live)

Items expire after fixed time.

cache.set(key, value, ttl=300)  # Expires in 5 minutes

Pros: Bounds staleness
Cons: Popular items may expire unnecessarily

5. RANDOM

Randomly select item to evict.

Pros: Simple, no metadata overhead
Cons: May evict important items

Real-World Combinations:

  • Redis: Multiple policies (LRU, LFU, TTL, random)
  • Memcached: LRU with slab allocation
  • CPU caches: Pseudo-LRU (approximate for speed)
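
The LRU policy (item 1 above) maps naturally onto Python's OrderedDict; a minimal sketch, with an illustrative capacity of 4 and the access pattern from the earlier example:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.items = OrderedDict()          # insertion order doubles as recency order

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=4)
for k in ["A", "B", "C", "D"]:
    cache.set(k, k)
cache.get("A")        # A and B become the most recently used
cache.get("B")
cache.set("E", "E")   # cache is full: C, the least recently used, is evicted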

11.3 Cache Problems

CACHE PROBLEMS AND SOLUTIONS

1. CACHE STAMPEDE (Thundering Herd)

Problem:

  • Popular key expires
  • 100 requests arrive simultaneously
  • All miss cache, all query database
  • Database overwhelmed

Solutions:

a) Locking: Only one request fetches, others wait

if not cache.get(key):
    if cache.acquire_lock(key):   # only one request wins the lock and hits the database
        data = db.query()
        cache.set(key, data)
        cache.release_lock(key)
    else:
        wait_and_retry()          # everyone else backs off and re-reads the cache

b) Probabilistic early refresh: Refresh before expiration with some probability. Spreads refresh load over time.

2. CACHE PENETRATION

Problem:

  • Query for non-existent key (e.g., user ID 999999999)
  • Cache always misses
  • Database queried every time
  • Attacker can DoS with fake keys

Solutions:

a) Cache null values:

if db.query(key) is None:
    cache.set(key, None, ttl=60)  # short-lived negative cache entry

b) Bloom filter: Check bloom filter before database. If not in bloom filter, definitely doesn't exist.
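
A bloom filter can be hand-rolled from a bit array and a few salted hashes; a negative answer means the key definitely does not exist, so those requests never reach the cache or the database. The bit count and hash count below are illustrative, not tuned.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1_000_000, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

existing_user_ids = BloomFilter()
existing_user_ids.add("user-42")

def get_user(user_id):
    if not existing_user_ids.might_contain(user_id):
        return None    # definitely absent: never touches the cache or the database
    # otherwise fall through to the normal cache-aside lookup (omitted)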

3. CACHE BREAKDOWN

Problem:

  • Single hot key expires
  • Massive traffic hits database

Solutions:

a) Never expire hot keys
b) Background refresh before expiration
c) Mutex lock (only one request refreshes)

4. HOT KEY

Problem:

  • One key gets disproportionate traffic
  • Single cache node overwhelmed

Solutions:

a) Local caching: Cache hot keys in application memory

b) Key replication: Store the same key on multiple nodes. key_1, key_2, key_3 all store the same data but hash to different nodes.

c) Read replicas: Multiple cache replicas for reads


Part IV: Messaging Concepts

Chapter 12: Message Queues and Streams

12.1 Queue vs Stream

MESSAGE QUEUE (e.g., RabbitMQ, SQS)

Message consumed by ONE consumer, then deleted.

Producer → [Queue] → Consumer

Messages: A, B, C, D
Consumer 1 takes: A, C
Consumer 2 takes: B, D

After consumption, messages are gone.

Characteristics:

  • Competing consumers (work distribution)
  • At-least-once delivery (with acknowledgment)
  • No replay (message deleted after consume)
  • Good for: Task queues, work distribution

MESSAGE STREAM (e.g., Kafka, Kinesis)

Messages retained, multiple consumers can read.

Producer → [Stream/Log] → Consumer Group A
                       → Consumer Group B

Messages: A, B, C, D (stored with offsets)
Consumer Group A: reads A, B, C, D at their pace
Consumer Group B: reads A, B, C, D independently

Messages retained for configured period.

Characteristics:

  • Multiple consumer groups (fan-out)
  • Replay capability (rewind to past offset)
  • Ordered within partition
  • Good for: Event sourcing, real-time analytics, audit logs

Comparison:

Feature           Queue               Stream
Consumption       Destructive         Non-destructive
Consumer groups   Competing           Independent
Replay            No                  Yes
Ordering          Per-queue           Per-partition
Retention         Until consumed      Time/size based
Use case          Task distribution   Event streaming

12.2 Delivery Guarantees

DELIVERY SEMANTICS

1. AT-MOST-ONCE

Message may be lost, never delivered twice.

Producer sends → No acknowledgment. If lost, gone forever.

Use case: Metrics, logs where loss is acceptable
Trade-off: Fastest, least reliable

2. AT-LEAST-ONCE

Message will be delivered, possibly multiple times.

Producer sends → Wait for ACK. If no ACK, retry.
May deliver twice if the ACK is lost but the message was received.

Use case: Most applications (with idempotent consumers)
Trade-off: Reliable but needs deduplication

3. EXACTLY-ONCE

Message delivered exactly once.

Very hard to achieve in distributed systems!

Approaches:

a) Idempotent producer + idempotent consumer
   Producer retries with same ID.
   Consumer tracks processed IDs, ignores duplicates.

b) Transactional messaging
   Kafka transactions: atomic produce + consume.

Use case: Financial transactions, inventory
Trade-off: Most complex, some performance cost

Implementing Exactly-Once (Practical):

Producer side (idempotent):

message_id = generate_unique_id()
send(message_id, payload)  # Retry with same ID

Consumer side (idempotent):

if message_id in processed_set:
    return  # Already processed
process(message)
processed_set.add(message_id)

With transactional outbox:

BEGIN TRANSACTION
  INSERT INTO outbox (message)
  UPDATE business_table
COMMIT

-- Separate process reads outbox, publishes to queue
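
The relay mentioned in the last line can be a simple polling loop. The sketch below assumes hypothetical db and queue clients and an outbox table with id, message, and published columns; note it provides at-least-once publishing, so consumers still need to deduplicate.

import time

def publish_outbox(db, queue, batch_size=100):
    """Relay loop: read unpublished outbox rows and push them to the message queue."""
    while True:
        rows = db.query(
            "SELECT id, message FROM outbox WHERE published = false ORDER BY id LIMIT %s",
            (batch_size,),
        )
        for row in rows:
            queue.send(row["message"])   # at-least-once: a crash here may cause a resend
            db.execute("UPDATE outbox SET published = true WHERE id = %s", (row["id"],))
        time.sleep(1)                    # idle wait between polls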

Chapter 13: Event-Driven Architecture

13.1 Event Sourcing

EVENT SOURCING

Store events, not current state. Current state = replay all events.

Traditional (State-based):

Account table:
  id=123, balance=1000, last_updated=2024-01-15

Problems:
  - Lost history of how we got here
  - Can't answer "what was balance on Jan 1?"

Event Sourcing:

Events table:
  AccountCreated(id=123, initial=0)
  Deposited(id=123, amount=500)
  Deposited(id=123, amount=700)
  Withdrawn(id=123, amount=200)

Current balance:
  0 + 500 + 700 - 200 = 1000

Balance on Jan 1:
  Replay events up to Jan 1

Benefits:

  • Complete audit trail
  • Time travel (state at any point)
  • Debugging (replay events to reproduce bugs)
  • Event replay (rebuild read models, fix bugs)

Challenges:

  • Event schema evolution
  • Rebuild time for long event streams
  • Snapshots needed for performance

CQRS (Command Query Responsibility Segregation)

Separate write model (events) from read model (projections).

Commands → Write Model (Event Store)
                 |
                 | (project events)
                 v
Queries  → Read Model (Optimized for queries)

Write: Append event to event store
Read: Query from denormalized read model

Read model can be:

  • Relational database with joins pre-computed
  • Search index (Elasticsearch)
  • Cache (Redis)
  • Multiple different views of same data

13.2 Saga Pattern

SAGA PATTERN

Manage distributed transactions across services. Instead of ACID, use sequence of local transactions with compensations.

Example: Order Placement

1. Order Service: Create order (PENDING)
2. Payment Service: Charge customer
3. Inventory Service: Reserve items
4. Shipping Service: Schedule delivery
5. Order Service: Mark order (CONFIRMED)

If step 3 fails (out of stock):
  - Compensate step 2: Refund payment
  - Compensate step 1: Cancel order

Saga Execution Styles

1. CHOREOGRAPHY

Each service listens for events, decides what to do.

Order Created → Payment listens → Payment Charged event
Payment Charged → Inventory listens → Items Reserved event
Items Reserved → Shipping listens → Delivery Scheduled event

Pros: Decoupled, simple for few steps
Cons: Hard to track, distributed logic

2. ORCHESTRATION

Central orchestrator controls the flow.

Orchestrator:
  1. Tell Order: Create order
  2. Tell Payment: Charge
  3. Tell Inventory: Reserve
  4. Tell Shipping: Schedule
  5. Tell Order: Confirm

If failure:
  Execute compensations in reverse order

Pros: Clear flow, easy to track/debug
Cons: Orchestrator is potential bottleneck

Saga Compensations

Each action needs a compensation:

Action              Compensation
Create order        Cancel order
Charge payment      Refund payment
Reserve inventory   Release inventory
Schedule shipping   Cancel shipping

Compensations must be idempotent! May be called multiple times.
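
An orchestrated saga can be expressed as a list of (action, compensation) pairs, undoing completed steps in reverse order on failure. In the usage example, the service clients (orders, payments, inventory, shipping) and order_id are hypothetical.

def run_saga(steps):
    """steps: list of (action, compensation) callables. Undo completed steps in reverse on failure."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()        # compensations must be idempotent: safe to retry
        raise

# Hypothetical service clients; each compensation reverses its action.
run_saga([
    (lambda: orders.create(order_id),      lambda: orders.cancel(order_id)),
    (lambda: payments.charge(order_id),    lambda: payments.refund(order_id)),
    (lambda: inventory.reserve(order_id),  lambda: inventory.release(order_id)),
    (lambda: shipping.schedule(order_id),  lambda: shipping.cancel(order_id)),
    (lambda: orders.confirm(order_id),     lambda: None),
])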


Part V: Reliability Concepts

Chapter 14: Fault Tolerance

14.1 Failure Types

FAILURE MODES

1. CRASH FAILURE

Node stops and doesn't respond.

Detection: Heartbeat timeout
Recovery: Failover to replica

Example: Process killed, server power off

2. OMISSION FAILURE

Node fails to send or receive messages.

Send omission: Node doesn't send response
Receive omission: Node doesn't receive request

Often indistinguishable from crash.

3. TIMING FAILURE

Response outside acceptable time window.

Example: Response takes 30 seconds instead of 100ms

For caller: Looks like failure
Challenge: Did the request succeed or not?

4. BYZANTINE FAILURE

Node behaves arbitrarily (including maliciously).

Example: Node sends different data to different nodes

Hardest to handle. Requires 3f+1 nodes to tolerate f failures.

Relevant for: Blockchain, hostile environments

Detection Methods:

Heartbeat:

  • Node periodically sends "I'm alive" message.
  • Missing heartbeats → suspected failure.
  • Challenge: Network partition vs actual failure

Timeout:

  • Request times out → suspected failure.
  • Challenge: Slow vs failed

Lease:

  • Node holds lease (timed permission).
  • Must renew before expiration.
  • No renewal → other nodes can take over.
  • Used by: Distributed locks, leader election

14.2 Redundancy Patterns

REDUNDANCY PATTERNS

1. ACTIVE-PASSIVE (Hot Standby)

Primary handles all traffic. Standby replicates data, takes over on failure.

Primary ◄═══════════════► Standby
(active)    replication    (passive)
   ↑
traffic

Failover: Promote standby to primary

Pros: Simple, standby has consistent data
Cons: Standby resources underutilized

2. ACTIVE-ACTIVE

Multiple nodes handle traffic simultaneously.

Node A ◄════════════► Node B
(active)  sync/async  (active)
   ↑                     ↑
traffic               traffic

Pros: Better resource utilization, no failover needed
Cons: Conflict resolution, data consistency

3. N+1 REDUNDANCY

N nodes handle normal load. +1 spare for failures.

Example: 3 nodes needed for traffic, deploy 4. One can fail without impact.

4. N+2 REDUNDANCY

One failure + one for maintenance.

Example: 3+2 = 5 nodes. Handles one failure while another is being updated.

5. 2N REDUNDANCY

Double everything.

Example: Need 3 nodes, deploy 6. Half can fail without impact.

Used for: Critical systems, disaster recovery

14.3 Circuit Breaker Pattern

CIRCUIT BREAKER

Prevent cascading failures by failing fast.

States:

  • CLOSED: Normal operation, requests pass through
  • OPEN: Failure threshold exceeded, requests fail immediately
  • HALF-OPEN: Test if downstream recovered

State Machine:

CLOSED ─────────────────────► OPEN
        (failures > threshold)
           
OPEN ─────────────────────────► HALF-OPEN
        (timeout expires)
           
HALF-OPEN ────────────────────► CLOSED
        (test request succeeds)
           
HALF-OPEN ────────────────────► OPEN
        (test request fails)

Implementation:

import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = threshold
        self.timeout = timeout            # seconds to stay OPEN before probing again
        self.state = CLOSED
        self.last_failure_time = None

    def call(self, func):
        if self.state == OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = HALF_OPEN    # let one test request through
            else:
                raise CircuitOpenError()  # fail fast, don't touch the downstream

        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = OPEN

BULKHEAD PATTERN

Isolate failures to prevent spreading. Like ship bulkheads that contain flooding.

Service A ─────► [Pool 1] ─────► Dependency X
Service A ─────► [Pool 2] ─────► Dependency Y

Pool 1 exhausted (X is slow)? Pool 2 still works (Y unaffected).

Implementation:

  • Separate thread pools per dependency
  • Separate connection pools
  • Separate rate limits

Chapter 15: SLIs, SLOs, and SLAs

15.1 Definitions

SERVICE LEVEL TERMINOLOGY

SLI (Service Level Indicator) A metric that measures service behavior.

Examples:

  • Request latency (p99 < 100ms)
  • Error rate (errors / total requests)
  • Availability (uptime / total time)
  • Throughput (requests per second)

SLO (Service Level Objective) Target value for an SLI. Internal goal.

Examples:

  • "99.9% of requests complete in < 100ms"
  • "Error rate < 0.1%"
  • "99.95% availability"

SLA (Service Level Agreement) Contract with consequences. External commitment.

Examples:

  • "99.9% uptime or customer gets credit"
  • "Response within 4 hours or penalty"

SLA should be less strict than SLO! (Buffer for unexpected issues)

Relationship:

SLI ──measures──► SLO ──commits──► SLA

SLI: "Our p99 latency is 85ms"
SLO: "We aim for p99 < 100ms"
SLA: "We guarantee p99 < 200ms" (with penalties)

15.2 Error Budgets

ERROR BUDGET

If SLO is 99.9% availability: Error budget = 0.1% downtime allowed

Per month (30 days): 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime allowed

Error Budget Policy:

  • Budget remaining → Ship fast, take risks
  • Budget exhausted → Focus on reliability, freeze features

Budget at 80%: Normal development
Budget at 50%: Increased caution
Budget at 20%: Reliability focus
Budget at 0%:  Feature freeze, fix reliability

Calculating Error Budget:

Monthly error budget for 99.9% availability:

Total minutes: 30 × 24 × 60 = 43,200 minutes
Allowed downtime: 43,200 × 0.001 = 43.2 minutes

Incidents this month:
  - Deploy failure: 15 minutes
  - Database issue: 10 minutes
  
Budget used: 25 minutes
Budget remaining: 18.2 minutes (42% left)
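
The arithmetic above is easy to wrap in a small helper; this sketch reproduces the 99.9% example.

def error_budget_minutes(slo, days=30):
    """Allowed downtime (minutes) for a given availability SLO over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

budget = error_budget_minutes(0.999)   # 43.2 minutes for 99.9% over 30 days
used = 15 + 10                         # deploy failure + database issue
remaining = budget - used
print(f"{remaining:.1f} minutes left ({remaining / budget:.0%} of budget)")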

Monitoring Error Budget:

Track:

  • Current error budget consumption rate
  • Projected budget at end of period
  • Alert when consumption rate too high

If burning budget faster than expected:

  • Investigate
  • Maybe pause risky deployments
  • Focus on reliability

Summary

Key Concepts at a Glance

THEOREMS AND TRADE-OFFS:

CAP Theorem:

  • Consistency, Availability, Partition tolerance
  • During partition: choose C or A
  • CP: refuse requests if can't guarantee consistency
  • AP: serve requests with possibly stale data

PACELC:

  • Extends CAP: what about when no partition?
  • EL: prioritize low latency
  • EC: prioritize consistency

ACID (Databases):

  • Atomicity: all or nothing
  • Consistency: valid state to valid state
  • Isolation: transactions don't interfere
  • Durability: committed = permanent

BASE (Distributed):

  • Basically Available
  • Soft state
  • Eventually consistent

CONSISTENCY MODELS (Strong to Weak):

Linearizability → Sequential → Causal → Eventual

SCALING:

  • Vertical: bigger machine (simple, limited)
  • Horizontal: more machines (complex, unlimited)
  • Stateless: easy to scale, any server works
  • Stateful: needs session affinity or externalized state

SHARDING:

  • Range: predictable, hot spots possible
  • Hash: even distribution, no range queries
  • Consistent hashing: minimal redistribution on changes

REPLICATION:

  • Single-leader: writes to one, reads from many
  • Multi-leader: writes anywhere, conflict resolution needed
  • Leaderless: quorum reads/writes (R + W > N)
  • Sync: consistent but slow
  • Async: fast but possible data loss

CACHING:

  • Patterns: cache-aside, read-through, write-through, write-behind
  • Eviction: LRU, LFU, TTL
  • Problems: stampede, penetration, breakdown, hot keys

MESSAGING:

  • Queue: one consumer, message deleted
  • Stream: multiple consumers, replay possible
  • Delivery: at-most-once, at-least-once, exactly-once
  • Saga: distributed transactions with compensations

RELIABILITY:

  • Circuit breaker: fail fast when downstream is down
  • Bulkhead: isolate failures
  • Redundancy: N+1, N+2, 2N
  • SLI: metric (latency, error rate)
  • SLO: target (99.9% availability)
  • SLA: contract (penalties for breach)
  • Error budget: allowed failure within SLO

📚 Further Reading

Books

  • Designing Data-Intensive Applications by Martin Kleppmann
    • The definitive guide to distributed systems concepts
  • Database Internals by Alex Petrov
    • Deep dive into storage engines and distributed databases
  • Site Reliability Engineering by Google
    • SLIs, SLOs, error budgets, and operational practices


End of Week 0 — Part 6: System Design Terminology and Theory

Week 0 Complete! You now have a solid foundation in:

  • System design methodology (Part 1)
  • Infrastructure building blocks (Part 2)
  • Back-of-envelope estimation (Part 3)
  • Networking fundamentals (Part 4)
  • Operating system concepts (Part 5)
  • Theory and terminology (Part 6)

Ready to dive into Week 1: Foundations of Scale — where we apply these concepts to real system designs!