Week 0 — Part 6: System Design Terminology & Theory
System Design Mastery Series
Preface
System design interviews are filled with jargon. Interviewers throw around terms like "CAP theorem," "eventual consistency," "idempotent," and "horizontal scaling" expecting you to not just know what they mean, but to reason about their tradeoffs.
Interviewer: "How would you ensure consistency in this distributed system?"
Weak answer: "I'd use a database that's consistent."
Strong answer: "We need to decide what consistency level we need. For user
balances, we need strong consistency — linearizable reads. For the activity
feed, eventual consistency is fine — users can tolerate seeing posts a few
seconds late. I'd use synchronous replication for the balance table and
async replication for the feed, accepting the CAP tradeoff that during a
partition, the balance service might become unavailable while the feed
remains available with stale data."
This document is your glossary and theory guide — every term you'll encounter in system design, explained with examples and interview context.
Part I: Fundamental Theorems and Trade-offs
Chapter 1: CAP Theorem
1.1 The Three Properties
CAP Theorem (Brewer's Theorem) states that in a distributed system, you can only guarantee TWO of three properties:
C - CONSISTENCY Every read receives the most recent write or an error. All nodes see the same data at the same time.
A - AVAILABILITY Every request receives a response (not an error). The system is always operational.
P - PARTITION TOLERANCE The system continues to operate despite network partitions. Messages between nodes can be lost or delayed.
The Reality: You Don't "Choose 2 of 3"
In distributed systems, network partitions WILL happen. P is not optional — it's a given.
So the real choice is: During a partition, do you want:
- CP: Consistency (refuse requests until partition heals)
- AP: Availability (serve requests but may be stale/inconsistent)
1.2 CAP in Practice
Partition Scenario:
    Node A                          Node B
      |                               |
      |     [Network Partition]       |
      |   xxxxxxxxxxxxxxxxxxxxxxxx    |
      |                               |
Client writes                    Client reads
value = 100                      value = ???
CP System (Consistency + Partition Tolerance): Node B refuses to answer until it can confirm with Node A. Client gets an error or timeout. "Sorry, I can't guarantee this data is current."
Examples: ZooKeeper, etcd, HBase (when configured for consistency), Traditional RDBMS with synchronous replication
AP System (Availability + Partition Tolerance): Node B returns its last known value (might be stale). Client gets a response (possibly outdated). "Here's what I have, but it might not be the latest."
Examples: Cassandra (tunable, default AP), DynamoDB (tunable), CouchDB, DNS
CA System (Consistency + Availability): Only possible without partitions — meaning single node! Not relevant for distributed systems.
Examples: Single-node PostgreSQL, Single-node MySQL
1.3 Interview Explanation
INTERVIEW QUESTION: "Explain CAP theorem and how it applies to your design."
STRONG ANSWER:
"CAP theorem states that in a distributed system experiencing a network
partition, you must choose between consistency and availability.
For our user profile service, I'd choose AP — availability over consistency.
Here's why:
1. User experience: Showing slightly stale profile data is better than
showing an error.
2. Nature of data: Profile updates are infrequent. If a user updates their
bio and another user sees the old bio for a few seconds, it's acceptable.
3. Conflict resolution: For concurrent updates, we can use last-write-wins
with timestamps, or version vectors for more complex merging.
However, for the payment service, I'd choose CP — we cannot allow
inconsistent account balances. Users would rather see an error than
have their money disappear or duplicate.
The key is: different parts of the system can make different CAP trade-offs
based on their requirements."
Chapter 2: PACELC Theorem
2.1 Extending CAP
PACELC: What Happens When There's NO Partition?
CAP only addresses partition scenarios. PACELC extends it:
if (Partition) {
    choose: Availability OR Consistency   // Same as CAP
} else {
    choose: Latency OR Consistency        // New insight!
}
P  A  C  E  L  C
|  |  |  |  |  |
|  |  |  |  |  +-- Consistency (else)
|  |  |  |  +----- Latency (else)
|  |  |  +-------- Else (no partition)
|  |  +----------- Consistency (partition)
|  +-------------- Availability (partition)
+----------------- Partition
The Key Insight: Even without partitions, there's a trade-off:
- Strong consistency requires coordination (adds latency)
- Low latency requires relaxed consistency
Example: Synchronous replication
- Write to primary
- Wait for ALL replicas to confirm
- Then return success to client
Consistent? Yes. Fast? No — you wait for the slowest replica.
2.2 PACELC Classifications
| System | Partition (PA/PC) | Else (EL/EC) |
|---|---|---|
| DynamoDB | PA (Available) | EL (Low latency) |
| Cassandra | PA (Available) | EL (Low latency) |
| MongoDB | PC (Consistent) | EC (Consistent) |
| HBase | PC (Consistent) | EC (Consistent) |
| PNUTS (Yahoo) | PC (Consistent) | EL (Low latency) |
PA/EL Systems (DynamoDB, Cassandra):
- Always try to be fast and available
- Accept eventual consistency
- Great for: user sessions, product catalogs, social feeds
PC/EC Systems (MongoDB, HBase):
- Always try to be consistent
- Accept higher latency and potential unavailability
- Great for: financial data, inventory counts, user accounts
Chapter 3: ACID Properties
3.1 The Four Properties
ACID: Guarantees for Database Transactions
A - ATOMICITY All or nothing. Transaction either fully completes or fully rolls back. No partial updates.
Example: Transfer $100 from Account A to Account B
- Debit A: $100
- Credit B: $100
Both happen, or neither happens. Never just one.
C - CONSISTENCY Database moves from one valid state to another. All constraints, triggers, and rules are satisfied.
Example: Account balance cannot be negative
- If transfer would make balance negative, transaction fails
- Database never contains invalid data
I - ISOLATION Concurrent transactions don't interfere with each other. Each transaction sees a consistent snapshot.
Example: Two transfers happening simultaneously
- Transfer 1: A → B
- Transfer 2: A → C
Each sees the correct balance; neither corrupts the other.
D - DURABILITY Once committed, data survives crashes. Written to non-volatile storage.
Example: After "COMMIT" returns success
- Server can crash
- Power can fail
- Data is still there after restart
3.2 Isolation Levels
Isolation Levels (Weakest to Strongest)
1. READ UNCOMMITTED Can read data that hasn't been committed yet.
Problem: Dirty Reads
- Transaction A updates row (not committed)
- Transaction B reads updated value
- Transaction A rolls back
- Transaction B has garbage data!
Use case: Almost never. Maybe analytics where accuracy doesn't matter.
2. READ COMMITTED Can only read committed data. But same query can return different results within transaction.
Problem: Non-Repeatable Reads
- Transaction A reads row (value = 100)
- Transaction B updates row to 200, commits
- Transaction A reads same row again (value = 200)
- Different values in same transaction!
Use case: Default in PostgreSQL, Oracle. Good for most applications.
3. REPEATABLE READ Same query always returns same result within transaction. But new rows can appear.
Problem: Phantom Reads
- Transaction A: SELECT * WHERE price < 100 (returns 10 rows)
- Transaction B: INSERT row with price = 50, commits
- Transaction A: SELECT * WHERE price < 100 (returns 11 rows!)
- New row appeared!
Use case: Default in MySQL. Financial reports within a transaction.
4. SERIALIZABLE Transactions execute as if they were serial (one after another). No anomalies possible.
Implementation: Locking or optimistic concurrency
Use case: Critical financial operations. Highest consistency, lowest throughput.
Comparison Table:
| Level | Dirty Read | Non-Repeatable | Phantom |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | No | Possible | Possible |
| Repeatable Read | No | No | Possible |
| Serializable | No | No | No |
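In practice you pick the isolation level per transaction. A minimal sketch, assuming a PostgreSQL database reachable through psycopg2 — the DSN, table, and account id are illustrative:

import psycopg2

conn = psycopg2.connect("dbname=shop user=app")   # hypothetical DSN
try:
    with conn:                                    # one transaction; commits on success
        with conn.cursor() as cur:
            # Must be issued before any query in this transaction.
            cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
            cur.execute("SELECT balance FROM accounts WHERE id = %s", (123,))
            first = cur.fetchone()
            # Under REPEATABLE READ, re-reading returns the same snapshot,
            # even if another session commits an update in between.
            cur.execute("SELECT balance FROM accounts WHERE id = %s", (123,))
            assert cur.fetchone() == first
finally:
    conn.close()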
3.3 ACID Implementation
How Databases Achieve ACID
ATOMICITY: Write-Ahead Logging (WAL)
- Write intention to log FIRST
- Then modify actual data
- On crash: replay log to recover
Transaction starts
→ Write "BEGIN TX-123" to log
→ Write "UPDATE accounts SET balance=900 WHERE id=A" to log
→ Write "UPDATE accounts SET balance=1100 WHERE id=B" to log
→ Write "COMMIT TX-123" to log
→ Apply changes to data files
Crash before COMMIT? Transaction never happened.
Crash after COMMIT? Replay log on recovery.
ISOLATION: Locking or MVCC
Locking (Pessimistic):
- Acquire locks before reading/writing
- Other transactions wait
- Simple but can cause contention
MVCC - Multi-Version Concurrency Control (Optimistic):
- Keep multiple versions of each row
- Each transaction sees snapshot at start time
- Writers don't block readers
- Used by: PostgreSQL, MySQL InnoDB, Oracle
DURABILITY: fsync to disk
After COMMIT:
- Flush WAL to disk
- fsync() ensures data hits persistent storage
- Only then return success to client
Trade-off: fsync is slow (~10ms on HDD)
- Batch commits (group commit)
- Use fast storage (NVMe SSD)
Chapter 4: BASE Properties
4.1 BASE vs ACID
BASE: The Alternative to ACID
For distributed systems where ACID is too restrictive:
BA - BASICALLY AVAILABLE System guarantees availability (in CAP sense). Might return stale data, but always responds.
S - SOFT STATE State can change over time, even without input. Due to eventual consistency propagation.
E - EVENTUAL CONSISTENCY If no new updates, all replicas will eventually converge. No guarantee of when, but eventually consistent.
ACID vs BASE Comparison:
| ACID | BASE |
|---|---|
| Strong consistency | Eventual consistency |
| Pessimistic (lock early) | Optimistic (resolve later) |
| Complex transactions | Simple operations |
| Vertical scaling | Horizontal scaling |
| Traditional RDBMS | NoSQL databases |
When to Use Each:
ACID (Traditional databases):
- Financial transactions
- Inventory management
- User account management
- Any place where data integrity is critical
BASE (Distributed databases):
- Social media feeds
- Product catalogs
- Session stores
- Analytics data
- Any place where eventual consistency is acceptable
4.2 Eventual Consistency in Practice
Example: Social Media Post
User posts status update in US data center:
t=0ms: User clicks "Post"
t=10ms: US data center saves post
t=10ms: User sees their post (read-your-writes)
t=50ms: Replication to EU data center starts
t=150ms: EU data center receives post
During t=10ms to t=150ms:
- US users see the post
- EU users don't see the post yet
- System is inconsistent but available
After t=150ms:
- All users see the post
- System is eventually consistent
Consistency Guarantees (Strongest to Weakest):
1. STRONG CONSISTENCY - After write completes, all reads see new value.
2. READ-YOUR-WRITES - You always see your own writes. Others might not see them yet.
3. MONOTONIC READS - Once you see a value, you never see older values. Time doesn't go backward for you.
4. MONOTONIC WRITES - Your writes are applied in order. Write A before Write B → A applied before B.
5. EVENTUAL CONSISTENCY - Eventually, all replicas converge. No timing guarantee.
Implementing Read-Your-Writes:
Option 1: Sticky sessions - Route user to same replica that handled their write.
Option 2: Read from primary - For user's own data, always read from primary.
Option 3: Version tracking - Client tracks version of last write. Only accept reads with version >= last write.
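A minimal sketch of Option 3, version tracking. The replicas and primary here are plain dicts mapping key → (value, version) — a stand-in for a real client API:

def read_own_writes(replicas, primary, key, last_written_version):
    """Return a value at least as fresh as the client's own last write."""
    for replica in replicas:
        value, version = replica[key]
        if version >= last_written_version:
            return value                  # fresh enough: includes our own write
    value, _ = primary[key]               # every replica is lagging: fall back to the primary
    return value

replicas = [{"bio": ("old bio", 3)}, {"bio": ("new bio", 4)}]
primary  =  {"bio": ("new bio", 4)}
print(read_own_writes(replicas, primary, "bio", last_written_version=4))  # "new bio"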
Chapter 5: Consistency Models
5.1 Linearizability
Linearizability (Atomic Consistency)
The strongest consistency model. Operations appear to happen instantaneously at some point between invocation and response.
Every operation has a "linearization point" in real time.
Client A: write(x=1)  |-------|
Client B: read(x)                |---|
Client C: write(x=2)                   |-------|
Client D: read(x)                                |---|
                                           Time →
Linearizable execution:
write(x=1) completes
read(x) returns 1 (must see the write)
write(x=2) completes
read(x) returns 2
If read(x) started after write(x=1) completed,
it MUST return 1 or later value. Never 0.
Where It's Needed:
- Distributed locks (must see latest lock state)
- Leader election (must agree on leader)
- Unique ID generation (no duplicates)
Cost:
- Requires coordination
- Higher latency
- Lower throughput
5.2 Sequential Consistency
Sequential Consistency
Weaker than linearizability. All operations appear in SOME sequential order. Order is consistent with program order for each process. But doesn't have to match real-time order.
Client A: write(x=1), write(x=2)
Client B: read(x), read(x)
Valid sequential orderings:
write(x=1) → write(x=2) → read(x)=2 → read(x)=2 ✓
write(x=1) → read(x)=1 → write(x=2) → read(x)=2 ✓
read(x)=0 → write(x=1) → write(x=2) → read(x)=2 ✓
Invalid:
write(x=1) → write(x=2) → read(x)=2 → read(x)=1 ✗
(Once a read returns 2, a later read by the same process cannot return the older value 1)
write(x=2) → write(x=1) → ... ✗
(Violates program order: A wrote 1 before 2)
Difference from Linearizability:
- Linearizable: real-time ordering matters
- Sequential: only program order matters per process
5.3 Causal Consistency
Causal Consistency
Operations that are causally related must be seen in order. Concurrent operations can be seen in any order.
Causality: A "happens before" B if:
- A and B are in same process, A before B
- A is a write, B is a read that sees A
- Transitive: A→B and B→C means A→C
Example: Social Media Comments
Alice: posts "Hello!" (A)
Bob: sees Alice's post (causally after A)
Bob: comments "Hi Alice!" (B, causally after A)
Carol: reads the thread
Carol MUST see:
Alice's post before Bob's comment
(Because B is causally dependent on A)
But two independent comments?
Dave: comments "Nice!"
Eve: comments "Great!"
Carol might see Dave before Eve, or Eve before Dave.
These are concurrent, no causal relationship.
Implementation: Vector Clocks
Each node maintains vector of logical clocks: [Node1_time, Node2_time, Node3_time]
When sending message:
- Increment own clock
- Attach vector to message
When receiving:
- Update vector: max(local[i], received[i]) for each i
- Increment own clock
Compare vectors to determine causality:
- V1 < V2 if all V1[i] <= V2[i] and at least one V1[i] < V2[i]
- If neither V1 < V2 nor V2 < V1, they're concurrent
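A minimal vector-clock sketch following the rules above; the node IDs and dict-based representation are illustrative:

class VectorClock:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        """Before a local event or send: increment own entry."""
        self.clock[self.node_id] += 1
        return dict(self.clock)            # attach this copy to the outgoing message

    def receive(self, incoming):
        """On receive: element-wise max with the incoming vector, then increment own entry."""
        for n, t in incoming.items():
            self.clock[n] = max(self.clock.get(n, 0), t)
        self.clock[self.node_id] += 1

def happens_before(v1, v2):
    """True if v1 causally precedes v2 (V1 < V2)."""
    keys = set(v1) | set(v2)
    return all(v1.get(n, 0) <= v2.get(n, 0) for n in keys) and v1 != v2

def concurrent(v1, v2):
    return not happens_before(v1, v2) and not happens_before(v2, v1)

print(concurrent({"A": 1, "B": 0}, {"A": 0, "B": 1}))   # True — neither dominates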
Part II: Scaling Concepts
Chapter 6: Scaling Strategies
6.1 Vertical vs Horizontal Scaling
VERTICAL SCALING (Scale Up)
Add more power to existing machine:
- More CPU cores
- More RAM
- Faster disks
- Better network
Before: 4 cores, 16GB RAM, 500GB HDD
After: 64 cores, 512GB RAM, 4TB NVMe
Pros:
- Simple — no code changes
- No distributed system complexity
- ACID transactions easy
- Lower latency (no network hops)
Cons:
- Hardware limits (can't buy infinite CPU)
- Single point of failure
- Expensive (high-end hardware costs exponentially more)
- Downtime for upgrades
Use when:
- Starting out (don't prematurely distribute)
- Database with complex transactions
- Workload fits on one machine
HORIZONTAL SCALING (Scale Out)
Add more machines:
Before:                     After:
                            Load Balancer
  Server                         |
  handles all            +-------+-------+
  traffic                |       |       |
                         S1      S2      S3
Pros:
- Near-infinite scaling
- Fault tolerance (one fails, others continue)
- Cost-effective (commodity hardware)
- No downtime for scaling
Cons:
- Distributed system complexity
- Data consistency challenges
- Network latency between nodes
- More operational overhead
Use when:
- Need high availability
- Workload exceeds single machine
- Traffic is unpredictable (auto-scale)
6.2 Stateless vs Stateful Services
STATELESS SERVICES
Server holds no client-specific state between requests. All state is external (database, cache, client).
Request 1 → Server A → Database
Request 2 → Server B → Database (any server can handle)
Request 3 → Server C → Database
Characteristics:
- Any server can handle any request
- Easy to scale (add more servers)
- Easy to recover (restart server, no data lost)
- Load balancer can use round-robin
Examples: REST API servers, Web servers (serving static content), Serverless functions
STATEFUL SERVICES
Server holds state that's needed for subsequent requests.
Client A → Server 1 (has A's session)
Client B → Server 2 (has B's session)
Client A → must go to Server 1!
Characteristics:
- Requests must route to specific server
- Harder to scale (session affinity needed)
- Failover is complex (state must be replicated)
Examples: WebSocket servers, Game servers, Database servers
Making Stateful Services Scalable:
Option 1: Externalize state - Store session in Redis/Memcached. Now servers are stateless.
Option 2: Sticky sessions - Load balancer routes same client to same server. Still have failover problems.
Option 3: State replication - Replicate state across servers. Complex but handles failover.
Option 4: Partition state - Each server owns subset of state. Route requests by partition key.
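A minimal sketch of Option 1 above — externalizing session state to Redis so any app server can serve any request. Assumes the redis-py client and a Redis instance at localhost; key names and TTL are illustrative:

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def save_session(session_id, data, ttl_seconds=1800):
    # Any server can write the session; it expires automatically after the TTL.
    r.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

def load_session(session_id):
    # Any server can read it back — the app servers themselves stay stateless.
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None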
Chapter 7: Load Balancing
7.1 Load Balancing Algorithms
1. ROUND ROBIN
Requests distributed in order: 1→A, 2→B, 3→C, 4→A, 5→B...
Pros: Simple, even distribution
Cons: Ignores server capacity and current load
Use: Homogeneous servers, similar request costs
2. WEIGHTED ROUND ROBIN
Servers have weights: A(3), B(2), C(1)
Distribution: A,A,A,B,B,C,A,A,A,B,B,C...
Pros: Accounts for different server capacities
Cons: Static weights, doesn't adapt
Use: Mixed hardware environments
3. LEAST CONNECTIONS
Send to server with fewest active connections.
Pros: Adapts to actual load
Cons: Doesn't account for request complexity
Use: Long-lived connections, varying request times
4. WEIGHTED LEAST CONNECTIONS
Least connections, but weighted by server capacity.
Score = connections / weight. Send to lowest score.
Use: Mixed hardware with varying connection times
5. IP HASH
Hash client IP to determine server. Same client always goes to same server.
Pros: Session affinity without cookies
Cons: Uneven if clients have different request rates
Use: Stateful services, caching benefits
6. LEAST RESPONSE TIME
Send to server with fastest recent response times.
Pros: Accounts for actual performance
Cons: Needs continuous measurement
Use: When response time is critical
7. RANDOM
Randomly select a server.
Pros: Simple, stateless
Cons: Can be uneven short-term
Use: Large number of servers (law of large numbers)
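Three of these algorithms as tiny sketches; the server names, weights, and in-memory connection counts are illustrative:

import itertools
import random

servers = ["s1", "s2", "s3"]

# 1. Round robin: cycle through servers in order.
rr = itertools.cycle(servers)
def round_robin():
    return next(rr)

# 3. Least connections: pick the server with the fewest active connections.
active_connections = {"s1": 4, "s2": 2, "s3": 7}
def least_connections():
    return min(active_connections, key=active_connections.get)

# 7. Random (weighted variant): pick with probability proportional to capacity.
weights = {"s1": 3, "s2": 2, "s3": 1}
def weighted_random():
    return random.choices(list(weights), weights=list(weights.values()))[0]

print(round_robin(), least_connections(), weighted_random())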
7.2 Layer 4 vs Layer 7 Load Balancing
LAYER 4 (Transport Layer) LOAD BALANCING
Works at TCP/UDP level.
Sees: IP addresses, ports
Doesn't see: HTTP headers, URLs, cookies
How it works:
- Client connects to load balancer IP
- LB picks backend server
- Forwards TCP connection (NAT or DSR)
- All packets in connection go to same server
Pros:
- Very fast (minimal processing)
- Protocol agnostic (works with any TCP/UDP)
- Low latency
Cons:
- Can't route based on content
- No SSL termination
- Limited health checks
Examples: AWS NLB, HAProxy (TCP mode), LVS
LAYER 7 (Application Layer) LOAD BALANCING
Works at HTTP/HTTPS level. Sees: URLs, headers, cookies, body
How it works:
- Client connects to load balancer
- LB terminates HTTP connection
- LB inspects request (URL, headers)
- LB opens new connection to chosen backend
- Forwards request, returns response
Pros:
- Content-based routing (URL, header)
- SSL termination (offload from backends)
- Request modification (add headers)
- Better health checks (check HTTP response)
- Caching, compression, rate limiting
Cons:
- More CPU intensive
- Higher latency
- Must understand protocol
Examples: AWS ALB, nginx, HAProxy (HTTP mode), Envoy
Choosing Between Them:
Use L4 when:
- Need raw performance
- Non-HTTP protocols (databases, custom protocols)
- SSL passthrough required
Use L7 when:
- Need content-based routing
- Want SSL termination
- Need request inspection/modification
- HTTP-specific features (websocket upgrade, gRPC)
Chapter 8: Sharding and Partitioning
8.1 Partitioning Strategies
HORIZONTAL PARTITIONING (Sharding)
Split rows across multiple databases. Each shard has same schema, different data.
Example: Users table with 100M rows
Shard 1: Users 1 - 25M
Shard 2: Users 25M - 50M
Shard 3: Users 50M - 75M
Shard 4: Users 75M - 100M
1. RANGE PARTITIONING
Partition by value ranges.
user_id 1-1M → Shard A
user_id 1M-2M → Shard B
user_id 2M-3M → Shard C
Pros:
- Range queries efficient (users 100-200 on same shard)
- Easy to understand
Cons:
- Hot spots (new users always hit latest shard)
- Uneven distribution if ranges are unequal
2. HASH PARTITIONING
Partition by hash of key.
shard = hash(user_id) % num_shards
hash(1) % 4 = 1 → Shard B
hash(2) % 4 = 2 → Shard C
hash(3) % 4 = 3 → Shard D
hash(4) % 4 = 0 → Shard A
Pros:
- Even distribution (if hash is good)
- No hot spots
Cons:
- Range queries hit all shards
- Adding shards requires rehashing (use consistent hashing!)
3. DIRECTORY-BASED PARTITIONING
Lookup table maps keys to shards.
Lookup Service:
user_1 → Shard A
user_2 → Shard B
user_3 → Shard A
...
Pros:
- Flexible placement
- Can move data easily
Cons:
- Lookup service is bottleneck/SPOF
- Extra hop for every query
4. GEOGRAPHIC PARTITIONING
Partition by location for data locality.
US users → US data center
EU users → EU data center
Asia users → Asia data center
Pros:
- Low latency for users
- Data sovereignty compliance
Cons:
- Complex for global data
- Cross-region queries expensive
8.2 Consistent Hashing
The Problem with Simple Hashing
Simple: shard = hash(key) % N
Add a server (N=4 → N=5):
hash(key) % 4 = 2 → Shard C
hash(key) % 5 = 3 → Shard D (different!)
Almost ALL keys move to different shards! Massive data migration, cache invalidation.
Consistent Hashing
Imagine a ring (0 to 2^32):
0
/|\
/ | \
/ | \
270 | 90
\ | /
\ | /
\|/
180
Place servers on ring (by hashing server name):
- hash("server-a") = 45 → position 45
- hash("server-b") = 120 → position 120
- hash("server-c") = 200 → position 200
- hash("server-d") = 300 → position 300
Place keys on ring, assign to next server clockwise:
- hash("user-123") = 100 → server-b (next after 100 is 120)
- hash("user-456") = 250 → server-d (next after 250 is 300)
Adding a Server
Add server-e at position 150:
Before: Keys 120-200 → server-c
After:  Keys 120-150 → server-e
        Keys 150-200 → server-c
Only keys between server-b and server-e move! About 1/N of keys move instead of all keys.
Virtual Nodes
Problem: Servers might be unevenly distributed on ring.
Solution: Each physical server gets multiple positions.
server-a → positions 45, 160, 280 (3 virtual nodes)
server-b → positions 90, 200, 320 (3 virtual nodes)
More even distribution, and bigger servers can have more virtual nodes.
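A minimal consistent-hash ring with virtual nodes, as a sketch — the hash function, server names, and virtual-node count are illustrative:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []                          # sorted list of (position, server)
        for server in servers:
            for i in range(vnodes):             # each server appears vnodes times on the ring
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def get_server(self, key):
        # Walk clockwise from the key's position to the next server on the ring.
        idx = bisect.bisect_right(self.positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c", "server-d"])
print(ring.get_server("user-123"))

Adding a server only reassigns the keys that now fall between its virtual nodes and their predecessors — roughly 1/N of the total.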
Used By:
- Amazon DynamoDB
- Apache Cassandra
- Memcached clients
- Content Delivery Networks
Chapter 9: Replication
9.1 Replication Strategies
SINGLE-LEADER REPLICATION
One primary (leader), multiple secondaries (followers). Writes go to leader, replicate to followers. Reads can go to any node.
Client
|
v
Leader ----→ Follower 1
| |
| +--→ Follower 2
| |
| +--→ Follower 3
|
(writes) (reads can go anywhere)
SYNCHRONOUS REPLICATION
Leader waits for followers to confirm before returning success.
Client → Leader → Followers (wait for ACK) → Success to Client
Pros:
- Strong consistency
- No data loss on leader failure
Cons:
- High latency (wait for slowest follower)
- Availability impact (follower down = writes blocked)
ASYNCHRONOUS REPLICATION
Leader returns success immediately, replicates in background.
Client → Leader → Success to Client
            |
            v (async)
         Followers
Pros:
- Low latency
- High availability
Cons:
- Potential data loss (leader fails before replication)
- Followers may be behind (replication lag)
SEMI-SYNCHRONOUS REPLICATION
Wait for at least one follower to confirm. Leader confirms after 1+ followers ACK. Other followers replicate async.
Balance between consistency and performance. Used by: MySQL semi-sync replication
9.2 Multi-Leader Replication
MULTI-LEADER (Multi-Master) REPLICATION
Multiple nodes accept writes. Changes replicate between all leaders.
Leader A <-----> Leader B <-----> Leader C
| | |
Reads Reads Reads
Use cases:
- Multi-datacenter (leader per datacenter)
- Offline-capable applications (each device is a leader)
- Collaborative editing
Conflict Resolution
What if two leaders modify same record simultaneously?
Leader A: UPDATE user SET name = 'Alice'
Leader B: UPDATE user SET name = 'Bob'
Conflict! Which one wins?
Strategies:
1. LAST-WRITE-WINS (LWW) Timestamp each write, highest timestamp wins.
Simple but can lose data. Time sync issues (clock skew).
2. VERSION VECTORS Track version per node. Detect conflicts, let application resolve.
User A: [A:1, B:0] name = 'Alice'
User B: [A:0, B:1] name = 'Bob'
Neither dominates → conflict!
3. MERGE Combine both values somehow.
For shopping cart: union of items
For counters: sum of increments
For text: operational transformation
4. CUSTOM RESOLUTION Application-specific logic.
"Most recent login location wins" "Merge address fields, flag for review"
9.3 Leaderless Replication
LEADERLESS REPLICATION
No single leader. Client writes to multiple nodes. Reads from multiple nodes, use quorum.
Client writes to N nodes
Client reads from N nodes
R + W > N ensures overlap (quorum)
N = 3 (total replicas)
W = 2 (write to 2 nodes)
R = 2 (read from 2 nodes)
W + R = 4 > 3 → guaranteed to read latest write
Example: Write x=5, then read
Write x=5 to nodes A and B (W=2)
Node A: x=5
Node B: x=5
Node C: x=1 (stale, missed write)
Read from nodes B and C (R=2)
Get x=5 from B, x=1 from C
Return x=5 (highest version)
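A sketch of that quorum read, assuming each replica is a dict of key → (value, version) and the highest version wins:

def quorum_read(nodes, key, r):
    """Read from r replicas and return the value with the highest version."""
    responses = []
    for node in nodes[:r]:
        value, version = node.get(key, (None, -1))
        responses.append((version, value))
    version, value = max(responses)        # highest version number wins
    return value

node_a = {"x": (5, 2)}    # value 5, version 2
node_b = {"x": (5, 2)}
node_c = {"x": (1, 1)}    # stale, missed the write

print(quorum_read([node_b, node_c], "x", r=2))   # 5 — version 2 beats the stale version 1

Real systems also write the fresh value back to the stale replica (read repair).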
Quorum Configuration
- R=1, W=N: Fast reads, slow writes; a write blocks or fails if any replica is down
- R=N, W=1: Slow reads, fast writes, write may be lost
- R=W=N/2+1: Balanced, typical configuration
N=3, R=2, W=2: Good balance
N=5, R=3, W=3: More fault tolerant
Used By:
- Amazon DynamoDB (configurable)
- Apache Cassandra
- Riak
Sloppy Quorum and Hinted Handoff
If designated node is down, write to another node temporarily. "Hint" stored that data belongs elsewhere. When original node recovers, data transferred back.
Improves availability but weakens consistency guarantee.
Part III: Data Concepts
Chapter 10: Data Storage Patterns
10.1 SQL vs NoSQL
SQL (Relational) DATABASES
Structured data with fixed schema. Tables, rows, columns. SQL query language. ACID transactions.
Examples: PostgreSQL, MySQL, Oracle, SQL Server
Best for:
- Complex queries with JOINs
- Transactions across multiple tables
- Strong consistency requirements
- Well-defined schema
NoSQL DATABASES
Various data models, flexible schema. Optimized for specific access patterns.
Types:
1. KEY-VALUE STORES
Simple: key → value
Examples: Redis, DynamoDB, Memcached
Use: Caching, session storage, simple lookups
2. DOCUMENT STORES
Key → JSON/BSON document
Examples: MongoDB, CouchDB, Firestore
Use: Content management, catalogs, user profiles
3. COLUMN-FAMILY (Wide Column)
Rows with dynamic columns, column families
Examples: Cassandra, HBase, ScyllaDB
Use: Time series, IoT, large-scale analytics
4. GRAPH DATABASES
Nodes and relationships
Examples: Neo4j, Amazon Neptune, JanusGraph
Use: Social networks, recommendation, fraud detection
Comparison:
| Aspect | SQL | NoSQL |
|---|---|---|
| Schema | Fixed, defined | Flexible, dynamic |
| Scaling | Vertical (mainly) | Horizontal (designed for) |
| Consistency | Strong (ACID) | Eventual (BASE) typically |
| Query Language | SQL | Varies by database |
| JOINs | Native support | Usually application-side |
| Transactions | Multi-table | Usually single-document |
10.2 Indexing
INDEX TYPES
1. B-TREE INDEX (Most Common)
Balanced tree structure. Good for: equality, range queries, sorting
Query: WHERE age > 25 AND age < 50
Without index: Scan all rows (O(n))
With B-tree: Navigate tree (O(log n))
2. HASH INDEX
Hash table structure. Good for: exact equality only
Query: WHERE id = 12345
Hash id → find bucket → get value (O(1))
Cannot do: range queries, sorting
3. INVERTED INDEX (Full-Text Search)
Maps words to documents containing them.
"hello" → [doc1, doc3, doc7]
"world" → [doc1, doc5]
Query: "hello world"
Intersection: [doc1]
Used by: Elasticsearch, PostgreSQL full-text
4. BITMAP INDEX
Bit array for each value. Good for: low-cardinality columns
gender='M': 1,0,1,1,0,1,0,0...
gender='F': 0,1,0,0,1,0,1,1...
Fast for: COUNT, AND/OR operations
5. GEOSPATIAL INDEX (R-Tree, Quad-tree)
Indexes 2D/3D coordinates.
Query: Find all restaurants within 5km
Used by: PostGIS, MongoDB, Redis
COMPOSITE INDEX
Index on multiple columns. Order matters!
INDEX (country, city, street)
Efficient for:
WHERE country = 'US'
WHERE country = 'US' AND city = 'NYC'
WHERE country = 'US' AND city = 'NYC' AND street = 'Broadway'
NOT efficient for:
WHERE city = 'NYC' (leftmost column not used)
WHERE country = 'US' AND street = 'Broadway' (gap in columns)
COVERING INDEX
Index contains all columns needed for query. No need to access table data.
INDEX (user_id, name, email)
Query: SELECT name, email WHERE user_id = 123
Index has all data → no table lookup → faster!
10.3 Write-Ahead Log and LSM Trees
WRITE-AHEAD LOG (WAL)
Before modifying data, write to append-only log. Used for durability and crash recovery.
Write path:
- Append operation to WAL (sequential write, fast)
- fsync WAL to disk
- Apply to in-memory data structures
- Periodically flush to data files (checkpoint)
Recovery:
- Load last checkpoint
- Replay WAL from checkpoint
- System is consistent
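A toy write-ahead log, sketching the write and recovery paths above; the file name and JSON record format are illustrative:

import json
import os

class TinyWAL:
    def __init__(self, path="wal.log"):
        self.log = open(path, "a+")
        self.data = {}

    def set(self, key, value):
        record = json.dumps({"op": "set", "key": key, "value": value})
        self.log.write(record + "\n")      # 1. append the intention to the log
        self.log.flush()
        os.fsync(self.log.fileno())        # 2. force it onto persistent storage
        self.data[key] = value             # 3. only then apply it in memory

    def recover(self):
        """After a crash: replay the log to rebuild in-memory state."""
        self.log.seek(0)
        for line in self.log:
            record = json.loads(line)
            if record["op"] == "set":
                self.data[record["key"]] = record["value"]

store = TinyWAL()
store.set("balance:A", 900)
store.set("balance:B", 1100)

Real engines also checkpoint and truncate the log, which this sketch omits.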
LSM TREE (Log-Structured Merge Tree)
Optimized for write-heavy workloads. Used by: LevelDB, RocksDB, Cassandra, HBase
Structure:
MemTable (in-memory, sorted)
|
| (flush when full)
v
Level 0 SSTables (Sorted String Tables on disk)
|
| (compaction merges files)
v
Level 1 SSTables
|
v
Level N SSTables
Write path:
- Write to MemTable (memory, fast)
- When MemTable full, flush to Level 0 SSTable
- Background compaction merges and sorts levels
Read path:
- Check MemTable
- Check each level (newest to oldest)
- Use bloom filters to skip files without key
Pros:
- Excellent write performance (sequential I/O)
- Compression friendly
Cons:
- Read amplification (check multiple levels)
- Write amplification (compaction rewrites data)
- Space amplification (data exists in multiple levels)
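A toy LSM tree showing the write path (memtable, then flush to sorted immutable runs) and the newest-to-oldest read path; real engines add a WAL, bloom filters, and compaction:

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []                   # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value           # writes hit memory first (fast)
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        run = sorted(self.memtable.items())  # an "SSTable": sorted and immutable
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # check memory first
            return self.memtable[key]
        for run in self.sstables:            # then each run, newest to oldest
            for k, v in run:                 # (real engines binary-search and use bloom filters)
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"key-{i}", i)
print(db.get("key-3"))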
Chapter 11: Caching Concepts
11.1 Cache Patterns
CACHING STRATEGIES
1. CACHE-ASIDE (Lazy Loading)
Application manages cache explicitly.
Read:
data = cache.get(key)
if data is None:
    data = db.query(key)
    cache.set(key, data)
return data
Write:
db.update(key, data)
cache.delete(key) # Invalidate
Pros: Only requested data is cached Cons: Cache miss penalty, potential stale data
2. READ-THROUGH
Cache sits in front of database. Cache handles misses automatically.
Read:
data = cache.get(key) # Cache fetches from DB on miss
Pros: Simpler application code Cons: Cache library must know how to query DB
3. WRITE-THROUGH
Writes go to cache, cache writes to database. Synchronous — returns after both succeed.
Write:
cache.set(key, data) # Also writes to DB
Pros: Cache always consistent with DB Cons: Higher write latency
4. WRITE-BEHIND (Write-Back)
Write to cache, asynchronously write to database.
Write:
cache.set(key, data) # Returns immediately
# Background process writes to DB later
Pros: Very fast writes Cons: Data loss risk if cache fails before flush
5. REFRESH-AHEAD
Proactively refresh cache before expiration.
If item accessed and TTL < threshold: Async refresh in background
Pros: Reduces cache miss latency Cons: May refresh unused data
11.2 Cache Eviction Policies
EVICTION POLICIES
When cache is full, which item to remove?
1. LRU (Least Recently Used)
Remove item that hasn't been accessed longest.
Access order: A, B, C, D, A, B Evict: C (least recently used)
Pros: Good general-purpose policy Cons: One-time scans pollute cache
Used by: Most caches by default
2. LFU (Least Frequently Used)
Remove item with fewest accesses.
Access counts: A(10), B(2), C(5), D(1) Evict: D (least frequent)
Pros: Keeps popular items Cons: Old popular items never evicted
Variant: LFU with aging (decay counts over time)
3. FIFO (First In, First Out)
Remove oldest item.
Insert order: A, B, C, D Evict: A (first inserted)
Pros: Simple to implement Cons: Doesn't consider access patterns
4. TTL (Time To Live)
Items expire after fixed time.
cache.set(key, value, ttl=300) # Expires in 5 minutes
Pros: Bounds staleness Cons: Popular items may expire unnecessarily
5. RANDOM
Randomly select item to evict.
Pros: Simple, no metadata overhead Cons: May evict important items
Real-World Combinations:
- Redis: Multiple policies (LRU, LFU, TTL, random)
- Memcached: LRU with slab allocation
- CPU caches: Pseudo-LRU (approximate for speed)
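The LRU policy from item 1 above, sketched with an OrderedDict; the capacity and keys are illustrative:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used item

cache = LRUCache(capacity=3)
for k in ["A", "B", "C", "D"]:
    cache.set(k, k)
print(cache.get("A"))                        # None — A was evicted when D arrived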
11.3 Cache Problems
CACHE PROBLEMS AND SOLUTIONS
1. CACHE STAMPEDE (Thundering Herd)
Problem:
- Popular key expires
- 100 requests arrive simultaneously
- All miss cache, all query database
- Database overwhelmed
Solutions:
a) Locking: Only one request fetches, others wait
if not cache.get(key):
    if cache.acquire_lock(key):
        data = db.query()
        cache.set(key, data)
        cache.release_lock(key)
    else:
        wait_and_retry()
b) Probabilistic early refresh: Refresh before expiration with some probability. Spreads refresh load over time.
2. CACHE PENETRATION
Problem:
- Query for non-existent key (e.g., user ID 999999999)
- Cache always misses
- Database queried every time
- Attacker can DoS with fake keys
Solutions:
a) Cache null values:
if db.query(key) is None:
    cache.set(key, NULL, ttl=60)
b) Bloom filter: Check bloom filter before database. If not in bloom filter, definitely doesn't exist.
3. CACHE BREAKDOWN
Problem:
- Single hot key expires
- Massive traffic hits database
Solutions:
a) Never expire hot keys
b) Background refresh before expiration
c) Mutex lock (only one request refreshes)
4. HOT KEY
Problem:
- One key gets disproportionate traffic
- Single cache node overwhelmed
Solutions:
a) Local caching: Cache hot keys in application memory
b) Key replication: Store same key on multiple nodes.
   key_1, key_2, key_3 all store the same data, hashed to different nodes.
c) Read replicas: Multiple cache replicas for reads
Part IV: Messaging Concepts
Chapter 12: Message Queues and Streams
12.1 Queue vs Stream
MESSAGE QUEUE (e.g., RabbitMQ, SQS)
Message consumed by ONE consumer, then deleted.
Producer → [Queue] → Consumer
Messages: A, B, C, D
Consumer 1 takes: A, C
Consumer 2 takes: B, D
After consumption, messages are gone.
Characteristics:
- Competing consumers (work distribution)
- At-least-once delivery (with acknowledgment)
- No replay (message deleted after consume)
- Good for: Task queues, work distribution
MESSAGE STREAM (e.g., Kafka, Kinesis)
Messages retained, multiple consumers can read.
Producer → [Stream/Log] → Consumer Group A
                        → Consumer Group B
Messages: A, B, C, D (stored with offsets)
Consumer Group A: reads A, B, C, D at their pace
Consumer Group B: reads A, B, C, D independently
Messages retained for configured period.
Characteristics:
- Multiple consumer groups (fan-out)
- Replay capability (rewind to past offset)
- Ordered within partition
- Good for: Event sourcing, real-time analytics, audit logs
Comparison:
| Feature | Queue | Stream |
|---|---|---|
| Consumption | Destructive | Non-destructive |
| Consumer groups | Competing | Independent |
| Replay | No | Yes |
| Ordering | Per-queue | Per-partition |
| Retention | Until consumed | Time/size based |
| Use case | Task distribution | Event streaming |
12.2 Delivery Guarantees
DELIVERY SEMANTICS
1. AT-MOST-ONCE
Message may be lost, never delivered twice.
Producer sends → No acknowledgment.
If lost, gone forever.
Use case: Metrics, logs where loss is acceptable Trade-off: Fastest, least reliable
2. AT-LEAST-ONCE
Message will be delivered, possibly multiple times.
Producer sends → Wait for ACK.
If no ACK, retry.
May deliver twice if ACK lost but message received.
Use case: Most applications (with idempotent consumers) Trade-off: Reliable but needs deduplication
3. EXACTLY-ONCE
Message delivered exactly once.
Very hard to achieve in distributed systems!
Approaches:
a) Idempotent producer + idempotent consumer
   Producer retries with same ID.
   Consumer tracks processed IDs, ignores duplicates.
b) Transactional messaging
   Kafka transactions: atomic produce + consume
Use case: Financial transactions, inventory Trade-off: Most complex, some performance cost
Implementing Exactly-Once (Practical):
Producer side (idempotent):
message_id = generate_unique_id()
send(message_id, payload) # Retry with same ID
Consumer side (idempotent):
if message_id in processed_set:
    return  # Already processed
process(message)
processed_set.add(message_id)
With transactional outbox:
BEGIN TRANSACTION
INSERT INTO outbox (message)
UPDATE business_table
COMMIT
-- Separate process reads outbox, publishes to queue
Chapter 13: Event-Driven Architecture
13.1 Event Sourcing
EVENT SOURCING
Store events, not current state. Current state = replay all events.
Traditional (State-based):
Account table:
id=123, balance=1000, last_updated=2024-01-15
Problems:
- Lost history of how we got here
- Can't answer "what was balance on Jan 1?"
Event Sourcing:
Events table:
AccountCreated(id=123, initial=0)
Deposited(id=123, amount=500)
Deposited(id=123, amount=700)
Withdrawn(id=123, amount=200)
Current balance:
0 + 500 + 700 - 200 = 1000
Balance on Jan 1:
Replay events up to Jan 1
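Deriving state by folding over the event stream, as a sketch; the event names mirror the example above:

events = [
    ("AccountCreated", {"id": 123, "initial": 0}),
    ("Deposited",      {"id": 123, "amount": 500}),
    ("Deposited",      {"id": 123, "amount": 700}),
    ("Withdrawn",      {"id": 123, "amount": 200}),
]

def apply(balance, event):
    kind, data = event
    if kind == "AccountCreated":
        return data["initial"]
    if kind == "Deposited":
        return balance + data["amount"]
    if kind == "Withdrawn":
        return balance - data["amount"]
    return balance                 # unknown event types are ignored

balance = 0
for event in events:               # replay up to any point in time for "time travel"
    balance = apply(balance, event)
print(balance)                     # 1000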
Benefits:
- Complete audit trail
- Time travel (state at any point)
- Debugging (replay events to reproduce bugs)
- Event replay (rebuild read models, fix bugs)
Challenges:
- Event schema evolution
- Rebuild time for long event streams
- Snapshots needed for performance
CQRS (Command Query Responsibility Segregation)
Separate write model (events) from read model (projections).
Commands → Write Model (Event Store)
                  |
                  | (project events)
                  v
            Read Model (Optimized for queries)
                  ^
                  |
Queries → --------+
Write: Append event to event store
Read: Query from denormalized read model
Read model can be:
- Relational database with joins pre-computed
- Search index (Elasticsearch)
- Cache (Redis)
- Multiple different views of same data
13.2 Saga Pattern
SAGA PATTERN
Manage distributed transactions across services. Instead of ACID, use sequence of local transactions with compensations.
Example: Order Placement
1. Order Service: Create order (PENDING)
2. Payment Service: Charge customer
3. Inventory Service: Reserve items
4. Shipping Service: Schedule delivery
5. Order Service: Mark order (CONFIRMED)
If step 3 fails (out of stock):
- Compensate step 2: Refund payment
- Compensate step 1: Cancel order
Saga Execution Styles
1. CHOREOGRAPHY
Each service listens for events, decides what to do.
Order Created → Payment listens → Payment Charged event
Payment Charged → Inventory listens → Items Reserved event
Items Reserved → Shipping listens → Delivery Scheduled event
Pros: Decoupled, simple for few steps Cons: Hard to track, distributed logic
2. ORCHESTRATION
Central orchestrator controls the flow.
Orchestrator:
1. Tell Order: Create order
2. Tell Payment: Charge
3. Tell Inventory: Reserve
4. Tell Shipping: Schedule
5. Tell Order: Confirm
If failure:
Execute compensations in reverse order
Pros: Clear flow, easy to track/debug Cons: Orchestrator is potential bottleneck
Saga Compensations
Each action needs a compensation:
| Action | Compensation |
|---|---|
| Create order | Cancel order |
| Charge payment | Refund payment |
| Reserve inventory | Release inventory |
| Schedule shipping | Cancel shipping |
Compensations must be idempotent! May be called multiple times.
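An orchestration-style saga as a sketch: each step pairs an action with its compensation, and completed steps are compensated in reverse on failure. The service calls here are print-stub placeholders, not real service clients:

def run_saga(order_id, steps):
    completed = []
    try:
        for action, compensation in steps:
            action(order_id)
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation(order_id)             # compensations must be idempotent
        raise

steps = [
    (lambda o: print("create order", o),      lambda o: print("cancel order", o)),
    (lambda o: print("charge payment", o),    lambda o: print("refund payment", o)),
    (lambda o: print("reserve items", o),     lambda o: print("release items", o)),
    (lambda o: print("schedule shipping", o), lambda o: print("cancel shipping", o)),
]
run_saga("order-42", steps)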
Part V: Reliability Concepts
Chapter 14: Fault Tolerance
14.1 Failure Types
FAILURE MODES
1. CRASH FAILURE
Node stops and doesn't respond.
Detection: Heartbeat timeout Recovery: Failover to replica
Example: Process killed, server power off
2. OMISSION FAILURE
Node fails to send or receive messages.
Send omission: Node doesn't send response Receive omission: Node doesn't receive request
Often indistinguishable from crash.
3. TIMING FAILURE
Response outside acceptable time window.
Example: Response takes 30 seconds instead of 100ms
For caller: Looks like failure Challenge: Did request succeed or not?
4. BYZANTINE FAILURE
Node behaves arbitrarily (including maliciously).
Example: Node sends different data to different nodes
Hardest to handle. Requires 3f+1 nodes to tolerate f failures.
Relevant for: Blockchain, hostile environments
Detection Methods:
Heartbeat:
- Node periodically sends "I'm alive" message.
- Missing heartbeats → suspected failure.
- Challenge: Network partition vs actual failure
Timeout:
- Request times out → suspected failure.
- Challenge: Slow vs failed
Lease:
- Node holds lease (timed permission).
- Must renew before expiration.
- No renewal → other nodes can take over.
- Used by: Distributed locks, leader election
14.2 Redundancy Patterns
REDUNDANCY PATTERNS
1. ACTIVE-PASSIVE (Hot Standby)
Primary handles all traffic. Standby replicates data, takes over on failure.
 Primary  ◄═══════════════►  Standby
 (active)    replication     (passive)
    ↑
 traffic
Failover: Promote standby to primary
Pros: Simple, standby has consistent data Cons: Standby resources underutilized
2. ACTIVE-ACTIVE
Multiple nodes handle traffic simultaneously.
 Node A  ◄════════════►  Node B
(active)   sync/async   (active)
    ↑                       ↑
 traffic                 traffic
Pros: Better resource utilization, no failover needed Cons: Conflict resolution, data consistency
3. N+1 REDUNDANCY
N nodes handle normal load. +1 spare for failures.
Example: 3 nodes needed for traffic, deploy 4. One can fail without impact.
4. N+2 REDUNDANCY
One failure + one for maintenance.
Example: 3+2 = 5 nodes. Handles one failure while another is being updated.
5. 2N REDUNDANCY
Double everything.
Example: Need 3 nodes, deploy 6. Half can fail without impact.
Used for: Critical systems, disaster recovery
14.3 Circuit Breaker Pattern
CIRCUIT BREAKER
Prevent cascading failures by failing fast.
States:
- CLOSED: Normal operation, requests pass through
- OPEN: Failure threshold exceeded, requests fail immediately
- HALF-OPEN: Test if downstream recovered
State Machine:
CLOSED    ──(failures > threshold)───►  OPEN
OPEN      ──(timeout expires)────────►  HALF-OPEN
HALF-OPEN ──(test request succeeds)──►  CLOSED
HALF-OPEN ──(test request fails)─────►  OPEN
Implementation:
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = threshold
        self.timeout = timeout          # seconds to stay OPEN before probing
        self.state = CLOSED
        self.last_failure_time = None

    def call(self, func):
        if self.state == OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = HALF_OPEN  # allow one test request through
            else:
                raise CircuitOpenError()
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = OPEN
BULKHEAD PATTERN
Isolate failures to prevent spreading. Like ship bulkheads that contain flooding.
Service A ─────► [Pool 1] ─────► Dependency X
Service A ─────► [Pool 2] ─────► Dependency Y
Pool 1 exhausted (X is slow)? Pool 2 still works (Y unaffected).
Implementation:
- Separate thread pools per dependency
- Separate connection pools
- Separate rate limits
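A bulkhead sketch using a separate bounded thread pool per dependency, so a slow dependency X cannot exhaust the workers needed for Y; pool sizes and the stub dependencies are illustrative:

import time
from concurrent.futures import ThreadPoolExecutor

def slow_dependency_x(request):
    time.sleep(2)                    # simulate a slow downstream service
    return f"X:{request}"

def dependency_y(request):
    return f"Y:{request}"

pool_x = ThreadPoolExecutor(max_workers=10)   # only calls to X queue here
pool_y = ThreadPoolExecutor(max_workers=10)   # Y keeps its own workers

def call_x(request):
    return pool_x.submit(slow_dependency_x, request)

def call_y(request):
    return pool_y.submit(dependency_y, request)

futures = [call_x(i) for i in range(50)]      # X's pool backs up...
print(call_y("hello").result())               # ...but Y still responds immediately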
Chapter 15: SLIs, SLOs, and SLAs
15.1 Definitions
SERVICE LEVEL TERMINOLOGY
SLI (Service Level Indicator) A metric that measures service behavior.
Examples:
- Request latency (p99 < 100ms)
- Error rate (errors / total requests)
- Availability (uptime / total time)
- Throughput (requests per second)
SLO (Service Level Objective) Target value for an SLI. Internal goal.
Examples:
- "99.9% of requests complete in < 100ms"
- "Error rate < 0.1%"
- "99.95% availability"
SLA (Service Level Agreement) Contract with consequences. External commitment.
Examples:
- "99.9% uptime or customer gets credit"
- "Response within 4 hours or penalty"
SLA should be less strict than SLO! (Buffer for unexpected issues)
Relationship:
SLI ──measures──► SLO ──commits──► SLA
SLI: "Our p99 latency is 85ms"
SLO: "We aim for p99 < 100ms"
SLA: "We guarantee p99 < 200ms" (with penalties)
15.2 Error Budgets
ERROR BUDGET
If SLO is 99.9% availability: Error budget = 0.1% downtime allowed
Per month (30 days): 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime allowed
Error Budget Policy:
- Budget remaining → Ship fast, take risks
- Budget exhausted → Focus on reliability, freeze features
Budget at 80%: Normal development
Budget at 50%: Increased caution
Budget at 20%: Reliability focus
Budget at 0%: Feature freeze, fix reliability
Calculating Error Budget:
Monthly error budget for 99.9% availability:
Total minutes: 30 × 24 × 60 = 43,200 minutes
Allowed downtime: 43,200 × 0.001 = 43.2 minutes
Incidents this month:
- Deploy failure: 15 minutes
- Database issue: 10 minutes
Budget used: 25 minutes
Budget remaining: 18.2 minutes (42% left)
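The same arithmetic as a small script; the incident durations are the example numbers above:

slo = 0.999
minutes_in_month = 30 * 24 * 60                  # 43,200 minutes
budget = minutes_in_month * (1 - slo)            # 43.2 minutes of allowed downtime

incidents = [15, 10]                             # deploy failure, database issue
used = sum(incidents)
remaining = budget - used
print(f"budget={budget:.1f} min, used={used} min, "
      f"remaining={remaining:.1f} min ({remaining / budget:.0%} left)")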
Monitoring Error Budget:
Track:
- Current error budget consumption rate
- Projected budget at end of period
- Alert when consumption rate too high
If burning budget faster than expected:
- Investigate
- Maybe pause risky deployments
- Focus on reliability
Summary
Key Concepts at a Glance
THEOREMS AND TRADE-OFFS:
CAP Theorem:
- Consistency, Availability, Partition tolerance
- During partition: choose C or A
- CP: refuse requests if can't guarantee consistency
- AP: serve requests with possibly stale data
PACELC:
- Extends CAP: what about when no partition?
- EL: prioritize low latency
- EC: prioritize consistency
ACID (Databases):
- Atomicity: all or nothing
- Consistency: valid state to valid state
- Isolation: transactions don't interfere
- Durability: committed = permanent
BASE (Distributed):
- Basically Available
- Soft state
- Eventually consistent
CONSISTENCY MODELS (Strong to Weak):
Linearizability → Sequential → Causal → Eventual
SCALING:
- Vertical: bigger machine (simple, limited)
- Horizontal: more machines (complex, unlimited)
- Stateless: easy to scale, any server works
- Stateful: needs session affinity or externalized state
SHARDING:
- Range: predictable, hot spots possible
- Hash: even distribution, no range queries
- Consistent hashing: minimal redistribution on changes
REPLICATION:
- Single-leader: writes to one, reads from many
- Multi-leader: writes anywhere, conflict resolution needed
- Leaderless: quorum reads/writes (R + W > N)
- Sync: consistent but slow
- Async: fast but possible data loss
CACHING:
- Patterns: cache-aside, read-through, write-through, write-behind
- Eviction: LRU, LFU, TTL
- Problems: stampede, penetration, breakdown, hot keys
MESSAGING:
- Queue: one consumer, message deleted
- Stream: multiple consumers, replay possible
- Delivery: at-most-once, at-least-once, exactly-once
- Saga: distributed transactions with compensations
RELIABILITY:
- Circuit breaker: fail fast when downstream is down
- Bulkhead: isolate failures
- Redundancy: N+1, N+2, 2N
- SLI: metric (latency, error rate)
- SLO: target (99.9% availability)
- SLA: contract (penalties for breach)
- Error budget: allowed failure within SLO
📚 Further Reading
Books
- Designing Data-Intensive Applications by Martin Kleppmann
- The definitive guide to distributed systems concepts
- Database Internals by Alex Petrov
- Deep dive into storage engines and distributed databases
- Site Reliability Engineering by Google
- SLIs, SLOs, error budgets, and operational practices
Papers
- CAP Twelve Years Later by Eric Brewer
- Harvest, Yield, and Scalable Tolerant Systems by Fox & Brewer
- Dynamo: Amazon's Highly Available Key-value Store
- Spanner: Google's Globally-Distributed Database
Online Resources
- Distributed Systems for Fun and Profit (free book)
- jepsen.io - Distributed systems correctness testing
- Martin Kleppmann's Blog - https://martin.kleppmann.com/
- High Scalability Blog - http://highscalability.com/
Videos
- MIT 6.824: Distributed Systems (free course)
- Martin Kleppmann's Conference Talks
End of Week 0 — Part 6: System Design Terminology and Theory
Week 0 Complete! You now have a solid foundation in:
- System design methodology (Part 1)
- Infrastructure building blocks (Part 2)
- Back-of-envelope estimation (Part 3)
- Networking fundamentals (Part 4)
- Operating system concepts (Part 5)
- Theory and terminology (Part 6)
Ready to dive into Week 1: Foundations of Scale — where we apply these concepts to real system designs!