Week 0 — Part 6: System Design Terminology & Theory
System Design Mastery Series
Preface
System design interviews are filled with jargon. Interviewers throw around terms like "CAP theorem," "eventual consistency," "idempotent," and "horizontal scaling" expecting you to not just know what they mean, but to reason about their tradeoffs.
Interviewer: "How would you ensure consistency in this distributed system?"
Weak answer: "I'd use a database that's consistent."
Strong answer: "We need to decide what consistency level we need. For user
balances, we need strong consistency — linearizable reads. For the activity
feed, eventual consistency is fine — users can tolerate seeing posts a few
seconds late. I'd use synchronous replication for the balance table and
async replication for the feed, accepting the CAP tradeoff that during a
partition, the balance service might become unavailable while the feed
remains available with stale data."
This document is your glossary and theory guide — every term you'll encounter in system design, explained with examples and interview context.
Part I: Fundamental Theorems and Trade-offs
Chapter 1: CAP Theorem
1.1 The Three Properties
CAP Theorem (Brewer's Theorem) states that in a distributed system, you can only guarantee TWO of three properties:
C - CONSISTENCY Every read receives the most recent write or an error. All nodes see the same data at the same time.
A - AVAILABILITY Every request receives a response (not an error). The system is always operational.
P - PARTITION TOLERANCE The system continues to operate despite network partitions. Messages between nodes can be lost or delayed.
The Reality: You Don't "Choose 2 of 3"
In distributed systems, network partitions WILL happen. P is not optional — it's a given.
So the real choice is: During a partition, do you want:
- CP: Consistency (refuse requests until partition heals)
- AP: Availability (serve requests but may be stale/inconsistent)
1.2 CAP in Practice
Partition Scenario:
    Node A                          Node B
      |                               |
      |     [Network Partition]       |
      |   xxxxxxxxxxxxxxxxxxxxxxxx    |
      |                               |
Client writes                    Client reads
value = 100                      value = ???
CP System (Consistency + Partition Tolerance): Node B refuses to answer until it can confirm with Node A. Client gets an error or timeout. "Sorry, I can't guarantee this data is current."
Examples: ZooKeeper, etcd, HBase (when configured for consistency), Traditional RDBMS with synchronous replication
AP System (Availability + Partition Tolerance): Node B returns its last known value (might be stale). Client gets a response (possibly outdated). "Here's what I have, but it might not be the latest."
Examples: Cassandra (tunable, default AP), DynamoDB (tunable), CouchDB, DNS
CA System (Consistency + Availability): Only possible without partitions — meaning single node! Not relevant for distributed systems.
Examples: Single-node PostgreSQL, Single-node MySQL
1.3 Interview Explanation
INTERVIEW QUESTION: "Explain CAP theorem and how it applies to your design."
STRONG ANSWER:
"CAP theorem states that in a distributed system experiencing a network
partition, you must choose between consistency and availability.
For our user profile service, I'd choose AP — availability over consistency.
Here's why:
1. User experience: Showing slightly stale profile data is better than
showing an error.
2. Nature of data: Profile updates are infrequent. If a user updates their
bio and another user sees the old bio for a few seconds, it's acceptable.
3. Conflict resolution: For concurrent updates, we can use last-write-wins
with timestamps, or version vectors for more complex merging.
However, for the payment service, I'd choose CP — we cannot allow
inconsistent account balances. Users would rather see an error than
have their money disappear or duplicate.
The key is: different parts of the system can make different CAP trade-offs
based on their requirements."
Chapter 2: PACELC Theorem
2.1 Extending CAP
PACELC: What Happens When There's NO Partition?
CAP only addresses partition scenarios. PACELC extends it:
if (Partition) {
    choose: Availability OR Consistency   // Same as CAP
} else {
    choose: Latency OR Consistency        // New insight!
}
P  A  C  E  L  C
|  |  |  |  |  |
|  |  |  |  |  +-- Consistency (else)
|  |  |  |  +----- Latency (else)
|  |  |  +-------- Else (no partition)
|  |  +----------- Consistency (partition)
|  +-------------- Availability (partition)
+----------------- Partition
The Key Insight: Even without partitions, there's a trade-off:
- Strong consistency requires coordination (adds latency)
- Low latency requires relaxed consistency
Example: Synchronous replication
- Write to primary
- Wait for ALL replicas to confirm
- Then return success to client
Consistent? Yes. Fast? No — you wait for the slowest replica.
2.2 PACELC Classifications
| System | Partition (PA/PC) | Else (EL/EC) |
|---|---|---|
| DynamoDB | PA (Available) | EL (Low latency) |
| Cassandra | PA (Available) | EL (Low latency) |
| MongoDB | PC (Consistent) | EC (Consistent) |
| HBase | PC (Consistent) | EC (Consistent) |
| PNUTS (Yahoo) | PC (Consistent) | EL (Low latency) |
PA/EL Systems (DynamoDB, Cassandra):
- Always try to be fast and available
- Accept eventual consistency
- Great for: user sessions, product catalogs, social feeds
PC/EC Systems (MongoDB, HBase):
- Always try to be consistent
- Accept higher latency and potential unavailability
- Great for: financial data, inventory counts, user accounts
Chapter 3: ACID Properties
3.1 The Four Properties
ACID: Guarantees for Database Transactions
A - ATOMICITY All or nothing. Transaction either fully completes or fully rolls back. No partial updates.
Example: Transfer $100 from Account A to Account B
- Debit A: $100
- Credit B: $100
Both happen, or neither happens. Never just one.
C - CONSISTENCY Database moves from one valid state to another. All constraints, triggers, and rules are satisfied.
Example: Account balance cannot be negative
- If transfer would make balance negative, transaction fails
- Database never contains invalid data
I - ISOLATION Concurrent transactions don't interfere with each other. Each transaction sees a consistent snapshot.
Example: Two transfers happening simultaneously
- Transfer 1: A → B
- Transfer 2: A → C
Each sees the correct balance; neither corrupts the other.
D - DURABILITY Once committed, data survives crashes. Written to non-volatile storage.
Example: After "COMMIT" returns success
- Server can crash
- Power can fail
- Data is still there after restart
3.2 Isolation Levels
Isolation Levels (Weakest to Strongest)
1. READ UNCOMMITTED Can read data that hasn't been committed yet.
Problem: Dirty Reads
- Transaction A updates row (not committed)
- Transaction B reads updated value
- Transaction A rolls back
- Transaction B has garbage data!
Use case: Almost never. Maybe analytics where accuracy doesn't matter.
2. READ COMMITTED Can only read committed data. But same query can return different results within transaction.
Problem: Non-Repeatable Reads
- Transaction A reads row (value = 100)
- Transaction B updates row to 200, commits
- Transaction A reads same row again (value = 200)
- Different values in same transaction!
Use case: Default in PostgreSQL, Oracle. Good for most applications.
3. REPEATABLE READ Same query always returns same result within transaction. But new rows can appear.
Problem: Phantom Reads
- Transaction A: SELECT * WHERE price < 100 (returns 10 rows)
- Transaction B: INSERT row with price = 50, commits
- Transaction A: SELECT * WHERE price < 100 (returns 11 rows!)
- New row appeared!
Use case: Default in MySQL. Financial reports within a transaction.
4. SERIALIZABLE Transactions execute as if they were serial (one after another). No anomalies possible.
Implementation: Locking or optimistic concurrency
Use case: Critical financial operations. Highest consistency, lowest throughput.
Comparison Table:
| Level | Dirty Read | Non-Repeatable | Phantom |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | No | Possible | Possible |
| Repeatable Read | No | No | Possible |
| Serializable | No | No | No |
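In practice you pick the isolation level per transaction. A minimal sketch, assuming a PostgreSQL database reachable through psycopg2 — the DSN, table, and account id are illustrative:

import psycopg2

conn = psycopg2.connect("dbname=shop user=app")   # hypothetical DSN
try:
    with conn:                                    # one transaction; commits on success
        with conn.cursor() as cur:
            # Must be issued before any query in this transaction.
            cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
            cur.execute("SELECT balance FROM accounts WHERE id = %s", (123,))
            first = cur.fetchone()
            # Under REPEATABLE READ, re-reading returns the same snapshot,
            # even if another session commits an update in between.
            cur.execute("SELECT balance FROM accounts WHERE id = %s", (123,))
            assert cur.fetchone() == first
finally:
    conn.close()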
3.3 ACID Implementation
How Databases Achieve ACID
ATOMICITY: Write-Ahead Logging (WAL)
- Write intention to log FIRST
- Then modify actual data
- On crash: replay log to recover
Transaction starts
→ Write "BEGIN TX-123" to log
→ Write "UPDATE accounts SET balance=900 WHERE id=A" to log
→ Write "UPDATE accounts SET balance=1100 WHERE id=B" to log
→ Write "COMMIT TX-123" to log
→ Apply changes to data files
Crash before COMMIT? Transaction never happened.
Crash after COMMIT? Replay log on recovery.
ISOLATION: Locking or MVCC
Locking (Pessimistic):
- Acquire locks before reading/writing
- Other transactions wait
- Simple but can cause contention
MVCC - Multi-Version Concurrency Control (Optimistic):
- Keep multiple versions of each row
- Each transaction sees snapshot at start time
- Writers don't block readers
- Used by: PostgreSQL, MySQL InnoDB, Oracle
DURABILITY: fsync to disk
After COMMIT:
- Flush WAL to disk
- fsync() ensures data hits persistent storage
- Only then return success to client
Trade-off: fsync is slow (~10ms on HDD)
- Batch commits (group commit)
- Use fast storage (NVMe SSD)
Chapter 4: BASE Properties
4.1 BASE vs ACID
BASE: The Alternative to ACID
For distributed systems where ACID is too restrictive:
BA - BASICALLY AVAILABLE System guarantees availability (in CAP sense). Might return stale data, but always responds.
S - SOFT STATE State can change over time, even without input. Due to eventual consistency propagation.
E - EVENTUAL CONSISTENCY If no new updates, all replicas will eventually converge. No guarantee of when, but eventually consistent.
ACID vs BASE Comparison:
| ACID | BASE |
|---|---|
| Strong consistency | Eventual consistency |
| Pessimistic (lock early) | Optimistic (resolve later) |
| Complex transactions | Simple operations |
| Vertical scaling | Horizontal scaling |
| Traditional RDBMS | NoSQL databases |
When to Use Each:
ACID (Traditional databases):
- Financial transactions
- Inventory management
- User account management
- Any place where data integrity is critical
BASE (Distributed databases):
- Social media feeds
- Product catalogs
- Session stores
- Analytics data
- Any place where eventual consistency is acceptable
4.2 Eventual Consistency in Practice
Example: Social Media Post
User posts status update in US data center:
t=0ms: User clicks "Post"
t=10ms: US data center saves post
t=10ms: User sees their post (read-your-writes)
t=50ms: Replication to EU data center starts
t=150ms: EU data center receives post
During t=10ms to t=150ms:
- US users see the post
- EU users don't see the post yet
- System is inconsistent but available
After t=150ms:
- All users see the post
- System is eventually consistent
Consistency Guarantees (Strongest to Weakest):
1. STRONG CONSISTENCY - After write completes, all reads see new value.
2. READ-YOUR-WRITES - You always see your own writes. Others might not see them yet.
3. MONOTONIC READS - Once you see a value, you never see older values. Time doesn't go backward for you.
4. MONOTONIC WRITES - Your writes are applied in order. Write A before Write B → A applied before B.
5. EVENTUAL CONSISTENCY - Eventually, all replicas converge. No timing guarantee.
Implementing Read-Your-Writes:
Option 1: Sticky sessions - Route user to same replica that handled their write.
Option 2: Read from primary - For user's own data, always read from primary.
Option 3: Version tracking - Client tracks version of last write. Only accept reads with version >= last write.
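A minimal sketch of Option 3, version tracking. The replicas and primary here are plain dicts mapping key → (value, version) — a stand-in for a real client API:

def read_own_writes(replicas, primary, key, last_written_version):
    """Return a value at least as fresh as the client's own last write."""
    for replica in replicas:
        value, version = replica[key]
        if version >= last_written_version:
            return value                  # fresh enough: includes our own write
    value, _ = primary[key]               # every replica is lagging: fall back to the primary
    return value

replicas = [{"bio": ("old bio", 3)}, {"bio": ("new bio", 4)}]
primary  =  {"bio": ("new bio", 4)}
print(read_own_writes(replicas, primary, "bio", last_written_version=4))  # "new bio"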
Chapter 5: Consistency Models
5.1 Linearizability
Linearizability (Atomic Consistency)
The strongest consistency model. Operations appear to happen instantaneously at some point between invocation and response.
Every operation has a "linearization point" in real time.
Client A: write(x=1)  |-------|
Client B: read(x)                |---|
Client C: write(x=2)                   |-------|
Client D: read(x)                                |---|
                                           Time →
Linearizable execution:
write(x=1) completes
read(x) returns 1 (must see the write)
write(x=2) completes
read(x) returns 2
If read(x) started after write(x=1) completed,
it MUST return 1 or later value. Never 0.
Where It's Needed:
- Distributed locks (must see latest lock state)
- Leader election (must agree on leader)
- Unique ID generation (no duplicates)
Cost:
- Requires coordination
- Higher latency
- Lower throughput
5.2 Sequential Consistency
Sequential Consistency
Weaker than linearizability. All operations appear in SOME sequential order. Order is consistent with program order for each process. But doesn't have to match real-time order.
Client A: write(x=1), write(x=2)
Client B: read(x), read(x)
Valid sequential orderings:
write(x=1) → write(x=2) → read(x)=2 → read(x)=2 ✓
write(x=1) → read(x)=1 → write(x=2) → read(x)=2 ✓
read(x)=0 → write(x=1) → write(x=2) → read(x)=2 ✓
Invalid:
write(x=1) → write(x=2) → read(x)=2 → read(x)=1 ✗
(Once a read returns 2, a later read by the same process cannot return the older value 1)
write(x=2) → write(x=1) → ... ✗
(Violates program order: A wrote 1 before 2)
Difference from Linearizability:
- Linearizable: real-time ordering matters
- Sequential: only program order matters per process
5.3 Causal Consistency
Causal Consistency
Operations that are causally related must be seen in order. Concurrent operations can be seen in any order.
Causality: A "happens before" B if:
- A and B are in same process, A before B
- A is a write, B is a read that sees A
- Transitive: A→B and B→C means A→C
Example: Social Media Comments
Alice: posts "Hello!" (A)
Bob: sees Alice's post (causally after A)
Bob: comments "Hi Alice!" (B, causally after A)
Carol: reads the thread
Carol MUST see:
Alice's post before Bob's comment
(Because B is causally dependent on A)
But two independent comments?
Dave: comments "Nice!"
Eve: comments "Great!"
Carol might see Dave before Eve, or Eve before Dave.
These are concurrent, no causal relationship.
Implementation: Vector Clocks
Each node maintains vector of logical clocks: [Node1_time, Node2_time, Node3_time]
When sending message:
- Increment own clock
- Attach vector to message
When receiving:
- Update vector: max(local[i], received[i]) for each i
- Increment own clock
Compare vectors to determine causality:
- V1 < V2 if all V1[i] <= V2[i] and at least one V1[i] < V2[i]
- If neither V1 < V2 nor V2 < V1, they're concurrent
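A minimal vector-clock sketch following the rules above; the node IDs and dict-based representation are illustrative:

class VectorClock:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        """Before a local event or send: increment own entry."""
        self.clock[self.node_id] += 1
        return dict(self.clock)            # attach this copy to the outgoing message

    def receive(self, incoming):
        """On receive: element-wise max with the incoming vector, then increment own entry."""
        for n, t in incoming.items():
            self.clock[n] = max(self.clock.get(n, 0), t)
        self.clock[self.node_id] += 1

def happens_before(v1, v2):
    """True if v1 causally precedes v2 (V1 < V2)."""
    keys = set(v1) | set(v2)
    return all(v1.get(n, 0) <= v2.get(n, 0) for n in keys) and v1 != v2

def concurrent(v1, v2):
    return not happens_before(v1, v2) and not happens_before(v2, v1)

print(concurrent({"A": 1, "B": 0}, {"A": 0, "B": 1}))   # True — neither dominates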
Part II: Scaling Concepts
Chapter 6: Scaling Strategies
6.1 Vertical vs Horizontal Scaling
VERTICAL SCALING (Scale Up)
Add more power to existing machine:
- More CPU cores
- More RAM
- Faster disks
- Better network
Before: 4 cores, 16GB RAM, 500GB HDD
After: 64 cores, 512GB RAM, 4TB NVMe
Pros:
- Simple — no code changes
- No distributed system complexity
- ACID transactions easy
- Lower latency (no network hops)
Cons:
- Hardware limits (can't buy infinite CPU)
- Single point of failure
- Expensive (high-end hardware costs exponentially more)
- Downtime for upgrades
Use when:
- Starting out (don't prematurely distribute)
- Database with complex transactions
- Workload fits on one machine
HORIZONTAL SCALING (Scale Out)
Add more machines:
Before:                     After:
                            Load Balancer
  Server                         |
  handles all            +-------+-------+
  traffic                |       |       |
                         S1      S2      S3
Pros:
- Near-infinite scaling
- Fault tolerance (one fails, others continue)
- Cost-effective (commodity hardware)
- No downtime for scaling
Cons:
- Distributed system complexity
- Data consistency challenges
- Network latency between nodes
- More operational overhead
Use when:
- Need high availability
- Workload exceeds single machine
- Traffic is unpredictable (auto-scale)
6.2 Stateless vs Stateful Services
STATELESS SERVICES
Server holds no client-specific state between requests. All state is external (database, cache, client).
Request 1 → Server A → Database
Request 2 → Server B → Database (any server can handle)
Request 3 → Server C → Database
Characteristics:
- Any server can handle any request
- Easy to scale (add more servers)
- Easy to recover (restart server, no data lost)
- Load balancer can use round-robin
Examples: REST API servers, Web servers (serving static content), Serverless functions
STATEFUL SERVICES
Server holds state that's needed for subsequent requests.
Client A → Server 1 (has A's session)
Client B → Server 2 (has B's session)
Client A → must go to Server 1!
Characteristics:
- Requests must route to specific server
- Harder to scale (session affinity needed)
- Failover is complex (state must be replicated)
Examples: WebSocket servers, Game servers, Database servers
Making Stateful Services Scalable:
Option 1: Externalize state - Store session in Redis/Memcached. Now servers are stateless.
Option 2: Sticky sessions - Load balancer routes same client to same server. Still have failover problems.
Option 3: State replication - Replicate state across servers. Complex but handles failover.
Option 4: Partition state - Each server owns subset of state. Route requests by partition key.
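A minimal sketch of Option 1 above — externalizing session state to Redis so any app server can serve any request. Assumes the redis-py client and a Redis instance at localhost; key names and TTL are illustrative:

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def save_session(session_id, data, ttl_seconds=1800):
    # Any server can write the session; it expires automatically after the TTL.
    r.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

def load_session(session_id):
    # Any server can read it back — the app servers themselves stay stateless.
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None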
Chapter 7: Load Balancing
7.1 Load Balancing Algorithms
1. ROUND ROBIN
Requests distributed in order: 1→A, 2→B, 3→C, 4→A, 5→B...
Pros: Simple, even distribution
Cons: Ignores server capacity and current load
Use: Homogeneous servers, similar request costs
2. WEIGHTED ROUND ROBIN
Servers have weights: A(3), B(2), C(1)
Distribution: A,A,A,B,B,C,A,A,A,B,B,C...
Pros: Accounts for different server capacities
Cons: Static weights, doesn't adapt
Use: Mixed hardware environments
3. LEAST CONNECTIONS
Send to server with fewest active connections.
Pros: Adapts to actual load
Cons: Doesn't account for request complexity
Use: Long-lived connections, varying request times
4. WEIGHTED LEAST CONNECTIONS
Least connections, but weighted by server capacity.
Score = connections / weight. Send to lowest score.
Use: Mixed hardware with varying connection times
5. IP HASH
Hash client IP to determine server. Same client always goes to same server.
Pros: Session affinity without cookies
Cons: Uneven if clients have different request rates
Use: Stateful services, caching benefits
6. LEAST RESPONSE TIME
Send to server with fastest recent response times.
Pros: Accounts for actual performance
Cons: Needs continuous measurement
Use: When response time is critical
7. RANDOM
Randomly select a server.
Pros: Simple, stateless
Cons: Can be uneven short-term
Use: Large number of servers (law of large numbers)
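Three of these algorithms as tiny sketches; the server names, weights, and in-memory connection counts are illustrative:

import itertools
import random

servers = ["s1", "s2", "s3"]

# 1. Round robin: cycle through servers in order.
rr = itertools.cycle(servers)
def round_robin():
    return next(rr)

# 3. Least connections: pick the server with the fewest active connections.
active_connections = {"s1": 4, "s2": 2, "s3": 7}
def least_connections():
    return min(active_connections, key=active_connections.get)

# 7. Random (weighted variant): pick with probability proportional to capacity.
weights = {"s1": 3, "s2": 2, "s3": 1}
def weighted_random():
    return random.choices(list(weights), weights=list(weights.values()))[0]

print(round_robin(), least_connections(), weighted_random())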
7.2 Layer 4 vs Layer 7 Load Balancing
LAYER 4 (Transport Layer) LOAD BALANCING
Works at TCP/UDP level.
Sees: IP addresses, ports
Doesn't see: HTTP headers, URLs, cookies
How it works:
- Client connects to load balancer IP
- LB picks backend server
- Forwards TCP connection (NAT or DSR)
- All packets in connection go to same server
Pros:
- Very fast (minimal processing)
- Protocol agnostic (works with any TCP/UDP)
- Low latency
Cons:
- Can't route based on content
- No SSL termination
- Limited health checks
Examples: AWS NLB, HAProxy (TCP mode), LVS
LAYER 7 (Application Layer) LOAD BALANCING
Works at HTTP/HTTPS level. Sees: URLs, headers, cookies, body
How it works:
- Client connects to load balancer
- LB terminates HTTP connection
- LB inspects request (URL, headers)
- LB opens new connection to chosen backend
- Forwards request, returns response
Pros:
- Content-based routing (URL, header)
- SSL termination (offload from backends)
- Request modification (add headers)
- Better health checks (check HTTP response)
- Caching, compression, rate limiting
Cons:
- More CPU intensive
- Higher latency
- Must understand protocol
Examples: AWS ALB, nginx, HAProxy (HTTP mode), Envoy
Choosing Between Them:
Use L4 when:
- Need raw performance
- Non-HTTP protocols (databases, custom protocols)
- SSL passthrough required
Use L7 when:
- Need content-based routing
- Want SSL termination
- Need request inspection/modification
- HTTP-specific features (websocket upgrade, gRPC)
Chapter 8: Sharding and Partitioning
8.1 Partitioning Strategies
HORIZONTAL PARTITIONING (Sharding)
Split rows across multiple databases. Each shard has same schema, different data.
Example: Users table with 100M rows
Shard 1: Users 1 - 25M
Shard 2: Users 25M - 50M
Shard 3: Users 50M - 75M
Shard 4: Users 75M - 100M
1. RANGE PARTITIONING
Partition by value ranges.
user_id 1-1M → Shard A
user_id 1M-2M → Shard B
user_id 2M-3M → Shard C
Pros:
- Range queries efficient (users 100-200 on same shard)
- Easy to understand
Cons:
- Hot spots (new users always hit latest shard)
- Uneven distribution if ranges are unequal
2. HASH PARTITIONING
Partition by hash of key.
shard = hash(user_id) % num_shards
hash(1) % 4 = 1 → Shard B
hash(2) % 4 = 2 → Shard C
hash(3) % 4 = 3 → Shard D
hash(4) % 4 = 0 → Shard A
Pros:
- Even distribution (if hash is good)
- No hot spots
Cons:
- Range queries hit all shards
- Adding shards requires rehashing (use consistent hashing!)
3. DIRECTORY-BASED PARTITIONING
Lookup table maps keys to shards.
Lookup Service:
user_1 → Shard A
user_2 → Shard B
user_3 → Shard A
...
Pros:
- Flexible placement
- Can move data easily
Cons:
- Lookup service is bottleneck/SPOF
- Extra hop for every query
4. GEOGRAPHIC PARTITIONING
Partition by location for data locality.
US users → US data center
EU users → EU data center
Asia users → Asia data center
Pros:
- Low latency for users
- Data sovereignty compliance
Cons:
- Complex for global data
- Cross-region queries expensive
8.2 Consistent Hashing
The Problem with Simple Hashing
Simple: shard = hash(key) % N
Add a server (N=4 → N=5):
hash(key) % 4 = 2 → Shard C
hash(key) % 5 = 3 → Shard D (different!)
Almost ALL keys move to different shards! Massive data migration, cache invalidation.
Consistent Hashing
Imagine a ring (0 to 2^32):
0
/|\
/ | \
/ | \
270 | 90
\ | /
\ | /
\|/
180
Place servers on ring (by hashing server name):
- hash("server-a") = 45 → position 45
- hash("server-b") = 120 → position 120
- hash("server-c") = 200 → position 200
- hash("server-d") = 300 → position 300
Place keys on ring, assign to next server clockwise:
- hash("user-123") = 100 → server-b (next after 100 is 120)
- hash("user-456") = 250 → server-d (next after 250 is 300)
Adding a Server
Add server-e at position 150:
Before: Keys 120-200 → server-c
After:  Keys 120-150 → server-e
        Keys 150-200 → server-c
Only keys between server-b and server-e move! About 1/N of keys move instead of all keys.
Virtual Nodes
Problem: Servers might be unevenly distributed on ring.
Solution: Each physical server gets multiple positions.
server-a → positions 45, 160, 280 (3 virtual nodes)
server-b → positions 90, 200, 320 (3 virtual nodes)
More even distribution, and bigger servers can have more virtual nodes.
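A minimal consistent-hash ring with virtual nodes, as a sketch — the hash function, server names, and virtual-node count are illustrative:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []                          # sorted list of (position, server)
        for server in servers:
            for i in range(vnodes):             # each server appears vnodes times on the ring
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def get_server(self, key):
        # Walk clockwise from the key's position to the next server on the ring.
        idx = bisect.bisect_right(self.positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c", "server-d"])
print(ring.get_server("user-123"))

Adding a server only reassigns the keys that now fall between its virtual nodes and their predecessors — roughly 1/N of the total.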
Used By:
- Amazon DynamoDB
- Apache Cassandra
- Memcached clients
- Content Delivery Networks
Chapter 9: Replication
9.1 Replication Strategies
SINGLE-LEADER REPLICATION
One primary (leader), multiple secondaries (followers). Writes go to leader, replicate to followers. Reads can go to any node.
Client
|
v
Leader ----→ Follower 1
| |
| +--→ Follower 2
| |
| +--→ Follower 3
|
(writes) (reads can go anywhere)
SYNCHRONOUS REPLICATION
Leader waits for followers to confirm before returning success.
Client → Leader → Followers (wait for ACK) → Success to Client
Pros:
- Strong consistency
- No data loss on leader failure
Cons:
- High latency (wait for slowest follower)
- Availability impact (follower down = writes blocked)
ASYNCHRONOUS REPLICATION
Leader returns success immediately, replicates in background.
Client → Leader → Success to Client
            |
            v (async)
         Followers
Pros:
- Low latency
- High availability
Cons:
- Potential data loss (leader fails before replication)
- Followers may be behind (replication lag)
SEMI-SYNCHRONOUS REPLICATION
Wait for at least one follower to confirm. Leader confirms after 1+ followers ACK. Other followers replicate async.
Balance between consistency and performance. Used by: MySQL semi-sync replication
9.2 Multi-Leader Replication
MULTI-LEADER (Multi-Master) REPLICATION
Multiple nodes accept writes. Changes replicate between all leaders.
Leader A <-----> Leader B <-----> Leader C
| | |
Reads Reads Reads
Use cases:
- Multi-datacenter (leader per datacenter)
- Offline-capable applications (each device is a leader)
- Collaborative editing
Conflict Resolution
What if two leaders modify same record simultaneously?
Leader A: UPDATE user SET name = 'Alice'
Leader B: UPDATE user SET name = 'Bob'
Conflict! Which one wins?
Strategies:
1. LAST-WRITE-WINS (LWW) Timestamp each write, highest timestamp wins.
Simple but can lose data. Time sync issues (clock skew).
2. VERSION VECTORS Track version per node. Detect conflicts, let application resolve.
User A: [A:1, B:0] name = 'Alice'
User B: [A:0, B:1] name = 'Bob'
Neither dominates → conflict!
3. MERGE Combine both values somehow.
For shopping cart: union of items
For counters: sum of increments
For text: operational transformation
4. CUSTOM RESOLUTION Application-specific logic.
"Most recent login location wins" "Merge address fields, flag for review"
9.3 Leaderless Replication
LEADERLESS REPLICATION
No single leader. Client writes to multiple nodes. Reads from multiple nodes, use quorum.
Client writes to N nodes
Client reads from N nodes
R + W > N ensures overlap (quorum)
N = 3 (total replicas)
W = 2 (write to 2 nodes)
R = 2 (read from 2 nodes)
W + R = 4 > 3 → guaranteed to read latest write
Example: Write x=5, then read
Write x=5 to nodes A and B (W=2)
Node A: x=5
Node B: x=5
Node C: x=1 (stale, missed write)
Read from nodes B and C (R=2)
Get x=5 from B, x=1 from C
Return x=5 (highest version)
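A sketch of that quorum read, assuming each replica is a dict of key → (value, version) and the highest version wins:

def quorum_read(nodes, key, r):
    """Read from r replicas and return the value with the highest version."""
    responses = []
    for node in nodes[:r]:
        value, version = node.get(key, (None, -1))
        responses.append((version, value))
    version, value = max(responses)        # highest version number wins
    return value

node_a = {"x": (5, 2)}    # value 5, version 2
node_b = {"x": (5, 2)}
node_c = {"x": (1, 1)}    # stale, missed the write

print(quorum_read([node_b, node_c], "x", r=2))   # 5 — version 2 beats the stale version 1

Real systems also write the fresh value back to the stale replica (read repair).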
Quorum Configuration
- R=1, W=N: Fast reads, slow writes; a write blocks or fails if any replica is down
- R=N, W=1: Slow reads, fast writes, write may be lost
- R=W=N/2+1: Balanced, typical configuration
N=3, R=2, W=2: Good balance
N=5, R=3, W=3: More fault tolerant
Used By:
- Amazon DynamoDB (configurable)
- Apache Cassandra
- Riak
Sloppy Quorum and Hinted Handoff
If designated node is down, write to another node temporarily. "Hint" stored that data belongs elsewhere. When original node recovers, data transferred back.
Improves availability but weakens consistency guarantee.
Part III: Data Concepts
Chapter 10: Data Storage Patterns
10.1 SQL vs NoSQL
SQL (Relational) DATABASES
Structured data with fixed schema. Tables, rows, columns. SQL query language. ACID transactions.
Examples: PostgreSQL, MySQL, Oracle, SQL Server
Best for:
- Complex queries with JOINs
- Transactions across multiple tables
- Strong consistency requirements
- Well-defined schema
NoSQL DATABASES
Various data models, flexible schema. Optimized for specific access patterns.
Types:
1. KEY-VALUE STORES
Simple: key → value
Examples: Redis, DynamoDB, Memcached
Use: Caching, session storage, simple lookups
2. DOCUMENT STORES
Key → JSON/BSON document
Examples: MongoDB, CouchDB, Firestore
Use: Content management, catalogs, user profiles
3. COLUMN-FAMILY (Wide Column)
Rows with dynamic columns, column families
Examples: Cassandra, HBase, ScyllaDB
Use: Time series, IoT, large-scale analytics
4. GRAPH DATABASES
Nodes and relationships
Examples: Neo4j, Amazon Neptune, JanusGraph
Use: Social networks, recommendation, fraud detection
Comparison:
| Aspect | SQL | NoSQL |
|---|---|---|
| Schema | Fixed, defined | Flexible, dynamic |
| Scaling | Vertical (mainly) | Horizontal (designed for) |
| Consistency | Strong (ACID) | Eventual (BASE) typically |
| Query Language | SQL | Varies by database |
| JOINs | Native support | Usually application-side |
| Transactions | Multi-table | Usually single-document |
10.2 Indexing
INDEX TYPES
1. B-TREE INDEX (Most Common)
Balanced tree structure. Good for: equality, range queries, sorting
Query: WHERE age > 25 AND age < 50
Without index: Scan all rows (O(n))
With B-tree: Navigate tree (O(log n))
2. HASH INDEX
Hash table structure. Good for: exact equality only
Query: WHERE id = 12345
Hash id → find bucket → get value (O(1))
Cannot do: range queries, sorting
3. INVERTED INDEX (Full-Text Search)
Maps words to documents containing them.
"hello" → [doc1, doc3, doc7]
"world" → [doc1, doc5]
Query: "hello world"
Intersection: [doc1]
Used by: Elasticsearch, PostgreSQL full-text
4. BITMAP INDEX
Bit array for each value. Good for: low-cardinality columns
gender='M': 1,0,1,1,0,1,0,0...
gender='F': 0,1,0,0,1,0,1,1...
Fast for: COUNT, AND/OR operations
5. GEOSPATIAL INDEX (R-Tree, Quad-tree)
Indexes 2D/3D coordinates.
Query: Find all restaurants within 5km
Used by: PostGIS, MongoDB, Redis
COMPOSITE INDEX
Index on multiple columns. Order matters!
INDEX (country, city, street)
Efficient for:
WHERE country = 'US'
WHERE country = 'US' AND city = 'NYC'
WHERE country = 'US' AND city = 'NYC' AND street = 'Broadway'
NOT efficient for:
WHERE city = 'NYC' (leftmost column not used)
WHERE country = 'US' AND street = 'Broadway' (gap in columns)
COVERING INDEX
Index contains all columns needed for query. No need to access table data.
INDEX (user_id, name, email)
Query: SELECT name, email WHERE user_id = 123
Index has all data → no table lookup → faster!
10.3 Write-Ahead Log and LSM Trees
WRITE-AHEAD LOG (WAL)
Before modifying data, write to append-only log. Used for durability and crash recovery.
Write path:
- Append operation to WAL (sequential write, fast)
- fsync WAL to disk
- Apply to in-memory data structures
- Periodically flush to data files (checkpoint)
Recovery:
- Load last checkpoint
- Replay WAL from checkpoint
- System is consistent
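A toy write-ahead log, sketching the write and recovery paths above; the file name and JSON record format are illustrative:

import json
import os

class TinyWAL:
    def __init__(self, path="wal.log"):
        self.log = open(path, "a+")
        self.data = {}

    def set(self, key, value):
        record = json.dumps({"op": "set", "key": key, "value": value})
        self.log.write(record + "\n")      # 1. append the intention to the log
        self.log.flush()
        os.fsync(self.log.fileno())        # 2. force it onto persistent storage
        self.data[key] = value             # 3. only then apply it in memory

    def recover(self):
        """After a crash: replay the log to rebuild in-memory state."""
        self.log.seek(0)
        for line in self.log:
            record = json.loads(line)
            if record["op"] == "set":
                self.data[record["key"]] = record["value"]

store = TinyWAL()
store.set("balance:A", 900)
store.set("balance:B", 1100)

Real engines also checkpoint and truncate the log, which this sketch omits.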
LSM TREE (Log-Structured Merge Tree)
Optimized for write-heavy workloads. Used by: LevelDB, RocksDB, Cassandra, HBase
Structure:
MemTable (in-memory, sorted)
|
| (flush when full)
v
Level 0 SSTables (Sorted String Tables on disk)
|
| (compaction merges files)
v
Level 1 SSTables
|
v
Level N SSTables
Write path:
- Write to MemTable (memory, fast)
- When MemTable full, flush to Level 0 SSTable
- Background compaction merges and sorts levels
Read path:
- Check MemTable
- Check each level (newest to oldest)
- Use bloom filters to skip files without key
Pros:
- Excellent write performance (sequential I/O)
- Compression friendly
Cons:
- Read amplification (check multiple levels)
- Write amplification (compaction rewrites data)
- Space amplification (data exists in multiple levels)
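A toy LSM tree showing the write path (memtable, then flush to sorted immutable runs) and the newest-to-oldest read path; real engines add a WAL, bloom filters, and compaction:

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []                   # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value           # writes hit memory first (fast)
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        run = sorted(self.memtable.items())  # an "SSTable": sorted and immutable
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # check memory first
            return self.memtable[key]
        for run in self.sstables:            # then each run, newest to oldest
            for k, v in run:                 # (real engines binary-search and use bloom filters)
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"key-{i}", i)
print(db.get("key-3"))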
Chapter 11: Caching Concepts
11.1 Cache Patterns
CACHING STRATEGIES
1. CACHE-ASIDE (Lazy Loading)
Application manages cache explicitly.
Read:
data = cache.get(key)
if data is None:
    data = db.query(key)
    cache.set(key, data)
return data
Write:
db.update(key, data)
cache.delete(key) # Invalidate
Pros: Only requested data is cached Cons: Cache miss penalty, potential stale data
2. READ-THROUGH
Cache sits in front of database. Cache handles misses automatically.
Read:
data = cache.get(key) # Cache fetches from DB on miss
Pros: Simpler application code Cons: Cache library must know how to query DB
3. WRITE-THROUGH
Writes go to cache, cache writes to database. Synchronous — returns after both succeed.
Write:
cache.set(key, data) # Also writes to DB
Pros: Cache always consistent with DB Cons: Higher write latency
4. WRITE-BEHIND (Write-Back)
Write to cache, asynchronously write to database.
Write:
cache.set(key, data) # Returns immediately
# Background process writes to DB later
Pros: Very fast writes Cons: Data loss risk if cache fails before flush
5. REFRESH-AHEAD
Proactively refresh cache before expiration.
If item accessed and TTL < threshold: Async refresh in background
Pros: Reduces cache miss latency Cons: May refresh unused data
11.2 Cache Eviction Policies
EVICTION POLICIES
When cache is full, which item to remove?
1. LRU (Least Recently Used)
Remove item that hasn't been accessed longest.
Access order: A, B, C, D, A, B Evict: C (least recently used)
Pros: Good general-purpose policy Cons: One-time scans pollute cache
Used by: Most caches by default
2. LFU (Least Frequently Used)
Remove item with fewest accesses.
Access counts: A(10), B(2), C(5), D(1) Evict: D (least frequent)
Pros: Keeps popular items Cons: Old popular items never evicted
Variant: LFU with aging (decay counts over time)
3. FIFO (First In, First Out)
Remove oldest item.
Insert order: A, B, C, D Evict: A (first inserted)
Pros: Simple to implement Cons: Doesn't consider access patterns
4. TTL (Time To Live)
Items expire after fixed time.
cache.set(key, value, ttl=300) # Expires in 5 minutes
Pros: Bounds staleness Cons: Popular items may expire unnecessarily
5. RANDOM
Randomly select item to evict.
Pros: Simple, no metadata overhead Cons: May evict important items
Real-World Combinations:
- Redis: Multiple policies (LRU, LFU, TTL, random)
- Memcached: LRU with slab allocation
- CPU caches: Pseudo-LRU (approximate for speed)
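The LRU policy from item 1 above, sketched with an OrderedDict; the capacity and keys are illustrative:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used item

cache = LRUCache(capacity=3)
for k in ["A", "B", "C", "D"]:
    cache.set(k, k)
print(cache.get("A"))                        # None — A was evicted when D arrived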
11.3 Cache Problems
CACHE PROBLEMS AND SOLUTIONS
1. CACHE STAMPEDE (Thundering Herd)
Problem:
- Popular key expires
- 100 requests arrive simultaneously
- All miss cache, all query database
- Database overwhelmed
Solutions:
a) Locking: Only one request fetches, others wait
if not cache.get(key):
    if cache.acquire_lock(key):
        data = db.query()
        cache.set(key, data)
        cache.release_lock(key)
    else:
        wait_and_retry()
b) Probabilistic early refresh: Refresh before expiration with some probability. Spreads refresh load over time.
2. CACHE PENETRATION
Problem:
- Query for non-existent key (e.g., user ID 999999999)
- Cache always misses
- Database queried every time
- Attacker can DoS with fake keys
Solutions:
a) Cache null values:
if db.query(key) is None:
    cache.set(key, NULL, ttl=60)
b) Bloom filter: Check bloom filter before database. If not in bloom filter, definitely doesn't exist.
3. CACHE BREAKDOWN
Problem:
- Single hot key expires
- Massive traffic hits database
Solutions:
a) Never expire hot keys
b) Background refresh before expiration
c) Mutex lock (only one request refreshes)
4. HOT KEY
Problem:
- One key gets disproportionate traffic
- Single cache node overwhelmed
Solutions:
a) Local caching: Cache hot keys in application memory
b) Key replication: Store same key on multiple nodes.
   key_1, key_2, key_3 all store the same data, hashed to different nodes.
c) Read replicas: Multiple cache replicas for reads
Part IV: Messaging Concepts
Chapter 12: Message Queues and Streams
12.1 Queue vs Stream
MESSAGE QUEUE (e.g., RabbitMQ, SQS)
Message consumed by ONE consumer, then deleted.
Producer → [Queue] → Consumer
Messages: A, B, C, D
Consumer 1 takes: A, C
Consumer 2 takes: B, D
After consumption, messages are gone.
Characteristics:
- Competing consumers (work distribution)
- At-least-once delivery (with acknowledgment)
- No replay (message deleted after consume)
- Good for: Task queues, work distribution
MESSAGE STREAM (e.g., Kafka, Kinesis)
Messages retained, multiple consumers can read.
Producer → [Stream/Log] → Consumer Group A
                        → Consumer Group B
Messages: A, B, C, D (stored with offsets)
Consumer Group A: reads A, B, C, D at their pace
Consumer Group B: reads A, B, C, D independently
Messages retained for configured period.
Characteristics:
- Multiple consumer groups (fan-out)
- Replay capability (rewind to past offset)
- Ordered within partition
- Good for: Event sourcing, real-time analytics, audit logs
Comparison:
| Feature | Queue | Stream |
|---|---|---|
| Consumption | Destructive | Non-destructive |
| Consumer groups | Competing | Independent |
| Replay | No | Yes |
| Ordering | Per-queue | Per-partition |
| Retention | Until consumed | Time/size based |
| Use case | Task distribution | Event streaming |
12.2 Delivery Guarantees
DELIVERY SEMANTICS
1. AT-MOST-ONCE
Message may be lost, never delivered twice.
Producer sends → No acknowledgment.
If lost, gone forever.
Use case: Metrics, logs where loss is acceptable Trade-off: Fastest, least reliable
2. AT-LEAST-ONCE
Message will be delivered, possibly multiple times.
Producer sends → Wait for ACK.
If no ACK, retry.
May deliver twice if ACK lost but message received.
Use case: Most applications (with idempotent consumers) Trade-off: Reliable but needs deduplication
3. EXACTLY-ONCE
Message delivered exactly once.
Very hard to achieve in distributed systems!
Approaches:
a) Idempotent producer + idempotent consumer
   Producer retries with same ID.
   Consumer tracks processed IDs, ignores duplicates.
b) Transactional messaging
   Kafka transactions: atomic produce + consume
Use case: Financial transactions, inventory Trade-off: Most complex, some performance cost
Implementing Exactly-Once (Practical):
Producer side (idempotent):
message_id = generate_unique_id()
send(message_id, payload) # Retry with same ID
Consumer side (idempotent):
if message_id in processed_set:
    return  # Already processed
process(message)
processed_set.add(message_id)
With transactional outbox:
BEGIN TRANSACTION
INSERT INTO outbox (message)
UPDATE business_table
COMMIT
-- Separate process reads outbox, publishes to queue
Chapter 13: Event-Driven Architecture
13.1 Event Sourcing
EVENT SOURCING
Store events, not current state. Current state = replay all events.
Traditional (State-based):
Account table:
id=123, balance=1000, last_updated=2024-01-15
Problems:
- Lost history of how we got here
- Can't answer "what was balance on Jan 1?"
Event Sourcing:
Events table:
AccountCreated(id=123, initial=0)
Deposited(id=123, amount=500)
Deposited(id=123, amount=700)
Withdrawn(id=123, amount=200)
Current balance:
0 + 500 + 700 - 200 = 1000
Balance on Jan 1:
Replay events up to Jan 1
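Deriving state by folding over the event stream, as a sketch; the event names mirror the example above:

events = [
    ("AccountCreated", {"id": 123, "initial": 0}),
    ("Deposited",      {"id": 123, "amount": 500}),
    ("Deposited",      {"id": 123, "amount": 700}),
    ("Withdrawn",      {"id": 123, "amount": 200}),
]

def apply(balance, event):
    kind, data = event
    if kind == "AccountCreated":
        return data["initial"]
    if kind == "Deposited":
        return balance + data["amount"]
    if kind == "Withdrawn":
        return balance - data["amount"]
    return balance                 # unknown event types are ignored

balance = 0
for event in events:               # replay up to any point in time for "time travel"
    balance = apply(balance, event)
print(balance)                     # 1000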
Benefits:
- Complete audit trail
- Time travel (state at any point)
- Debugging (replay events to reproduce bugs)
- Event replay (rebuild read models, fix bugs)
Challenges:
- Event schema evolution
- Rebuild time for long event streams
- Snapshots needed for performance
CQRS (Command Query Responsibility Segregation)
Separate write model (events) from read model (projections).
Commands → Write Model (Event Store)
                  |
                  | (project events)
                  v
            Read Model (Optimized for queries)
                  ^
                  |
Queries → --------+
Write: Append event to event store
Read: Query from denormalized read model
Read model can be:
- Relational database with joins pre-computed
- Search index (Elasticsearch)
- Cache (Redis)
- Multiple different views of same data
13.2 Saga Pattern
SAGA PATTERN
Manage distributed transactions across services. Instead of ACID, use sequence of local transactions with compensations.
Example: Order Placement
1. Order Service: Create order (PENDING)
2. Payment Service: Charge customer
3. Inventory Service: Reserve items
4. Shipping Service: Schedule delivery
5. Order Service: Mark order (CONFIRMED)
If step 3 fails (out of stock):
- Compensate step 2: Refund payment
- Compensate step 1: Cancel order
Saga Execution Styles
1. CHOREOGRAPHY
Each service listens for events, decides what to do.
Order Created → Payment listens → Payment Charged event
Payment Charged → Inventory listens → Items Reserved event
Items Reserved → Shipping listens → Delivery Scheduled event
Pros: Decoupled, simple for few steps Cons: Hard to track, distributed logic
2. ORCHESTRATION
Central orchestrator controls the flow.
Orchestrator:
1. Tell Order: Create order
2. Tell Payment: Charge
3. Tell Inventory: Reserve
4. Tell Shipping: Schedule
5. Tell Order: Confirm
If failure:
Execute compensations in reverse order
Pros: Clear flow, easy to track/debug Cons: Orchestrator is potential bottleneck
Saga Compensations
Each action needs a compensation:
| Action | Compensation |
|---|---|
| Create order | Cancel order |
| Charge payment | Refund payment |
| Reserve inventory | Release inventory |
| Schedule shipping | Cancel shipping |
Compensations must be idempotent! May be called multiple times.
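An orchestration-style saga as a sketch: each step pairs an action with its compensation, and completed steps are compensated in reverse on failure. The service calls here are print-stub placeholders, not real service clients:

def run_saga(order_id, steps):
    completed = []
    try:
        for action, compensation in steps:
            action(order_id)
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation(order_id)             # compensations must be idempotent
        raise

steps = [
    (lambda o: print("create order", o),      lambda o: print("cancel order", o)),
    (lambda o: print("charge payment", o),    lambda o: print("refund payment", o)),
    (lambda o: print("reserve items", o),     lambda o: print("release items", o)),
    (lambda o: print("schedule shipping", o), lambda o: print("cancel shipping", o)),
]
run_saga("order-42", steps)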
Part V: Reliability Concepts
Chapter 14: Fault Tolerance
14.1 Failure Types
FAILURE MODES
1. CRASH FAILURE
Node stops and doesn't respond.
Detection: Heartbeat timeout Recovery: Failover to replica
Example: Process killed, server power off
2. OMISSION FAILURE
Node fails to send or receive messages.
Send omission: Node doesn't send response Receive omission: Node doesn't receive request
Often indistinguishable from crash.
3. TIMING FAILURE
Response outside acceptable time window.
Example: Response takes 30 seconds instead of 100ms
For caller: Looks like failure Challenge: Did request succeed or not?
4. BYZANTINE FAILURE
Node behaves arbitrarily (including maliciously).
Example: Node sends different data to different nodes
Hardest to handle. Requires 3f+1 nodes to tolerate f failures.
Relevant for: Blockchain, hostile environments
Detection Methods:
Heartbeat:
- Node periodically sends "I'm alive" message.
- Missing heartbeats → suspected failure.
- Challenge: Network partition vs actual failure
Timeout:
- Request times out → suspected failure.
- Challenge: Slow vs failed
Lease:
- Node holds lease (timed permission).
- Must renew before expiration.
- No renewal → other nodes can take over.
- Used by: Distributed locks, leader election
14.2 Redundancy Patterns
REDUNDANCY PATTERNS
1. ACTIVE-PASSIVE (Hot Standby)
Primary handles all traffic. Standby replicates data, takes over on failure.
 Primary  ◄═══════════════►  Standby
 (active)    replication     (passive)
    ↑
 traffic
Failover: Promote standby to primary
Pros: Simple, standby has consistent data Cons: Standby resources underutilized
2. ACTIVE-ACTIVE
Multiple nodes handle traffic simultaneously.
 Node A  ◄════════════►  Node B
(active)   sync/async   (active)
    ↑                       ↑
 traffic                 traffic
Pros: Better resource utilization, no failover needed Cons: Conflict resolution, data consistency
3. N+1 REDUNDANCY
N nodes handle normal load. +1 spare for failures.
Example: 3 nodes needed for traffic, deploy 4. One can fail without impact.
4. N+2 REDUNDANCY
One failure + one for maintenance.
Example: 3+2 = 5 nodes. Handles one failure while another is being updated.
5. 2N REDUNDANCY
Double everything.
Example: Need 3 nodes, deploy 6. Half can fail without impact.
Used for: Critical systems, disaster recovery
14.3 Circuit Breaker Pattern
CIRCUIT BREAKER
Prevent cascading failures by failing fast.
States:
- CLOSED: Normal operation, requests pass through
- OPEN: Failure threshold exceeded, requests fail immediately
- HALF-OPEN: Test if downstream recovered
State Machine:
CLOSED    ──(failures > threshold)───►  OPEN
OPEN      ──(timeout expires)────────►  HALF-OPEN
HALF-OPEN ──(test request succeeds)──►  CLOSED
HALF-OPEN ──(test request fails)─────►  OPEN
Implementation:
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, threshold=5, timeout=30):
        self.failure_count = 0
        self.threshold = threshold
        self.timeout = timeout          # seconds to stay OPEN before probing
        self.state = CLOSED
        self.last_failure_time = None

    def call(self, func):
        if self.state == OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = HALF_OPEN  # allow one test request through
            else:
                raise CircuitOpenError()
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = OPEN
BULKHEAD PATTERN
Isolate failures to prevent spreading. Like ship bulkheads that contain flooding.
Service A ─────► [Pool 1] ─────► Dependency X
Service A ─────► [Pool 2] ─────► Dependency Y
Pool 1 exhausted (X is slow)? Pool 2 still works (Y unaffected).
Implementation:
- Separate thread pools per dependency
- Separate connection pools
- Separate rate limits
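A bulkhead sketch using a separate bounded thread pool per dependency, so a slow dependency X cannot exhaust the workers needed for Y; pool sizes and the stub dependencies are illustrative:

import time
from concurrent.futures import ThreadPoolExecutor

def slow_dependency_x(request):
    time.sleep(2)                    # simulate a slow downstream service
    return f"X:{request}"

def dependency_y(request):
    return f"Y:{request}"

pool_x = ThreadPoolExecutor(max_workers=10)   # only calls to X queue here
pool_y = ThreadPoolExecutor(max_workers=10)   # Y keeps its own workers

def call_x(request):
    return pool_x.submit(slow_dependency_x, request)

def call_y(request):
    return pool_y.submit(dependency_y, request)

futures = [call_x(i) for i in range(50)]      # X's pool backs up...
print(call_y("hello").result())               # ...but Y still responds immediately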
Chapter 15: SLIs, SLOs, and SLAs
15.1 Definitions
SERVICE LEVEL TERMINOLOGY
SLI (Service Level Indicator) A metric that measures service behavior.
Examples:
- Request latency (p99 < 100ms)
- Error rate (errors / total requests)
- Availability (uptime / total time)
- Throughput (requests per second)
SLO (Service Level Objective) Target value for an SLI. Internal goal.
Examples:
- "99.9% of requests complete in < 100ms"
- "Error rate < 0.1%"
- "99.95% availability"
SLA (Service Level Agreement) Contract with consequences. External commitment.
Examples:
- "99.9% uptime or customer gets credit"
- "Response within 4 hours or penalty"
SLA should be less strict than SLO! (Buffer for unexpected issues)
Relationship:
SLI ──measures──► SLO ──commits──► SLA
SLI: "Our p99 latency is 85ms"
SLO: "We aim for p99 < 100ms"
SLA: "We guarantee p99 < 200ms" (with penalties)
15.2 Error Budgets
ERROR BUDGET
If SLO is 99.9% availability: Error budget = 0.1% downtime allowed
Per month (30 days): 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime allowed
Error Budget Policy:
- Budget remaining → Ship fast, take risks
- Budget exhausted → Focus on reliability, freeze features
Budget at 80%: Normal development
Budget at 50%: Increased caution
Budget at 20%: Reliability focus
Budget at 0%: Feature freeze, fix reliability
Calculating Error Budget:
Monthly error budget for 99.9% availability:
Total minutes: 30 × 24 × 60 = 43,200 minutes
Allowed downtime: 43,200 × 0.001 = 43.2 minutes
Incidents this month:
- Deploy failure: 15 minutes
- Database issue: 10 minutes
Budget used: 25 minutes
Budget remaining: 18.2 minutes (42% left)
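The same arithmetic as a small script; the incident durations are the example numbers above:

slo = 0.999
minutes_in_month = 30 * 24 * 60                  # 43,200 minutes
budget = minutes_in_month * (1 - slo)            # 43.2 minutes of allowed downtime

incidents = [15, 10]                             # deploy failure, database issue
used = sum(incidents)
remaining = budget - used
print(f"budget={budget:.1f} min, used={used} min, "
      f"remaining={remaining:.1f} min ({remaining / budget:.0%} left)")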
Monitoring Error Budget:
Track:
- Current error budget consumption rate
- Projected budget at end of period
- Alert when consumption rate too high
If burning budget faster than expected:
- Investigate
- Maybe pause risky deployments
- Focus on reliability
Summary
Key Concepts at a Glance
THEOREMS AND TRADE-OFFS:
CAP Theorem:
- Consistency, Availability, Partition tolerance
- During partition: choose C or A
- CP: refuse requests if can't guarantee consistency
- AP: serve requests with possibly stale data
PACELC:
- Extends CAP: what about when no partition?
- EL: prioritize low latency
- EC: prioritize consistency
ACID (Databases):
- Atomicity: all or nothing
- Consistency: valid state to valid state
- Isolation: transactions don't interfere
- Durability: committed = permanent
BASE (Distributed):
- Basically Available
- Soft state
- Eventually consistent
CONSISTENCY MODELS (Strong to Weak):
Linearizability → Sequential → Causal → Eventual
SCALING:
- Vertical: bigger machine (simple, limited)
- Horizontal: more machines (complex, unlimited)
- Stateless: easy to scale, any server works
- Stateful: needs session affinity or externalized state
SHARDING:
- Range: predictable, hot spots possible
- Hash: even distribution, no range queries
- Consistent hashing: minimal redistribution on changes
REPLICATION:
- Single-leader: writes to one, reads from many
- Multi-leader: writes anywhere, conflict resolution needed
- Leaderless: quorum reads/writes (R + W > N)
- Sync: consistent but slow
- Async: fast but possible data loss
CACHING:
- Patterns: cache-aside, read-through, write-through, write-behind
- Eviction: LRU, LFU, TTL
- Problems: stampede, penetration, breakdown, hot keys
MESSAGING:
- Queue: one consumer, message deleted
- Stream: multiple consumers, replay possible
- Delivery: at-most-once, at-least-once, exactly-once
- Saga: distributed transactions with compensations
RELIABILITY:
- Circuit breaker: fail fast when downstream is down
- Bulkhead: isolate failures
- Redundancy: N+1, N+2, 2N
- SLI: metric (latency, error rate)
- SLO: target (99.9% availability)
- SLA: contract (penalties for breach)
- Error budget: allowed failure within SLO
📚 Further Reading
Books
- Designing Data-Intensive Applications by Martin Kleppmann
- The definitive guide to distributed systems concepts
- Database Internals by Alex Petrov
- Deep dive into storage engines and distributed databases
- Site Reliability Engineering by Google
- SLIs, SLOs, error budgets, and operational practices
Papers
- CAP Twelve Years Later by Eric Brewer
- Harvest, Yield, and Scalable Tolerant Systems by Fox & Brewer
- Dynamo: Amazon's Highly Available Key-value Store
- Spanner: Google's Globally-Distributed Database
Online Resources
- Distributed Systems for Fun and Profit (free book)
- jepsen.io - Distributed systems correctness testing
- Martin Kleppmann's Blog - https://martin.kleppmann.com/
- High Scalability Blog - http://highscalability.com/
Videos
- MIT 6.824: Distributed Systems (free course)
- Martin Kleppmann's Conference Talks
End of Week 0 — Part 6: System Design Terminology and Theory
Week 0 Complete! You now have a solid foundation in:
- System design methodology (Part 1)
- Infrastructure building blocks (Part 2)
- Back-of-envelope estimation (Part 3)
- Networking fundamentals (Part 4)
- Operating system concepts (Part 5)
- Theory and terminology (Part 6)
Ready to dive into Week 1: Foundations of Scale — where we apply these concepts to real system designs!