Foundation
Week 0 — Part 5: Operating System Concepts for System Design
System Design Mastery Series
Preface
Every distributed system runs on operating systems. When you say "the server handles 10,000 requests per second," you're relying on OS concepts you might not consciously think about.
Your application code sits on top of:
┌─────────────────────────────────────────────────────────────────┐
│ YOUR APPLICATION │
├─────────────────────────────────────────────────────────────────┤
│ Runtime (Python, JVM, Go, Node.js) │
├─────────────────────────────────────────────────────────────────┤
│ System Libraries (libc) │
├─────────────────────────────────────────────────────────────────┤
│ Operating System Kernel │
│ • Process Management • Memory Management │
│ • File Systems • I/O Subsystem │
│ • Networking Stack • Device Drivers │
├─────────────────────────────────────────────────────────────────┤
│ Hardware │
└─────────────────────────────────────────────────────────────────┘
Understanding OS concepts helps you:
- Tune your systems for performance
- Debug production issues
- Make informed architecture decisions
- Answer interview questions confidently
This document covers the operating system concepts you need for system design interviews and for reasoning about production systems.
Part I: Processes and Threads
Chapter 1: Processes
1.1 What Is a Process?
PROCESS = Running instance of a program
A process has:
┌────────────────────────────────────────────────────────────────────────┐
│ PROCESS │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ MEMORY LAYOUT │ │
│ │ │ │
│ │ High addresses │ │
│ │ ┌──────────────────────┐ │ │
│ │ │ Stack │ ← Local variables, function calls │ │
│ │ │ ↓ │ Grows downward │ │
│ │ ├──────────────────────┤ │ │
│ │ │ ↑ │ │ │
│ │ │ Heap │ ← Dynamic allocation (malloc/new) │ │
│ │ │ │ Grows upward │ │
│ │ ├──────────────────────┤ │ │
│ │ │ BSS │ ← Uninitialized global variables │ │
│ │ ├──────────────────────┤ │ │
│ │ │ Data │ ← Initialized global variables │ │
│ │ ├──────────────────────┤ │ │
│ │ │ Text │ ← Program code (read-only) │ │
│ │ └──────────────────────┘ │ │
│ │ Low addresses │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ + Process ID (PID) │
│ + Open file descriptors │
│ + Environment variables │
│ + CPU register state │
│ + Scheduling priority │
│ │
└────────────────────────────────────────────────────────────────────────┘
1.2 Process States
PROCESS LIFECYCLE
┌─────────────────┐
│ NEW │
│ (Created) │
└────────┬────────┘
│ Admitted
▼
┌─────────────────────────────────────────────────────┐
│ │
│ ┌─────────────┐ Scheduled ┌─────────────┐ │
│ │ │ ──────────────▶ │ │ │
│ │ READY │ │ RUNNING │ │
│ │ (Waiting │ ◀────────────── │ (Executing) │ │
│ │ for CPU) │ Preempted │ │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ │ ┌─────────────┐ │ │
│ │ │ │ │ │
│ └────────▶│ WAITING │◀──────┘ │
│ I/O complete │ (Blocked) │ I/O request │
│ │ │ │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────┘
│
│ Exit
▼
┌─────────────────┐
│ TERMINATED │
│ (Zombie) │
└─────────────────┘
STATES EXPLAINED:
NEW: Process being created
READY: Waiting for CPU time
RUNNING: Currently executing on CPU
WAITING: Blocked on I/O or event
TERMINATED: Finished, waiting for parent to read exit status
1.3 Process Creation
FORK AND EXEC (Unix/Linux)
Parent Process Child Process
│
│ fork()
│────────────────────────────────▶ │
│ │ Created as copy of parent
│ │ Gets new PID
│ │ Shares open file descriptors
│ │
│ (continues) │ exec("/bin/program")
│ │ Replaces memory with new program
│ │
│ wait() │ (runs program)
│ ... │
│ ◀──── exit status ──────────── │ exit(0)
│
FORK() RETURN VALUES:
In parent: Returns child's PID (positive number)
In child: Returns 0
On error: Returns -1
COPY-ON-WRITE (COW):
- Child doesn't immediately copy parent's memory
- Pages marked read-only, shared between parent/child
- Only copied when either tries to write
- Makes fork() very fast even for large processes
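A minimal fork/exec/wait sketch in C. The program being exec'd ("/bin/ls") is just an illustrative choice:
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              // duplicate the calling process (copy-on-write)
    if (pid < 0) {                   // fork failed
        perror("fork");
        exit(1);
    }
    if (pid == 0) {                  // child: replace memory image with a new program
        execl("/bin/ls", "ls", "-l", (char *)NULL);
        perror("exec");              // only reached if exec fails
        _exit(127);
    }
    int status;                      // parent: wait for the child and read its exit status
    waitpid(pid, &status, 0);
    printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}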
1.4 Inter-Process Communication (IPC)
┌─────────────────────────────────────────────────────────────────────────┐
│ IPC MECHANISMS │
│ │
│ 1. PIPES (Anonymous) │
│ ───────────────────────────────────────────────────────────────────── │
│ Process A ───[write]───▶ Pipe ───[read]───▶ Process B │
│ │
│ - Unidirectional │
│ - Parent-child communication │
│ - Example: shell pipes (ls | grep foo) │
│ │
│ 2. NAMED PIPES (FIFOs) │
│ ───────────────────────────────────────────────────────────────────── │
│ Any Process ───▶ /tmp/myfifo ───▶ Any Process │
│ │
│ - Has filesystem name │
│ - Unrelated processes can communicate │
│ │
│ 3. MESSAGE QUEUES │
│ ───────────────────────────────────────────────────────────────────── │
│ Process A ──[msg]──▶ │ Queue │ ──[msg]──▶ Process B │
│ │
│ - Structured messages with types │
│ - Persist until read │
│ - System V or POSIX variants │
│ │
│ 4. SHARED MEMORY │
│ ───────────────────────────────────────────────────────────────────── │
│ Process A ─────┐ ┌───── Process B │
│ ▼ ▼ │
│ ┌───────────────┐ │
│ │ Shared Memory │ │
│ │ Region │ │
│ └───────────────┘ │
│ │
│ - Fastest IPC (no kernel involvement after setup) │
│ - Requires synchronization (mutexes, semaphores) │
│ - Used by: databases, Redis │
│ │
│ 5. SOCKETS │
│ ───────────────────────────────────────────────────────────────────── │
│ Process A ◀══[TCP/UDP/Unix socket]══▶ Process B │
│ │
│ - Network or local (Unix domain sockets) │
│ - Bidirectional │
│ - Most flexible, used everywhere │
│ │
│ 6. SIGNALS │
│ ───────────────────────────────────────────────────────────────────── │
│ Process A ──[SIGTERM]──▶ Process B │
│ │
│ - Asynchronous notifications │
│ - Limited information (just signal number) │
│ - Examples: SIGKILL, SIGTERM, SIGHUP, SIGUSR1 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
SYSTEM DESIGN RELEVANCE:
Microservices typically use:
- HTTP/gRPC over sockets (most common)
- Message queues (Kafka, RabbitMQ, SQS)
Same-machine optimization:
- Unix domain sockets (faster than TCP loopback)
- Shared memory (databases, caches)
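To make the simplest mechanism above concrete, here is a minimal anonymous-pipe sketch in C, with the parent writing and the child reading:
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];                       // fds[0] = read end, fds[1] = write end
    if (pipe(fds) == -1) { perror("pipe"); return 1; }

    if (fork() == 0) {                // child: read from the pipe
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        if (n > 0) { buf[n] = '\0'; printf("child got: %s\n", buf); }
        close(fds[0]);
        return 0;
    }

    close(fds[0]);                    // parent: write into the pipe
    const char *msg = "hello from parent";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);
    wait(NULL);                       // reap the child
    return 0;
}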
Chapter 2: Threads
2.1 Threads vs Processes
┌────────────────────────────────────────────────────────────────────────┐
│ PROCESS vs THREAD │
│ │
│ PROCESS THREAD │
│ ───────────────────────────────── ───────────────────────────── │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Memory Space │ │ Stack │ │
│ │ ┌──────────────┐ │ │ (own) │ │
│ │ │ Stack │ │ │ │ │
│ │ ├──────────────┤ │ │ Registers │ │
│ │ │ Heap │ │ │ (own) │ │
│ │ ├──────────────┤ │ │ │ │
│ │ │ Data │ │ │ Thread ID │ │
│ │ ├──────────────┤ │ │ (own) │ │
│ │ │ Text │ │ └──────────────────┘ │
│ │ └──────────────┘ │ │ │
│ │ │ │ Shares │
│ │ File Descriptors │ ▼ │
│ │ PID │ ┌──────────────────┐ │
│ │ Environment │ │ Heap │ │
│ └──────────────────┘ │ Data │ │
│ │ Text (Code) │ │
│ Isolated from others │ File Descriptors │ │
│ Heavy to create (~10ms) │ PID │ │
│ Context switch: ~10-100 μs └──────────────────┘ │
│ │
│ Share with other threads │
│ Light to create (~10μs) │
│ Context switch: ~1-10 μs │
│ │
└────────────────────────────────────────────────────────────────────────┘
KEY DIFFERENCES:
| Aspect | Process | Thread |
|------------------|----------------------|----------------------|
| Memory | Isolated | Shared |
| Creation cost | High (~10ms) | Low (~10μs) |
| Context switch | Expensive | Cheap |
| Communication | IPC needed | Shared memory |
| Crash isolation | One crash, one dead | One crash, all dead |
| Scalability | Multi-machine | Single machine |
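A small pthread sketch illustrating the table above: threads share globals and the heap, but each gets its own stack and arguments (array contents are illustrative):
#include <pthread.h>
#include <stdio.h>

int results[2];                         // in shared memory: both threads can see it

void *worker(void *arg) {
    int idx = *(int *)arg;              // each thread gets its own stack and arguments
    results[idx] = idx * 100;           // writes to a distinct slot, so no race here
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);   // ~10 μs each, vs ~10 ms for fork()
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("results: %d %d\n", results[0], results[1]); // main thread sees both writes
    return 0;
}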
2.2 Thread Models
THREADING MODELS
1. KERNEL THREADS (1:1)
─────────────────────────────────────────────────────────────────────
Each user thread = one kernel thread
User Space: [T1] [T2] [T3]
│ │ │
Kernel Space: [K1] [K2] [K3]
Pros: True parallelism, simple
Cons: Kernel overhead per thread
Used by: Linux pthreads, Java threads
2. USER THREADS (N:1)
─────────────────────────────────────────────────────────────────────
Many user threads = one kernel thread
User Space: [T1] [T2] [T3] [T4]
\ | | /
Kernel Space: [K1]
Pros: Fast creation/switching
Cons: No parallelism, one block = all block
Used by: older green-thread runtimes (e.g., early Java)
3. HYBRID (M:N)
─────────────────────────────────────────────────────────────────────
Many user threads = some kernel threads
User Space: [T1] [T2] [T3] [T4] [T5] [T6]
\ | / \ | /
Kernel Space: [K1] [K2]
Pros: Parallelism + cheap user threads
Cons: Complex scheduling
Used by: Go goroutines, Erlang processes
2.3 Concurrency vs Parallelism
CONCURRENCY vs PARALLELISM
CONCURRENCY (dealing with many things at once):
─────────────────────────────────────────────────────────────────────
Single core, multiple tasks, interleaved
Time →
Core 1: [Task A][Task B][Task A][Task C][Task B][Task A]
Tasks make progress, but not simultaneously
Useful for I/O-bound work
PARALLELISM (doing many things at once):
─────────────────────────────────────────────────────────────────────
Multiple cores, tasks run simultaneously
Time →
Core 1: [Task A][Task A][Task A][Task A]
Core 2: [Task B][Task B][Task B][Task B]
Core 3: [Task C][Task C][Task C][Task C]
True simultaneous execution
Useful for CPU-bound work
YOU CAN HAVE:
- Concurrency without parallelism (single core, multiple tasks)
- Parallelism without concurrency (multiple cores, one task per core)
- Both (multiple cores, many tasks)
2.4 Thread Safety and Synchronization
THREAD SAFETY PROBLEMS
1. RACE CONDITION
─────────────────────────────────────────────────────────────────────
Two threads access shared data, at least one writes
Result depends on timing (non-deterministic)
Thread A: read counter (0)
Thread B: read counter (0)
Thread A: counter = 0 + 1
Thread B: counter = 0 + 1
Thread A: write counter (1)
Thread B: write counter (1)
Expected: 2 Actual: 1 ← BUG!
2. DEADLOCK
─────────────────────────────────────────────────────────────────────
Two threads each hold a lock the other needs
Thread A: lock(A) Thread B: lock(B)
Thread A: waiting for B Thread B: waiting for A
Both stuck forever!
Prevention:
- Lock ordering (always acquire in same order)
- Lock timeout
- Deadlock detection
3. LIVELOCK
─────────────────────────────────────────────────────────────────────
Threads keep responding to each other but make no progress
Like two people in a hallway who each keep stepping aside to let the other pass
4. STARVATION
─────────────────────────────────────────────────────────────────────
One thread never gets resources (always preempted)
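The race condition above is easy to reproduce. A compact C sketch showing the deliberately unsafe increment and its mutex fix (iteration counts are arbitrary):
#include <pthread.h>
#include <stdio.h>

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *unsafe_inc(void *arg) {               // read-modify-write without a lock: racy
    for (int i = 0; i < 1000000; i++)
        counter++;                           // compiles to load, add, store (not atomic)
    (void)arg;
    return NULL;
}

void *safe_inc(void *arg) {                  // same loop, protected by a mutex
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    (void)arg;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, unsafe_inc, NULL);
    pthread_create(&b, NULL, unsafe_inc, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("unsafe: %ld (usually less than 2000000)\n", counter);

    counter = 0;
    pthread_create(&a, NULL, safe_inc, NULL);
    pthread_create(&b, NULL, safe_inc, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("safe:   %ld (always 2000000)\n", counter);
    return 0;
}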
2.5 Synchronization Primitives
┌────────────────────────────────────────────────────────────────────────┐
│ SYNCHRONIZATION PRIMITIVES │
│ │
│ 1. MUTEX (Mutual Exclusion) │
│ ──────────────────────────────────────────────────────────────────── │
│ Only one thread can hold at a time │
│ │
│ mutex.lock() │
│ // Critical section - only one thread here │
│ mutex.unlock() │
│ │
│ Use for: Protecting shared data │
│ │
│ 2. SEMAPHORE │
│ ──────────────────────────────────────────────────────────────────── │
│ Counter that allows N threads │
│ │
│ sem = Semaphore(3) // Allow 3 concurrent │
│ sem.acquire() // Decrement (block if 0) │
│ // Up to 3 threads here │
│ sem.release() // Increment │
│ │
│ Use for: Connection pools, rate limiting │
│ │
│ 3. READ-WRITE LOCK (RWLock) │
│ ──────────────────────────────────────────────────────────────────── │
│ Multiple readers OR single writer │
│ │
│ rwlock.read_lock() // Many can hold simultaneously │
│ // Reading shared data │
│ rwlock.read_unlock() │
│ │
│ rwlock.write_lock() // Exclusive access │
│ // Modifying shared data │
│ rwlock.write_unlock() │
│ │
│ Use for: Read-heavy workloads │
│ │
│ 4. CONDITION VARIABLE │
│ ──────────────────────────────────────────────────────────────────── │
│ Wait for condition, wake when signaled │
│ │
│ Producer: Consumer: │
│ mutex.lock() mutex.lock() │
│ queue.push(item) while queue.empty(): │
│ cond.signal() cond.wait(mutex) │
│ mutex.unlock() item = queue.pop() │
│ mutex.unlock() │
│ │
│ Use for: Producer-consumer, thread pools │
│ │
│ 5. SPINLOCK │
│ ──────────────────────────────────────────────────────────────────── │
│ Busy-wait instead of sleeping │
│ │
│ while (!lock.try_acquire()): │
│ // spin (waste CPU) │
│ // critical section │
│ lock.release() │
│ │
│ Use for: Very short critical sections (kernel code) │
│ Avoid for: User-space, long waits │
│ │
│ 6. ATOMIC OPERATIONS │
│ ──────────────────────────────────────────────────────────────────── │
│ Hardware-supported indivisible operations │
│ │
│ atomic_int counter = 0; │
│ counter.fetch_add(1); // Atomic increment │
│ counter.compare_exchange(expected, desired); // CAS │
│ │
│ Use for: Counters, lock-free data structures │
│ │
└────────────────────────────────────────────────────────────────────────┘
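As a concrete counterpart to the pseudocode above, a small pthread sketch of the producer-consumer pattern using a mutex and condition variables (buffer size and item counts are arbitrary):
#include <pthread.h>
#include <stdio.h>

#define CAP 8
static int buf[CAP];
static int head = 0, tail = 0, count = 0;    // ring buffer state, guarded by m
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    for (int i = 1; i <= 32; i++) {
        pthread_mutex_lock(&m);
        while (count == CAP)                  // buffer full: wait for space
            pthread_cond_wait(&not_full, &m);
        buf[tail] = i;
        tail = (tail + 1) % CAP;
        count++;
        pthread_cond_signal(&not_empty);      // wake a waiting consumer
        pthread_mutex_unlock(&m);
    }
    (void)arg;
    return NULL;
}

static void *consumer(void *arg) {
    for (int i = 0; i < 32; i++) {
        pthread_mutex_lock(&m);
        while (count == 0)                    // buffer empty: wait for an item
            pthread_cond_wait(&not_empty, &m);
        int item = buf[head];
        head = (head + 1) % CAP;
        count--;
        pthread_cond_signal(&not_full);       // wake a waiting producer
        pthread_mutex_unlock(&m);
        printf("consumed %d\n", item);
    }
    (void)arg;
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}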
2.6 Lock-Free Programming
COMPARE-AND-SWAP (CAS)
Atomic operation that:
1. Compares value to expected
2. If equal, swaps with new value
3. Returns whether swap happened
bool CAS(address, expected, new_value):
// This is atomic (hardware instruction)
if *address == expected:
*address = new_value
return true
return false
LOCK-FREE COUNTER EXAMPLE:
class LockFreeCounter:
    def __init__(self):
        self.value = AtomicInt(0)

    def increment(self):
        while True:
            old = self.value.get()
            new = old + 1
            if self.value.compare_and_swap(old, new):
                return new
            # CAS failed, another thread changed it, retry
WHY LOCK-FREE?
- No deadlocks
- No priority inversion
- Progress guarantee (someone always makes progress)
DOWNSIDES:
- Complex to implement correctly
- Can have contention (many retries)
- Hard to reason about
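A runnable C11 sketch of the same idea using the standard atomics library (the counter and loop are illustrative; the increment is safe to call from any thread):
#include <stdatomic.h>
#include <stdio.h>

static atomic_long counter = 0;

// CAS-based increment: retry until our update wins the race
long increment(void) {
    long old, new_val;
    do {
        old = atomic_load(&counter);
        new_val = old + 1;
    } while (!atomic_compare_exchange_weak(&counter, &old, new_val));
    return new_val;
}

int main(void) {
    for (int i = 0; i < 5; i++)
        printf("%ld\n", increment());
    return 0;
}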
Chapter 3: Context Switching
3.1 What Happens in a Context Switch
CONTEXT SWITCH PROCESS
When OS switches from Process A to Process B:
1. SAVE STATE OF PROCESS A
─────────────────────────────────────────────────────────────────────
- All CPU registers (general purpose, program counter, stack pointer)
- Memory management info (page table pointer)
- Floating point state
→ Saved to Process A's Process Control Block (PCB)
2. UPDATE SCHEDULER
─────────────────────────────────────────────────────────────────────
- Mark Process A as READY (or WAITING)
- Select Process B to run
- Mark Process B as RUNNING
3. RESTORE STATE OF PROCESS B
─────────────────────────────────────────────────────────────────────
- Load registers from Process B's PCB
- Load page table pointer (TLB flush!)
- Jump to Process B's program counter
TIME COSTS:
Direct costs:
- Register save/restore: ~100-1000 cycles
- Page table switch: ~1000+ cycles
Indirect costs (often bigger!):
- TLB flush: TLB misses until translations are re-cached
- CPU cache pollution: Process B's data not in L1/L2
- Pipeline flush: Branch prediction invalid
Total: 1-10 μs for threads, 10-100 μs for processes
3.2 Minimizing Context Switches
STRATEGIES TO REDUCE CONTEXT SWITCHES
1. USE ASYNC I/O (Event Loop)
─────────────────────────────────────────────────────────────────────
One thread handles many connections
No blocking = no forced context switch
Examples: Node.js, nginx, Redis
2. THREAD POOLS
─────────────────────────────────────────────────────────────────────
Fixed number of threads, reused
Avoid thread creation/destruction overhead
pool = ThreadPool(num_cores * 2)
pool.submit(task)
3. CPU AFFINITY
─────────────────────────────────────────────────────────────────────
Pin threads to specific CPU cores
Better cache utilization
taskset -c 0,1 ./myprogram
4. BATCH PROCESSING
─────────────────────────────────────────────────────────────────────
Process multiple items per wake-up
Amortize context switch cost
while items = queue.get_batch(100):
    process(items)
5. COOPERATIVE SCHEDULING
─────────────────────────────────────────────────────────────────────
Coroutines yield voluntarily
No preemptive switches
Examples: Python asyncio, Go goroutines
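For strategy 3, a Linux-specific C sketch equivalent to the taskset command above, pinning the calling process to core 0:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                       // allow core 0 only
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {    // pid 0 = calling process
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned PID %d to core 0\n", (int)getpid());
    // ... do cache-sensitive work here ...
    return 0;
}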
Part II: Memory Management
Chapter 4: Virtual Memory
4.1 Why Virtual Memory?
VIRTUAL MEMORY BENEFITS
1. ISOLATION
Each process thinks it has entire address space
Can't accidentally (or maliciously) access other process's memory
2. SIMPLIFICATION
Programmers use simple linear addresses
OS handles mapping to physical memory
3. OVERCOMMIT
Can allocate more virtual memory than physical RAM
Pages stored on disk when not in use (swap)
4. SHARED MEMORY
Multiple processes can map same physical page
Used for: shared libraries, IPC
VIRTUAL ADDRESS SPACE (64-bit Linux):
0xFFFFFFFFFFFFFFFF ┌──────────────────────┐
│ Kernel Space │ Top 128 TB
│ (not accessible │
│ from user mode) │
0xFFFF800000000000 ├──────────────────────┤
│ │
│ (huge gap) │ Non-canonical addresses
│ │
0x00007FFFFFFFFFFF ├──────────────────────┤
│ Stack (grows ↓) │
│ ... │ User Space
│ Heap (grows ↑) │ Bottom 128 TB
│ Data │
│ Text │
0x0000000000000000 └──────────────────────┘
4.2 Paging
PAGING: VIRTUAL TO PHYSICAL ADDRESS TRANSLATION
Virtual Address Physical Address
┌───────────────────────────┐ ┌───────────────────────────┐
│ Page Number │ Offset │ │ Frame Number │ Offset │
│ (VPN) │ │ │ (PFN) │ │
└──────┬──────┴─────────────┘ └──────────────┴────────────┘
│ ▲
│ │
│ ┌────────────────────┐ │
└───▶│ Page Table │────────┘
│ │
│ VPN 0 → PFN 3 │
│ VPN 1 → PFN 7 │
│ VPN 2 → [disk] │ ← Page fault!
│ VPN 3 → PFN 1 │
│ ... │
└────────────────────┘
PAGE SIZE:
- Typical: 4 KB
- Huge pages: 2 MB or 1 GB (for databases, VMs)
4 KB pages for 4 GB RAM = 1 million page table entries!
Solution: Multi-level page tables
MULTI-LEVEL PAGE TABLE (x86-64):
Virtual Address (48 bits used):
┌────────┬────────┬────────┬────────┬──────────────┐
│ PML4 │ PDPT │ PD │ PT │ Offset │
│ 9 bits │ 9 bits │ 9 bits │ 9 bits │ 12 bits │
└───┬────┴───┬────┴───┬────┴───┬────┴──────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Level 4 → Level 3 → Level 2 → Level 1 → Physical Frame
Sparse tables: Only allocate entries that are used
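A small C sketch of how the four indices and the offset are extracted from a 48-bit virtual address with shifts and masks (the address itself is arbitrary):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va = 0x00007f1234567abcULL;          // an arbitrary user-space address
    unsigned pml4 = (va >> 39) & 0x1FF;           // bits 47-39: level-4 index
    unsigned pdpt = (va >> 30) & 0x1FF;           // bits 38-30: level-3 index
    unsigned pd   = (va >> 21) & 0x1FF;           // bits 29-21: level-2 index
    unsigned pt   = (va >> 12) & 0x1FF;           // bits 20-12: level-1 index
    unsigned off  = (unsigned)(va & 0xFFF);       // bits 11-0: offset within the 4 KB page
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n", pml4, pdpt, pd, pt, off);
    return 0;
}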
4.3 Translation Lookaside Buffer (TLB)
TLB: PAGE TABLE CACHE
Page table lookups are slow (multiple memory accesses)
TLB caches recent translations
┌────────────────────────────────────────────────────────────────────────┐
│ TLB │
│ │
│ CPU ──▶ Virtual Address │
│ │ │
│ ▼ │
│ ┌─────────┐ HIT ┌─────────────────────┐ │
│ │ TLB │──────────▶│ Physical Address │──▶ Memory │
│ │ (Cache) │ └─────────────────────┘ │
│ └────┬────┘ │
│ │ MISS │
│ ▼ │
│ ┌───────────┐ │
│ │Page Table │ (slow: multiple memory reads) │
│ │ Walk │ │
│ └─────┬─────┘ │
│ │ │
│ ▼ │
│ Update TLB, return Physical Address │
│ │
└────────────────────────────────────────────────────────────────────────┘
TLB CHARACTERISTICS:
- Size: 64-1024 entries (tiny!)
- Hit rate: 99%+ for well-behaved programs
- Miss penalty: 10-100 cycles
TLB FLUSH:
- Happens on process switch (page table changes)
- Expensive! All cached translations lost
- PCID (Process Context ID) can help avoid full flush
4.4 Page Faults
PAGE FAULT TYPES
1. MINOR PAGE FAULT (Soft)
─────────────────────────────────────────────────────────────────────
Page is in memory but not mapped in page table
Causes:
- First access to newly allocated memory
- Copy-on-write after fork()
Resolution: Update page table, no disk I/O
Cost: ~1-10 μs
2. MAJOR PAGE FAULT (Hard)
─────────────────────────────────────────────────────────────────────
Page is not in memory, must read from disk
Causes:
- Page was swapped out
- Memory-mapped file access
Resolution: Read page from disk
Cost: ~1-10 ms (1000x slower than minor!)
3. INVALID PAGE FAULT
─────────────────────────────────────────────────────────────────────
Access to unmapped or protected memory
Causes:
- Null pointer dereference
- Stack overflow
- Writing to read-only memory
Resolution: SIGSEGV (crash)
4.5 Memory Allocation
MEMORY ALLOCATION STRATEGIES
USER SPACE (malloc/free):
┌────────────────────────────────────────────────────────────────────┐
│ HEAP │
│ │
│ ┌────┬────┬────────┬────┬──────────┬────┬─────────────────────┐ │
│ │Used│Free│ Used │Free│ Used │Free│ Unallocated │ │
│ │32B │64B │ 128B │32B │ 256B │128B│ │ │
│ └────┴────┴────────┴────┴──────────┴────┴─────────────────────┘ │
│ │
│ Allocators: glibc malloc, jemalloc, tcmalloc │
│ │
└────────────────────────────────────────────────────────────────────┘
ALLOCATION STRATEGIES:
1. First Fit: Use first free block that fits
2. Best Fit: Use smallest free block that fits
3. Buddy System: Powers of 2, split/merge
FRAGMENTATION:
External: Free memory scattered in small pieces
┌────┬────┬────┬────┬────┬────┬────┬────┐
│Used│Free│Used│Free│Used│Free│Used│Free│
│ │64B │ │32B │ │64B │ │32B │ Total free: 192B
└────┴────┴────┴────┴────┴────┴────┴────┘ But can't allocate 100B!
Internal: Allocated block larger than requested
Requested: 100 bytes
Allocated: 128 bytes (next power of 2)
Wasted: 28 bytes
Chapter 5: Caching and Memory Hierarchy
5.1 Memory Hierarchy
MEMORY HIERARCHY (Latency & Size)
┌─────────────┐
│ Registers │ < 1 ns, bytes
└──────┬──────┘
│
┌──────┴──────┐
│ L1 Cache │ ~1 ns, 32-64 KB (per core)
└──────┬──────┘
│
┌──────┴──────┐
│ L2 Cache │ ~4 ns, 256 KB - 1 MB (per core)
└──────┬──────┘
│
┌──────┴──────┐
│ L3 Cache │ ~10-20 ns, 8-64 MB (shared)
└──────┬──────┘
│
┌──────┴──────┐
│ RAM │ ~100 ns, 16-512 GB
└──────┬──────┘
│
┌──────┴──────┐
│ SSD │ ~100 μs, 256 GB - 8 TB
└──────┬──────┘
│
┌──────┴──────┐
│ HDD │ ~10 ms, 1-20 TB
└─────────────┘
LATENCY REFERENCE (Approximate):
L1 cache reference: 1 ns
L2 cache reference: 4 ns
L3 cache reference: 20 ns
RAM reference: 100 ns
SSD random read: 100,000 ns (100 μs)
HDD seek: 10,000,000 ns (10 ms)
L1 is ~100x faster than RAM
RAM is ~1,000x faster than SSD (random read)
SSD is ~100x faster than HDD (seek)
5.2 CPU Cache Behavior
CPU CACHE CONCEPTS
CACHE LINE:
- Unit of data transfer between cache levels
- Typically 64 bytes
- Even if you read 1 byte, 64 bytes are fetched
Memory: [........][........][........][........]
│ │
└───────────┴── Each block is 64 bytes
CACHE ASSOCIATIVITY:
Direct Mapped (1-way):
Each memory address maps to exactly one cache location
Fast but conflicts
N-way Set Associative:
Address can be in N different locations
Better hit rate, slightly slower
Fully Associative:
Can be anywhere in cache
Best hit rate, slowest lookup
CACHE COHERENCE (Multi-core):
Problem: Multiple cores have copies of same data
Core 1 cache: [x = 5]
Core 2 cache: [x = 5]
Core 1: x = 10 (updates its cache)
Core 2: reads x (sees stale value 5!)
Solution: MESI Protocol
Modified: Only this cache has it, dirty
Exclusive: Only this cache has it, clean
Shared: Multiple caches have it, clean
Invalid: Cache line is invalid
5.3 Cache-Friendly Code
CACHE-FRIENDLY PROGRAMMING
1. SEQUENTIAL ACCESS (Good)
─────────────────────────────────────────────────────────────────────
// Array stored: [0][1][2][3][4][5][6][7]
for (int i = 0; i < N; i++)
sum += array[i]; // Sequential, cache-friendly
Each cache line (64B) loads ~16 integers
Next iteration hits cache (spatial locality)
2. STRIDED ACCESS (Bad)
─────────────────────────────────────────────────────────────────────
// 2D array stored row-major: [0,0][0,1][0,2]...[1,0][1,1]...
// BAD: Column-major traversal
for (int j = 0; j < cols; j++)
for (int i = 0; i < rows; i++)
sum += matrix[i][j]; // Jumping around, cache misses!
// GOOD: Row-major traversal
for (int i = 0; i < rows; i++)
for (int j = 0; j < cols; j++)
sum += matrix[i][j]; // Sequential, cache-friendly
3. STRUCTURE OF ARRAYS (SoA) vs ARRAY OF STRUCTURES (AoS)
─────────────────────────────────────────────────────────────────────
// AoS (Array of Structures) - common but often slower
struct Entity { float x, y, z; int health; bool active; };
struct Entity entities[1000];
// To sum all x values: loads each entire struct but only uses x
for (int i = 0; i < 1000; i++) sum += entities[i].x; // Cache pollution!
// SoA (Structure of Arrays) - cache-friendly
struct Entities {
    float x[1000], y[1000], z[1000];
    int health[1000];
    bool active[1000];
};
struct Entities entities;
// To sum all x values: only loads the x array
for (int i = 0; i < 1000; i++) sum += entities.x[i]; // Cache-friendly!
5.4 NUMA (Non-Uniform Memory Access)
NUMA ARCHITECTURE
In multi-socket systems, memory access time depends on location:
┌───────────────────────────────────────────────────────────────────┐
│ │
│ Socket 0 Socket 1 │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ CPU 0 CPU 1 │ │ CPU 2 CPU 3 │ │
│ │ ↓ ↓ │ │ ↓ ↓ │ │
│ │ ┌──────────────┐ │ QPI/UPI │ ┌──────────────┐ │ │
│ │ │ Memory │ │◀───────────▶ │ │ Memory │ │ │
│ │ │ Controller │ │ (slow) │ │ Controller │ │ │
│ │ └──────┬───────┘ │ │ └──────┬───────┘ │ │
│ │ │ │ │ │ │ │
│ │ ┌─────┴─────┐ │ │ ┌─────┴─────┐ │ │
│ │ │ RAM Node 0│ │ │ │ RAM Node 1│ │ │
│ │ │ 64 GB │ │ │ │ 64 GB │ │ │
│ │ └───────────┘ │ │ └───────────┘ │ │
│ └────────────────────┘ └────────────────────┘ │
│ │
│ Local access (Node 0 → RAM 0): ~100 ns │
│ Remote access (Node 0 → RAM 1): ~150-200 ns (1.5-2x slower!) │
│ │
└───────────────────────────────────────────────────────────────────┘
NUMA-AWARE PROGRAMMING:
1. Allocate memory on local node
numactl --membind=0 ./myprogram
2. Bind threads to cores on same node as their data
numactl --cpunodebind=0 ./myprogram
3. Database optimization
- Partition data by NUMA node
- Pin worker threads to nodes
Part III: I/O Models
Chapter 6: I/O Fundamentals
6.1 Blocking vs Non-Blocking I/O
BLOCKING I/O
Thread Kernel
│ │
│ read(fd, buf, size) │
│ ──────────────────────────────────▶ │
│ │
│ [Thread blocked/sleeping] │
│ │ Waiting for
│ [Cannot do other work] │ data from
│ │ disk/network
│ │
│ ◀───────────────────────────────── │
│ Data returned │
│ │
│ [Thread continues] │
Problem: Thread is idle while waiting
Solution: More threads (but threads are expensive)
NON-BLOCKING I/O
Thread Kernel
│ │
│ read(fd, buf, size) │
│ ──────────────────────────────────▶ │
│ │
│ ◀── EAGAIN (would block) ────────── │ No data yet
│ │
│ [Do other work] │
│ │
│ read(fd, buf, size) │
│ ──────────────────────────────────▶ │
│ │
│ ◀── EAGAIN ───────────────────────── │ Still no data
│ │
│ [Do other work] │
│ │
│ read(fd, buf, size) │
│ ──────────────────────────────────▶ │
│ │
│ ◀── Data ─────────────────────────── │ Data ready!
Problem: Busy polling wastes CPU
Solution: Event notification (select/poll/epoll)
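A minimal C sketch of non-blocking reads: set O_NONBLOCK on a descriptor and handle EAGAIN (stdin is used purely for illustration):
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = STDIN_FILENO;
    int flags = fcntl(fd, F_GETFL, 0);           // fetch current descriptor flags
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);      // switch the descriptor to non-blocking

    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf));      // returns immediately
    if (n >= 0) {
        printf("read %zd bytes\n", n);
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
        printf("no data yet, would have blocked\n");   // go do other work instead
    } else {
        perror("read");
    }
    return 0;
}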
6.2 I/O Multiplexing
I/O MULTIPLEXING: MONITORING MANY FILE DESCRIPTORS
Instead of: One thread per connection
Use: One thread monitors many connections
┌────────────────────────────────────────────────────────────────────────┐
│ I/O MULTIPLEXING EVOLUTION │
│ │
│ 1. SELECT (1983) │
│ ──────────────────────────────────────────────────────────────────── │
│ ready = select(fds, read_set, write_set, error_set, timeout) │
│ │
│ Limitations: │
│ - Max 1024 file descriptors (FD_SETSIZE) │
│ - O(n) kernel scan of all fds │
│ - Must rebuild fd sets after each call │
│ │
│ 2. POLL (1986) │
│ ──────────────────────────────────────────────────────────────────── │
│ ready = poll(fds[], nfds, timeout) │
│ │
│ Improvements over select: │
│ - No fd limit │
│ - Separate input/output events │
│ │
│ Still has: │
│ - O(n) kernel scan │
│ - Copy entire array to/from kernel │
│ │
│ 3. EPOLL (Linux 2002) │
│ ──────────────────────────────────────────────────────────────────── │
│ epfd = epoll_create() │
│ epoll_ctl(epfd, ADD, fd, events) // Register once │
│ ready = epoll_wait(epfd, events, max, timeout) │
│ │
│ Advantages: │
│ - O(1) for monitoring (kernel maintains state) │
│ - Only returns ready fds (not all fds) │
│ - Edge-triggered mode for high performance │
│ │
│ 4. KQUEUE (BSD/macOS) │
│ ──────────────────────────────────────────────────────────────────── │
│ Similar to epoll, used on BSD/macOS │
│ │
│ 5. IOCP (Windows) │
│ ──────────────────────────────────────────────────────────────────── │
│ I/O Completion Ports │
│ True async I/O (completion-based, not readiness-based) │
│ │
└────────────────────────────────────────────────────────────────────────┘
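A stripped-down epoll echo server in C showing the register-once, wait, handle-ready loop (port 9000 is a placeholder; error handling omitted for brevity):
#include <netinet/in.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);          // listening socket
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);                        // placeholder port
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    int epfd = epoll_create1(0);                        // create the epoll instance
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);           // register interest once

    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);       // block until something is ready
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {                            // new connection: add it to epoll
                int cfd = accept(lfd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);
            } else {                                    // client data: echo it back
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) { close(fd); }              // closed or error: drop it
                else        { write(fd, buf, (size_t)r); }
            }
        }
    }
}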
6.3 Event-Driven Architecture
EVENT LOOP (Single-Threaded Async)
┌────────────────────────────────────────────────────────────────────┐
│ EVENT LOOP │
│ │
│ while (running): │
│ events = epoll_wait(epfd, timeout) ◀── Block until ready │
│ │
│ for event in events: │
│ if event.type == READABLE: │
│ data = read(event.fd) │
│ process(data) │
│ │
│ if event.type == WRITABLE: │
│ write(event.fd, pending_data) │
│ │
│ run_timers() │
│ run_callbacks() │
│ │
└────────────────────────────────────────────────────────────────────┘
REACTOR PATTERN:
┌────────────┐ ┌────────────────┐ ┌────────────────┐
│ Clients │ │ Reactor │ │ Handlers │
│ │ │ (Event Loop) │ │ │
│ Client 1 ─┼─────▶│ │─────▶│ HTTP Handler │
│ Client 2 ─┼─────▶│ Demultiplexer │─────▶│ WS Handler │
│ Client 3 ─┼─────▶│ (epoll) │─────▶│ File Handler │
│ │ │ │ │ │
└────────────┘ └────────────────┘ └────────────────┘
USED BY:
- Node.js (libuv)
- nginx
- Redis
- Python asyncio
6.4 Async I/O Patterns
I/O PATTERNS COMPARISON
1. THREAD-PER-CONNECTION
─────────────────────────────────────────────────────────────────────
Main Thread: accept() → spawn new thread
Thread 1: handle(client1) [blocking I/O]
Thread 2: handle(client2) [blocking I/O]
Thread 3: handle(client3) [blocking I/O]
...
Pros: Simple programming model
Cons: Threads expensive, doesn't scale past ~10K connections
Used: Traditional web servers (Apache prefork)
2. THREAD POOL
─────────────────────────────────────────────────────────────────────
Main Thread: accept() → queue task
Worker Pool: threads pull from queue
┌──────────┐ ┌─────────┐ ┌──────────┐
│ Acceptor │────▶│ Queue │────▶│ Workers │
└──────────┘ └─────────┘ │ Thread 1 │
│ Thread 2 │
│ Thread 3 │
└──────────┘
Pros: Bounded threads, task queuing
Cons: Still blocking I/O, limited concurrency
Used: Java servlet containers
3. EVENT-DRIVEN (Single Thread)
─────────────────────────────────────────────────────────────────────
Single Thread: epoll_wait → handle ready events → repeat
Pros: Massive concurrency, no thread overhead
Cons: Can't use multiple cores, callback hell
Used: Node.js, Redis
4. HYBRID (Event Loop + Thread Pool)
─────────────────────────────────────────────────────────────────────
Event Loop: handles I/O, delegates CPU work
Thread Pool: handles CPU-intensive tasks
┌──────────────┐ ┌─────────────┐
│ Event Loop │────▶│ Thread Pool │
│ (I/O bound) │◀────│ (CPU bound) │
└──────────────┘ └─────────────┘
Pros: Best of both worlds
Cons: More complex
Used: Node.js (libuv workers), nginx
5. MULTI-THREADED EVENT LOOPS
─────────────────────────────────────────────────────────────────────
Multiple event loops, one per core
Core 0: Event Loop 0 (handles connections 0, 4, 8, ...)
Core 1: Event Loop 1 (handles connections 1, 5, 9, ...)
Core 2: Event Loop 2 (handles connections 2, 6, 10, ...)
Core 3: Event Loop 3 (handles connections 3, 7, 11, ...)
Pros: Scales to multiple cores
Cons: Cross-thread communication needed
Used: nginx (worker processes), Go runtime
6.5 io_uring (Linux)
IO_URING: NEXT-GENERATION LINUX ASYNC I/O
Traditional async I/O problems:
- epoll: Readiness-based (still need syscall to read)
- aio: Limited (files only, not network)
io_uring (Linux 5.1+):
- True async I/O for files AND network
- Submission and completion queues in shared memory
- Batching: Submit many operations with one syscall
┌────────────────────────────────────────────────────────────────────────┐
│ IO_URING │
│ │
│ User Space Kernel Space │
│ │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ Submission Queue │ Shared │ │ │
│ │ (SQ) │◀──Memory────────▶│ Kernel │ │
│ │ │ │ Processing │ │
│ │ [op1][op2][op3] │ │ │ │
│ └───────────────────┘ └───────────────────┘ │
│ │
│ ┌───────────────────┐ │
│ │ Completion Queue │ User polls for completions │
│ │ (CQ) │ No syscall needed! │
│ │ │ │
│ │ [result1][res2] │ │
│ └───────────────────┘ │
│ │
│ Benefits: │
│ - Batch submissions (one syscall for many ops) │
│ - Zero-copy where possible │
│ - Poll mode: no syscalls at all │
│ - Supports: read, write, accept, connect, send, recv, etc. │
│ │
└────────────────────────────────────────────────────────────────────────┘
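A minimal sketch using the liburing helper library, assuming liburing is installed and the kernel is 5.1+ (the file path is just an example):
// Build with: gcc demo.c -luring
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                  // submission/completion rings with 8 entries

    int fd = open("/etc/hostname", O_RDONLY);          // any readable file
    char buf[256];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);  // describe the read operation
    io_uring_submit(&ring);                            // one syscall submits it

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                    // wait for completion
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);                     // mark completion as consumed

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}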
Chapter 7: File Systems
7.1 File System Basics
FILE SYSTEM STRUCTURE
┌─────────────────────────────────────────┐
│ USER SPACE │
│ │
│ Application: open("/data/file.txt") │
│ │ │
└────────────────────┼────────────────────┘
│ System Call
┌────────────────────┼────────────────────┐
│ KERNEL SPACE │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Virtual File System (VFS) │ │
│ │ Common interface for all FS │ │
│ └──────────────┬──────────────────┘ │
│ │ │
│ ┌────────────┼────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────┐ ┌──────┐ ┌──────┐ │
│ │ext4│ │ XFS │ │ NFS │ │
│ └──┬─┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │
│ ┌──┴──────────┴────────────┴──┐ │
│ │ Block Layer │ │
│ │ (I/O scheduling, cache) │ │
│ └──────────────┬──────────────┘ │
│ │ │
└─────────────────┼───────────────────────┘
▼
┌─────────────────────────────────────────┐
│ HARDWARE │
│ (HDD, SSD, NVMe) │
└─────────────────────────────────────────┘
7.2 Inodes and Directory Structure
INODE: FILE METADATA
┌────────────────────────────────────────────────────────────────────────┐
│ INODE │
│ │
│ inode number: 12345 │
│ │
│ ├── File type (regular, directory, symlink, etc.) │
│ ├── Permissions (rwxr-xr-x) │
│ ├── Owner (UID) │
│ ├── Group (GID) │
│ ├── Size (bytes) │
│ ├── Timestamps (atime, mtime, ctime) │
│ ├── Link count │
│ └── Data block pointers: │
│ ├── Direct blocks (12): point directly to data │
│ ├── Indirect block: points to block of pointers │
│ ├── Double indirect: pointer → pointers → pointers → data │
│ └── Triple indirect: one more level │
│ │
└────────────────────────────────────────────────────────────────────────┘
NOTE: Filename is NOT in inode!
Directory maps filename → inode number
DIRECTORY STRUCTURE:
Directory "mydir" (inode 100):
┌────────────────────────────────┐
│ . → 100 (self) │
│ .. → 50 (parent) │
│ file.txt → 12345 │
│ subdir → 200 │
│ link.txt → 12345 (hardlink) │
└────────────────────────────────┘
7.3 Page Cache (Buffer Cache)
PAGE CACHE: OS-LEVEL FILE CACHING
┌────────────────────────────────────────────────────────────────────────┐
│ PAGE CACHE │
│ │
│ Application │
│ │ │
│ │ read(fd, buf, 4096) │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ PAGE CACHE (RAM) │ │
│ │ │ │
│ │ ┌───────┬───────┬───────┬───────┬───────┬───────┐ │ │
│ │ │Page 0 │Page 1 │Page 2 │Page 3 │ ... │Page N │ │ │
│ │ │(file1)│(file1)│(file2)│(file3)│ │ │ │ │
│ │ └───────┴───────┴───────┴───────┴───────┴───────┘ │ │
│ │ │ │
│ │ Cache HIT: Return from RAM (~100ns) │ │
│ │ Cache MISS: Read from disk, cache, return (~10ms for HDD) │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ WRITE BEHAVIOR: │
│ │
│ 1. Write-back (default): │
│ Write to cache → return immediately → flush later │
│ Fast but data loss risk on crash │
│ │
│ 2. Write-through: │
│ Write to cache AND disk → return │
│ Slower but safer │
│ │
│ 3. O_DIRECT: │
│ Bypass cache entirely │
│ Used by databases that manage own cache │
│ │
│ 4. fsync(): │
│ Force flush to disk │
│ Ensures durability │
│ │
└────────────────────────────────────────────────────────────────────────┘
SYSTEM DESIGN IMPACT:
Reading same file twice?
→ Second read from cache (fast)
Too much caching eating RAM?
→ Tune vm.dirty_ratio, vm.dirty_background_ratio
Database doing own caching?
→ Use O_DIRECT to avoid double caching
Need durability?
→ fsync() after writes (databases do this)
7.4 Write-Ahead Logging (WAL)
WRITE-AHEAD LOGGING
Problem:
Updating data in place is not atomic
Crash mid-write = corrupted data
Solution:
Write to log FIRST, then apply changes
On crash recovery, replay log
┌────────────────────────────────────────────────────────────────────────┐
│ WRITE-AHEAD LOG (WAL) │
│ │
│ 1. Write to WAL (append-only, sequential) │
│ 2. fsync() WAL │
│ 3. Apply changes to data files │
│ 4. Eventually checkpoint and truncate WAL │
│ │
│ Normal Operation: │
│ ──────────────────────────────────────────────────── │
│ Transaction → WAL → fsync → Data Files │
│ │
│ Crash Recovery: │
│ ──────────────────────────────────────────────────── │
│ Read WAL → Redo logged changes not yet applied to data files → Consistent state │
│ │
│ Why Sequential Writes? │
│ ──────────────────────────────────────────────────── │
│ HDD sequential: ~100 MB/s │
│ HDD random: ~1 MB/s │
│ WAL is append-only → sequential → fast! │
│ │
│ Used By: │
│ - PostgreSQL (WAL) │
│ - MySQL InnoDB (redo log) │
│ - SQLite (WAL mode) │
│ - Kafka (append-only log) │
│ - LSM-tree databases (RocksDB, LevelDB) │
│ │
└────────────────────────────────────────────────────────────────────────┘
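A bare-bones sketch of the WAL discipline in C: append the record, fsync it, and only then touch the data files (the file name and record format are made up for illustration):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Append one record to the log and make it durable before touching data files.
int wal_append(int wal_fd, const char *record) {
    size_t len = strlen(record);
    if (write(wal_fd, record, len) != (ssize_t)len)   // 1. sequential append
        return -1;
    if (fsync(wal_fd) != 0)                           // 2. force it to stable storage
        return -1;
    return 0;                                         // 3. now safe to modify data files
}

int main(void) {
    int wal_fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (wal_fd < 0) { perror("open"); return 1; }
    if (wal_append(wal_fd, "SET user:42 name=alice\n") == 0)
        printf("record durable; applying change to data files...\n");
    close(wal_fd);
    return 0;
}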
Part IV: System Limits and Tuning
Chapter 8: File Descriptors
8.1 Understanding File Descriptors
FILE DESCRIPTORS
Everything in Unix is a file (or file-like):
- Regular files
- Directories
- Sockets (network connections)
- Pipes
- Devices
File Descriptor = integer handle to open file/resource
┌────────────────────────────────────────────────────────────────────────┐
│ PROCESS FILE DESCRIPTOR TABLE │
│ │
│ fd │ Description │ Points to │
│ ───┼────────────────────┼──────────────────────────────────────── │
│ 0 │ stdin │ Terminal input │
│ 1 │ stdout │ Terminal output │
│ 2 │ stderr │ Terminal error │
│ 3 │ log file │ /var/log/app.log │
│ 4 │ socket │ TCP connection to database │
│ 5 │ socket │ TCP connection to client │
│ 6 │ socket │ TCP connection to client │
│ ...│ │ │
│ │
└────────────────────────────────────────────────────────────────────────┘
LIMITS:
Per-process limit:
ulimit -n # Show current limit (often 1024 default)
ulimit -n 65535 # Increase for this shell
System-wide limit:
cat /proc/sys/fs/file-max # System maximum
For servers handling many connections:
Each TCP connection = 1 file descriptor
10,000 connections = need at least 10,000 fd limit
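A process can also inspect and raise its own soft limit programmatically; a minimal C sketch using getrlimit/setrlimit (raising the hard limit still requires privileges):
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    getrlimit(RLIMIT_NOFILE, &rl);                        // current fd limits
    printf("soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;                            // raise soft limit up to the hard cap
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    return 0;
}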
8.2 Common Limits to Tune
┌────────────────────────────────────────────────────────────────────────┐
│ SYSTEM LIMITS TO TUNE │
│ │
│ LIMIT │ DEFAULT │ TUNED VALUE │ WHY │
│ ───────────────────────┼────────────┼─────────────┼───────────────── │
│ │
│ FILE DESCRIPTORS │
│ ulimit -n │ 1,024 │ 65,535+ │ Many connections │
│ /etc/security/limits.conf │
│ │
│ PROCESSES/THREADS │
│ ulimit -u │ ~30,000 │ 100,000+ │ Thread pool size │
│ kernel.pid_max │
│ │
│ TCP TUNING │
│ net.core.somaxconn │ 128 │ 65,535 │ Listen backlog │
│ net.ipv4.tcp_max_syn_backlog │ 128 │ 65,535 │ SYN queue size │
│ net.ipv4.ip_local_port_range │ 32768-60999 │ 1024-65535 │ More ports │
│ net.ipv4.tcp_tw_reuse │ 0 │ 1 │ Reuse TIME_WAIT │
│ │
│ MEMORY │
│ vm.swappiness │ 60 │ 10 │ Reduce swap usage │
│ vm.dirty_ratio │ 20 │ 60 │ Write-back tuning │
│ vm.overcommit_memory │ 0 │ 1 │ Redis requirement │
│ │
│ NETWORK BUFFERS │
│ net.core.rmem_max │ 212992 │ 16777216 │ TCP receive buf │
│ net.core.wmem_max │ 212992 │ 16777216 │ TCP send buffer │
│ │
└────────────────────────────────────────────────────────────────────────┘
CHECKING CURRENT VALUES:
sysctl -a | grep <parameter>
cat /proc/sys/net/core/somaxconn
SETTING VALUES:
Temporary (until reboot):
sysctl -w net.core.somaxconn=65535
Permanent:
echo "net.core.somaxconn = 65535" >> /etc/sysctl.conf
sysctl -p
Chapter 9: Containerization and Namespaces
9.1 Linux Namespaces
NAMESPACES: ISOLATION FOR CONTAINERS
Namespaces provide isolated views of system resources:
┌────────────────────────────────────────────────────────────────────────┐
│ LINUX NAMESPACES │
│ │
│ NAMESPACE │ ISOLATES │ EXAMPLE │
│ ───────────────┼─────────────────────────────┼────────────────────── │
│ PID │ Process IDs │ Container sees PID 1 │
│ NET │ Network stack │ Own IP, ports, routes │
│ MNT │ Mount points │ Own filesystem view │
│ UTS │ Hostname, domain │ Container hostname │
│ IPC │ IPC resources │ Own message queues │
│ USER │ User/group IDs │ Root in container │
│ CGROUP │ Cgroup visibility │ Own cgroup root │
│ │
└────────────────────────────────────────────────────────────────────────┘
CONTAINER = Process with all namespaces isolated
Host Container
──────────────────────────── ────────────────────────────
PID 1: systemd PID 1: nginx (actually 25463 on host)
PID 2: ... PID 2: nginx worker
...
PID 25463: nginx
Network: eth0 (10.0.0.5) Network: eth0 (172.17.0.2)
Filesystem: / Filesystem: / (overlay on host)
9.2 Control Groups (cgroups)
CGROUPS: RESOURCE LIMITS FOR CONTAINERS
┌────────────────────────────────────────────────────────────────────────┐
│ CGROUPS │
│ │
│ Resource limits and accounting for process groups: │
│ │
│ CPU: │
│ cpu.shares Relative CPU weight │
│ cpu.cfs_quota_us Hard CPU limit (μs per period) │
│ cpu.cfs_period_us Period length (default 100ms) │
│ │
│ Example: 50% CPU = quota 50000, period 100000 │
│ │
│ MEMORY: │
│ memory.limit_in_bytes Hard memory limit │
│ memory.soft_limit_in_bytes Soft limit (reclaim target) │
│ memory.oom_control OOM killer behavior │
│ │
│ Container exceeds limit → OOM killed │
│ │
│ I/O: │
│ blkio.throttle.read_bps_device Read bandwidth limit │
│ blkio.throttle.write_bps_device Write bandwidth limit │
│ │
│ NETWORK: │
│ (Handled by tc/iptables, not cgroups directly) │
│ │
└────────────────────────────────────────────────────────────────────────┘
KUBERNETES RESOURCE REQUESTS/LIMITS:
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    resources:
      requests:           # Scheduling guarantee
        memory: "256Mi"
        cpu: "250m"       # 0.25 cores
      limits:             # Hard ceiling
        memory: "512Mi"   # OOM killed if exceeded
        cpu: "500m"       # Throttled if exceeded
Part V: Interview Questions and Answers
Chapter 10: Common Interview Questions
10.1 Process and Thread Questions
Q: "What's the difference between a process and a thread?"
GREAT ANSWER:
"A process is an independent execution unit with its own memory space,
file descriptors, and system resources. Processes are isolated from each
other — one crashing doesn't affect others.
A thread is a lightweight execution unit within a process. Multiple threads
share the same memory space, heap, and file descriptors, but each has its
own stack and registers.
Key differences:
- Memory: Processes isolated, threads share
- Creation cost: Process ~10ms, thread ~10μs
- Communication: Processes need IPC, threads share memory directly
- Crash impact: Process crash is isolated, thread crash kills whole process
I'd use multiple processes when I need isolation (like running untrusted
code) or want to leverage multiple machines. I'd use threads when I need
shared state and low-latency communication within a single machine.
Modern systems often combine both — like nginx using multiple worker
processes, each with async I/O handling many connections."
────────────────────────────────────────────────────────────────────────────
Q: "How would you handle 1 million concurrent connections?"
GREAT ANSWER:
"This is the C10M problem. The key is minimizing overhead per connection.
First, avoid thread-per-connection. At 1M connections, that's 1M threads,
each using memory for stack (~1MB default). That's 1TB just for stacks!
Instead, I'd use:
1. Event-driven architecture: epoll/kqueue with non-blocking I/O. One thread
can handle thousands of connections by only processing when data is ready.
2. Multiple event loops: One per CPU core. With 32 cores, each handles ~30K
connections.
3. System tuning:
- File descriptor limits: at least 1M
- Ephemeral port range expanded
- TCP buffer sizes optimized
- somaxconn increased for accept queue
4. Memory efficiency:
- Connection state < 1KB per connection
- Zero-copy I/O where possible
- Object pooling to avoid allocation overhead
5. Consider io_uring on modern Linux for even better performance — it
reduces syscall overhead significantly.
Companies like Discord and WhatsApp handle millions of connections using
these techniques."
10.2 Memory Questions
Q: "Explain virtual memory and why it's important."
GREAT ANSWER:
"Virtual memory gives each process the illusion of having its own
contiguous address space, even though physical memory is shared and
fragmented.
How it works:
- Each process has a page table mapping virtual to physical addresses
- CPU's MMU translates addresses on every memory access
- TLB caches recent translations for performance
Key benefits:
1. Isolation: Process A can't access Process B's memory. Even if both use
virtual address 0x1000, they map to different physical frames.
2. Simplification: Programmer doesn't worry about where memory physically
is. malloc() returns an address that just works.
3. Overcommit: Can allocate more virtual memory than physical RAM. Pages
not actively used can be swapped to disk.
4. Shared memory: Same physical page can appear in multiple address spaces.
Used for shared libraries (libc loaded once, shared by all).
5. Memory-mapped files: Map files into address space, let OS handle I/O.
The cost is the page table walk on TLB miss (~100 cycles). That's why TLB
hit rate matters, and why huge pages (2MB instead of 4KB) can help —
fewer TLB entries needed to cover same memory."
────────────────────────────────────────────────────────────────────────────
Q: "What happens when a process accesses memory that's not in RAM?"
GREAT ANSWER:
"That triggers a page fault. There are three types:
1. Minor page fault: The page exists in memory but isn't mapped in the
page table. Common after fork() with copy-on-write, or first access
to newly allocated memory. OS just updates the page table. Cost: ~1μs.
2. Major page fault: The page isn't in memory at all. Either it was
swapped out, or it's a memory-mapped file being accessed. OS must
read from disk. Cost: ~10ms for HDD, ~100μs for SSD.
3. Invalid fault: The address isn't valid — null pointer, stack overflow,
or permission violation. Results in SIGSEGV (segmentation fault).
For high-performance systems, you want to minimize major page faults:
- Disable swap for databases (or use fast NVMe for swap)
- Pre-fault memory on startup (touch all pages)
- Use mlock() to prevent critical pages from swapping
- Monitor page fault rates (vmstat, perf)
Databases like Redis prefer to crash rather than swap — OOM is better
than unpredictable latency from page faults."
10.3 I/O Questions
Q: "Explain the different I/O models in Linux."
GREAT ANSWER:
"There are five main I/O models, each with different tradeoffs:
1. Blocking I/O:
read() blocks until data available. Simple but wastes thread while
waiting. One thread per connection doesn't scale.
2. Non-blocking I/O:
read() returns immediately with EAGAIN if no data. Application must
poll repeatedly. Wastes CPU cycles polling.
3. I/O Multiplexing (select/poll/epoll):
One syscall monitors many file descriptors. epoll_wait() blocks until
ANY descriptor is ready, then returns which ones. Efficient for many
connections. This is what nginx, Node.js, Redis use.
4. Signal-driven I/O:
Kernel sends signal when I/O is ready. Application handles in signal
handler. Rarely used — signals are tricky.
5. Async I/O (aio, io_uring):
Submit I/O operations, kernel completes them in background. io_uring
on modern Linux is the most efficient — submissions and completions
via shared memory rings, minimal syscalls.
For a high-performance server, I'd use epoll (or kqueue on BSD) for
network I/O. For disk I/O where async matters (like databases), I'd
consider io_uring on Linux 5.1+.
The key insight: match I/O model to workload. CPU-bound work needs
threads. I/O-bound work with many connections needs event-driven."
────────────────────────────────────────────────────────────────────────────
Q: "How does epoll differ from select?"
GREAT ANSWER:
"Both monitor multiple file descriptors, but epoll is far more efficient
at scale.
select() problems:
- Limited to 1024 fds (FD_SETSIZE)
- O(n) kernel scan of ALL fds each call
- Must rebuild fd_set after each call
- Returns all fds, you check which are ready
epoll() solutions:
- No fd limit (just system resources)
- O(1) monitoring — kernel maintains interest list
- Register once with epoll_ctl(), monitor forever
- Only returns READY fds, not all fds
Performance comparison with 10,000 connections:
- select: Scan 10,000 fds per call, even if only 1 is ready
- epoll: Return immediately with just the 1 ready fd
epoll also supports edge-triggered mode: only notifies on state CHANGE,
not while state IS ready. More efficient but requires draining the buffer
completely each notification.
For any serious server handling more than a few hundred connections,
epoll (or kqueue on macOS/BSD) is essential."
10.4 Concurrency Questions
Q: "What is a deadlock and how do you prevent it?"
GREAT ANSWER:
"A deadlock is when two or more threads are waiting for each other to
release resources, creating a cycle where none can proceed.
Four conditions must ALL be true for deadlock:
1. Mutual exclusion: Resources can't be shared
2. Hold and wait: Thread holds one resource while waiting for another
3. No preemption: Resources can't be forcibly taken
4. Circular wait: A cycle exists in the wait graph
Prevention strategies break one condition:
1. Lock ordering: Always acquire locks in the same order globally.
If everyone acquires A before B, no cycle is possible.
2. Lock timeout: Don't wait forever. Try to acquire, timeout, release
all locks, retry with backoff. Breaks 'hold and wait'.
3. Single lock: Use one coarse-grained lock instead of many fine-grained
ones. Simpler but reduces parallelism.
4. Lock-free algorithms: Use atomic operations (CAS) instead of locks.
No locks = no deadlock. But complex to implement correctly.
5. Deadlock detection: Build wait-for graph, detect cycles, abort one
transaction. Used by databases.
In practice, I use a combination:
- Lock ordering for related locks
- Timeouts for external resources
- Careful design to minimize lock scope"
────────────────────────────────────────────────────────────────────────────
Q: "What's the difference between a mutex and a semaphore?"
GREAT ANSWER:
"Both are synchronization primitives, but with different semantics:
Mutex (Mutual Exclusion):
- Binary: locked or unlocked
- Ownership: Thread that locks must unlock
- Purpose: Protect critical sections (one thread at a time)
- Use case: Protecting shared data structure
Semaphore:
- Counting: Can allow N threads (not just 1)
- No ownership: Any thread can signal (increment)
- Purpose: Control access to limited resources
- Use case: Connection pool with max 10 connections
Example differences:
Mutex for protecting a map:
    mutex.lock()
    map[key] = value        // Only one thread here
    mutex.unlock()
Semaphore for connection pool:
    sem = Semaphore(10)     // Allow 10 concurrent
    sem.acquire()           // Blocks if 10 are already held
    conn = pool.get()
    // use connection
    pool.release(conn)
    sem.release()           // Allow another thread in
Binary semaphore (count=1) looks like a mutex but lacks ownership —
any thread can release it. This makes it suitable for signaling between
threads (producer signals consumer) but not for protecting critical sections.
I prefer mutex for protection, semaphore for resource counting."
Summary
┌────────────────────────────────────────────────────────────────────────┐
│ OS CONCEPTS KEY TAKEAWAYS │
│ │
│ PROCESSES & THREADS: │
│ • Process: isolated memory, heavy, multi-machine scalable │
│ • Thread: shared memory, light, single-machine only │
│ • Modern: hybrid (Go goroutines, event loops + thread pools) │
│ │
│ SYNCHRONIZATION: │
│ • Mutex: one thread at a time, has ownership │
│ • Semaphore: N threads, counting, no ownership │
│ • RWLock: many readers OR one writer │
│ • Atomic: lock-free, hardware-supported │
│ │
│ MEMORY: │
│ • Virtual memory: isolation, simplification, overcommit │
│ • Page fault: minor (fast) vs major (disk I/O, slow) │
│ • Cache hierarchy: L1 (1ns) → L2 (4ns) → L3 (20ns) → RAM (100ns) │
│ • Cache-friendly code: sequential access, SoA over AoS │
│ │
│ I/O MODELS: │
│ • Blocking: simple but doesn't scale │
│ • Non-blocking + epoll: handles many connections efficiently │
│ • io_uring: next-gen async I/O on Linux │
│ • Event loop: single thread handles thousands of connections │
│ │
│ FILE SYSTEMS: │
│ • Page cache: OS caches file data in RAM │
│ • fsync(): ensures durability (flush to disk) │
│ • WAL: write-ahead log for crash recovery │
│ │
│ SYSTEM TUNING: │
│ • File descriptors: increase ulimit for many connections │
│ • TCP: somaxconn, port range, buffer sizes │
│ • Memory: swappiness, dirty ratios │
│ │
│ CONTAINERS: │
│ • Namespaces: isolation (PID, NET, MNT, etc.) │
│ • Cgroups: resource limits (CPU, memory, I/O) │
│ │
└────────────────────────────────────────────────────────────────────────┘
📚 Further Reading
Books
- Operating Systems: Three Easy Pieces (free online)
- Linux Kernel Development by Robert Love
- Understanding the Linux Kernel by Bovet & Cesati
- Systems Performance by Brendan Gregg
Documentation
- Linux man pages: https://man7.org/linux/man-pages/
- The Linux Kernel documentation: https://www.kernel.org/doc/html/latest/
Tools & Observability
- perf: Linux profiling
- strace: System call tracing
- vmstat: Virtual memory statistics
- iostat: I/O statistics
- htop/top: Process monitoring
- bpftrace: Dynamic tracing
Blog Posts & Articles
- Brendan Gregg's Blog: https://www.brendangregg.com/
- Julia Evans' Zines: https://jvns.ca/
- LWN.net: https://lwn.net/ (Linux Weekly News)
Videos
- MIT 6.S081 Operating System Engineering: https://pdos.csail.mit.edu/6.828/
- CS162 Berkeley OS Course: https://cs162.org/
End of Week 0 — Part 5: Operating System Concepts
Week 0 Complete! You now have the foundation to dive into system design patterns in Week 1 — Foundations of Scale.