Himanshu Kukreja

Week 0 — Part 5: Operating System Concepts for System Design

System Design Mastery Series


Preface

Every distributed system runs on operating systems. When you say "the server handles 10,000 requests per second," you're relying on OS concepts you might not consciously think about.

Your application code sits on top of:

  ┌─────────────────────────────────────────────────────────────────┐
  │                    YOUR APPLICATION                             │
  ├─────────────────────────────────────────────────────────────────┤
  │              Runtime (Python, JVM, Go, Node.js)                 │
  ├─────────────────────────────────────────────────────────────────┤
  │                   System Libraries (libc)                       │
  ├─────────────────────────────────────────────────────────────────┤
  │                   Operating System Kernel                       │
  │   • Process Management    • Memory Management                   │
  │   • File Systems          • I/O Subsystem                       │
  │   • Networking Stack      • Device Drivers                      │
  ├─────────────────────────────────────────────────────────────────┤
  │                        Hardware                                 │
  └─────────────────────────────────────────────────────────────────┘

Understanding OS concepts helps you:
  - Tune your systems for performance
  - Debug production issues
  - Make informed architecture decisions
  - Answer interview questions confidently

This document covers the operating system concepts you need for system design interviews.


Part I: Processes and Threads

Chapter 1: Processes

1.1 What Is a Process?

PROCESS = Running instance of a program

A process has:
┌────────────────────────────────────────────────────────────────────────┐
│                         PROCESS                                        │
│                                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      MEMORY LAYOUT                              │   │
│  │                                                                 │   │
│  │  High addresses                                                 │   │
│  │  ┌──────────────────────┐                                       │   │
│  │  │        Stack         │  ← Local variables, function calls    │   │
│  │  │          ↓           │    Grows downward                     │   │
│  │  ├──────────────────────┤                                       │   │
│  │  │          ↑           │                                       │   │
│  │  │        Heap          │  ← Dynamic allocation (malloc/new)    │   │
│  │  │                      │    Grows upward                       │   │
│  │  ├──────────────────────┤                                       │   │
│  │  │        BSS           │  ← Uninitialized global variables     │   │
│  │  ├──────────────────────┤                                       │   │
│  │  │        Data          │  ← Initialized global variables       │   │
│  │  ├──────────────────────┤                                       │   │
│  │  │        Text          │  ← Program code (read-only)           │   │
│  │  └──────────────────────┘                                       │   │
│  │  Low addresses                                                  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                        │
│  + Process ID (PID)                                                    │
│  + Open file descriptors                                               │
│  + Environment variables                                               │
│  + CPU register state                                                  │
│  + Scheduling priority                                                 │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

1.2 Process States

PROCESS LIFECYCLE

                        ┌─────────────────┐
                        │      NEW        │
                        │   (Created)     │
                        └────────┬────────┘
                                 │ Admitted
                                 ▼
    ┌─────────────────────────────────────────────────────┐
    │                                                     │
    │  ┌─────────────┐    Scheduled    ┌─────────────┐    │
    │  │             │ ──────────────▶ │             │    │
    │  │    READY    │                 │   RUNNING   │    │
    │  │  (Waiting   │ ◀────────────── │ (Executing) │    │
    │  │  for CPU)   │   Preempted     │             │    │
    │  └──────┬──────┘                 └──────┬──────┘    │
    │         │                               │           │
    │         │         ┌─────────────┐       │           │
    │         │         │             │       │           │
    │         └────────▶│   WAITING   │◀──────┘           │
    │    I/O complete   │  (Blocked)  │   I/O request     │
    │                   │             │                   │
    │                   └─────────────┘                   │
    │                                                     │
    └─────────────────────────────────────────────────────┘
                                 │
                                 │ Exit
                                 ▼
                        ┌─────────────────┐
                        │   TERMINATED    │
                        │    (Zombie)     │
                        └─────────────────┘

STATES EXPLAINED:
  NEW:        Process being created
  READY:      Waiting for CPU time
  RUNNING:    Currently executing on CPU
  WAITING:    Blocked on I/O or event
  TERMINATED: Finished, waiting for parent to read exit status

1.3 Process Creation

FORK AND EXEC (Unix/Linux)

Parent Process                          Child Process
      │                                       
      │  fork()                              
      │────────────────────────────────▶ │    
      │                                  │   Created as copy of parent
      │                                  │   Gets new PID
      │                                  │   Shares open file descriptors
      │                                  │
      │  (continues)                     │  exec("/bin/program")
      │                                  │  Replaces memory with new program
      │                                  │
      │  wait()                          │  (runs program)
      │  ...                             │
      │  ◀──── exit status ────────────  │  exit(0)
      │                                       

FORK() RETURN VALUES:
  In parent: Returns child's PID (positive number)
  In child:  Returns 0
  On error:  Returns -1

COPY-ON-WRITE (COW):
  - Child doesn't immediately copy parent's memory
  - Pages marked read-only, shared between parent/child
  - Only copied when either tries to write
  - Makes fork() very fast even for large processes
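
A minimal runnable sketch of the fork()/exec()/wait() sequence in Python
(os.fork is a thin wrapper over the system call; Unix-like systems only):

import os

pid = os.fork()                      # returns twice: child's PID in parent, 0 in child
if pid == 0:
    # Child: replace this process image with a new program
    os.execv("/bin/echo", ["echo", "hello from child"])
else:
    # Parent: block until the child exits, then read its status
    _, status = os.waitpid(pid, 0)
    print("child", pid, "exited with", os.WEXITSTATUS(status))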

1.4 Inter-Process Communication (IPC)

┌─────────────────────────────────────────────────────────────────────────┐
│                    IPC MECHANISMS                                       │
│                                                                         │
│  1. PIPES (Anonymous)                                                   │
│  ─────────────────────────────────────────────────────────────────────  │
│     Process A ───[write]───▶ Pipe ───[read]───▶ Process B               │
│                                                                         │
│     - Unidirectional                                                    │
│     - Parent-child communication                                        │
│     - Example: shell pipes (ls | grep foo)                              │
│                                                                         │
│  2. NAMED PIPES (FIFOs)                                                 │
│  ─────────────────────────────────────────────────────────────────────  │
│     Any Process ───▶ /tmp/myfifo ───▶ Any Process                       │
│                                                                         │
│     - Has filesystem name                                               │
│     - Unrelated processes can communicate                               │
│                                                                         │
│  3. MESSAGE QUEUES                                                      │
│  ─────────────────────────────────────────────────────────────────────  │
│     Process A ──[msg]──▶ │ Queue │ ──[msg]──▶ Process B                 │
│                                                                         │
│     - Structured messages with types                                    │
│     - Persist until read                                                │
│     - System V or POSIX variants                                        │
│                                                                         │
│  4. SHARED MEMORY                                                       │
│  ─────────────────────────────────────────────────────────────────────  │
│     Process A ─────┐     ┌───── Process B                               │
│                    ▼     ▼                                              │
│              ┌───────────────┐                                          │
│              │ Shared Memory │                                          │
│              │    Region     │                                          │
│              └───────────────┘                                          │
│                                                                         │
│     - Fastest IPC (no kernel involvement after setup)                   │
│     - Requires synchronization (mutexes, semaphores)                    │
│     - Used by: databases, Redis                                         │
│                                                                         │
│  5. SOCKETS                                                             │
│  ─────────────────────────────────────────────────────────────────────  │
│     Process A ◀══[TCP/UDP/Unix socket]══▶ Process B                     │
│                                                                         │
│     - Network or local (Unix domain sockets)                            │
│     - Bidirectional                                                     │
│     - Most flexible, used everywhere                                    │
│                                                                         │
│  6. SIGNALS                                                             │
│  ─────────────────────────────────────────────────────────────────────  │
│     Process A ──[SIGTERM]──▶ Process B                                  │
│                                                                         │
│     - Asynchronous notifications                                        │
│     - Limited information (just signal number)                          │
│     - Examples: SIGKILL, SIGTERM, SIGHUP, SIGUSR1                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

SYSTEM DESIGN RELEVANCE:

  Microservices typically use:
    - HTTP/gRPC over sockets (most common)
    - Message queues (Kafka, RabbitMQ, SQS)
    
  Same-machine optimization:
    - Unix domain sockets (faster than TCP loopback)
    - Shared memory (databases, caches)
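
A sketch of mechanism 1, an anonymous pipe between parent and child, using
Python's os module (Unix-like systems only):

import os

r, w = os.pipe()                     # unidirectional: bytes flow write end -> read end
pid = os.fork()
if pid == 0:
    os.close(r)                      # child only writes
    os.write(w, b"ping from child")
    os._exit(0)
else:
    os.close(w)                      # parent only reads
    print(os.read(r, 1024))          # b'ping from child'
    os.waitpid(pid, 0)               # reap the child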

Chapter 2: Threads

2.1 Threads vs Processes

┌────────────────────────────────────────────────────────────────────────┐
│                    PROCESS vs THREAD                                   │
│                                                                        │
│  PROCESS                              THREAD                           │
│  ─────────────────────────────────    ─────────────────────────────    │
│                                                                        │
│  ┌──────────────────┐                ┌──────────────────┐              │
│  │ Memory Space     │                │ Stack            │              │
│  │ ┌──────────────┐ │                │ (own)            │              │
│  │ │    Stack     │ │                │                  │              │
│  │ ├──────────────┤ │                │ Registers        │              │
│  │ │    Heap      │ │                │ (own)            │              │
│  │ ├──────────────┤ │                │                  │              │
│  │ │    Data      │ │                │ Thread ID        │              │
│  │ ├──────────────┤ │                │ (own)            │              │
│  │ │    Text      │ │                └──────────────────┘              │
│  │ └──────────────┘ │                        │                         │
│  │                  │                        │ Shares                  │
│  │ File Descriptors │                        ▼                         │
│  │ PID              │                ┌──────────────────┐              │
│  │ Environment      │                │ Heap             │              │
│  └──────────────────┘                │ Data             │              │
│                                      │ Text (Code)      │              │
│  Isolated from others                │ File Descriptors │              │
│  Heavy to create (~10ms)             │ PID              │              │
│  Context switch: ~10-100 μs          └──────────────────┘              │
│                                                                        │
│                                      Share with other threads          │
│                                      Light to create (~10μs)           │
│                                      Context switch: ~1-10 μs          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

KEY DIFFERENCES:

| Aspect           | Process              | Thread               |
|------------------|----------------------|----------------------|
| Memory           | Isolated             | Shared               |
| Creation cost    | High (~10ms)         | Low (~10μs)          |
| Context switch   | Expensive            | Cheap                |
| Communication    | IPC needed           | Shared memory        |
| Crash isolation  | One crash, one dead  | One crash, all dead  |
| Scalability      | Multi-machine        | Single machine       |
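
The creation-cost gap in the table is easy to observe from Python; a rough
measurement sketch (absolute numbers vary by machine, and Python adds
overhead of its own):

import time, threading, multiprocessing

def noop():
    pass

def avg_cost(factory, n=50):
    # Create, start, and join n workers; return average seconds per worker
    start = time.perf_counter()
    for _ in range(n):
        w = factory(target=noop)
        w.start()
        w.join()
    return (time.perf_counter() - start) / n

if __name__ == "__main__":           # guard required by multiprocessing on some OSes
    print(f"thread:  {avg_cost(threading.Thread) * 1e6:.0f} us")
    print(f"process: {avg_cost(multiprocessing.Process) * 1e6:.0f} us")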

2.2 Thread Models

THREADING MODELS

1. KERNEL THREADS (1:1)
   ─────────────────────────────────────────────────────────────────────
   Each user thread = one kernel thread
   
   User Space:     [T1]    [T2]    [T3]
                    │       │       │
   Kernel Space:   [K1]    [K2]    [K3]
   
   Pros: True parallelism, simple
   Cons: Kernel overhead per thread
   Used by: Linux pthreads, Java threads

2. USER THREADS (N:1)
   ─────────────────────────────────────────────────────────────────────
   Many user threads = one kernel thread
   
   User Space:     [T1]  [T2]  [T3]  [T4]
                     \    |    |    /
   Kernel Space:        [K1]
   
   Pros: Fast creation/switching
   Cons: No parallelism, one block = all block
   Used by: Early green-thread runtimes (e.g., early JVMs)

3. HYBRID (M:N)
   ─────────────────────────────────────────────────────────────────────
   Many user threads = some kernel threads
   
   User Space:     [T1]  [T2]  [T3]  [T4]  [T5]  [T6]
                     \    |    /      \    |    /
   Kernel Space:      [K1]             [K2]
   
   Pros: Parallelism + cheap user threads
   Cons: Complex scheduling
   Used by: Go goroutines, Erlang processes

2.3 Concurrency vs Parallelism

CONCURRENCY vs PARALLELISM

CONCURRENCY (dealing with many things at once):
─────────────────────────────────────────────────────────────────────
  Single core, multiple tasks, interleaved

  Time →
  Core 1: [Task A][Task B][Task A][Task C][Task B][Task A]
  
  Tasks make progress, but not simultaneously
  Useful for I/O-bound work

PARALLELISM (doing many things at once):
─────────────────────────────────────────────────────────────────────
  Multiple cores, tasks run simultaneously
  
  Time →
  Core 1: [Task A][Task A][Task A][Task A]
  Core 2: [Task B][Task B][Task B][Task B]
  Core 3: [Task C][Task C][Task C][Task C]
  
  True simultaneous execution
  Useful for CPU-bound work

YOU CAN HAVE:
  - Concurrency without parallelism (single core, multiple tasks)
  - Parallelism without concurrency (multiple cores, one task per core)
  - Both (multiple cores, many tasks)
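
In Python these two concepts map onto different executors: threads are enough
for I/O-bound concurrency (the GIL serializes Python bytecode), while
CPU-bound parallelism needs processes. A sketch:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_task(n):
    return sum(i * i for i in range(n))      # CPU-bound work

def run(executor_cls):
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(cpu_task, [2_000_000] * 4))

if __name__ == "__main__":
    run(ThreadPoolExecutor)     # concurrent but GIL-bound: ~one core of progress
    run(ProcessPoolExecutor)    # parallel: the four tasks use four cores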

2.4 Thread Safety and Synchronization

THREAD SAFETY PROBLEMS

1. RACE CONDITION
   ─────────────────────────────────────────────────────────────────────
   Two threads access shared data, at least one writes
   Result depends on timing (non-deterministic)
   
   Thread A:  read counter (0)
                              Thread B: read counter (0)
   Thread A:  counter = 0 + 1
                              Thread B: counter = 0 + 1
   Thread A:  write counter (1)
                              Thread B: write counter (1)
   
   Expected: 2    Actual: 1   ← BUG!  (runnable sketch after this list)

2. DEADLOCK
   ─────────────────────────────────────────────────────────────────────
   Two threads each hold a lock the other needs
   
   Thread A: lock(A)           Thread B: lock(B)
   Thread A: waiting for B     Thread B: waiting for A
   
   Both stuck forever!
   
   Prevention:
   - Lock ordering (always acquire in same order)
   - Lock timeout
   - Deadlock detection

3. LIVELOCK
   ─────────────────────────────────────────────────────────────────────
   Threads keep responding to each other but make no progress
   Like two people in hallway both stepping aside
   
4. STARVATION
   ─────────────────────────────────────────────────────────────────────
   One thread never gets resources (always preempted)
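
A runnable version of problem 1's lost update, with the mutex fix alongside
(timing-dependent: the unsafe run usually loses updates, but not on every
execution):

import threading

counter = 0
lock = threading.Lock()

def unsafe():
    global counter
    for _ in range(100_000):
        counter += 1                 # read-modify-write, not atomic

def safe():
    global counter
    for _ in range(100_000):
        with lock:                   # only one thread in here at a time
            counter += 1

threads = [threading.Thread(target=unsafe) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                       # often < 400000 with unsafe(): lost updates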

2.5 Synchronization Primitives

┌────────────────────────────────────────────────────────────────────────┐
│                    SYNCHRONIZATION PRIMITIVES                          │
│                                                                        │
│  1. MUTEX (Mutual Exclusion)                                           │
│  ────────────────────────────────────────────────────────────────────  │
│     Only one thread can hold at a time                                 │
│                                                                        │
│     mutex.lock()                                                       │
│       // Critical section - only one thread here                       │
│     mutex.unlock()                                                     │
│                                                                        │
│     Use for: Protecting shared data                                    │
│                                                                        │
│  2. SEMAPHORE                                                          │
│  ────────────────────────────────────────────────────────────────────  │
│     Counter that allows N threads                                      │
│                                                                        │
│     sem = Semaphore(3)  // Allow 3 concurrent                          │
│     sem.acquire()       // Decrement (block if 0)                      │
│       // Up to 3 threads here                                          │
│     sem.release()       // Increment                                   │
│                                                                        │
│     Use for: Connection pools, rate limiting                           │
│                                                                        │
│  3. READ-WRITE LOCK (RWLock)                                           │
│  ────────────────────────────────────────────────────────────────────  │
│     Multiple readers OR single writer                                  │
│                                                                        │
│     rwlock.read_lock()    // Many can hold simultaneously              │
│       // Reading shared data                                           │
│     rwlock.read_unlock()                                               │
│                                                                        │
│     rwlock.write_lock()   // Exclusive access                          │
│       // Modifying shared data                                         │
│     rwlock.write_unlock()                                              │
│                                                                        │
│     Use for: Read-heavy workloads                                      │
│                                                                        │
│  4. CONDITION VARIABLE                                                 │
│  ────────────────────────────────────────────────────────────────────  │
│     Wait for condition, wake when signaled                             │
│                                                                        │
│     Producer:                     Consumer:                            │
│       mutex.lock()                  mutex.lock()                       │
│       queue.push(item)              while queue.empty():               │
│       cond.signal()                   cond.wait(mutex)                 │
│       mutex.unlock()                item = queue.pop()                 │
│                                     mutex.unlock()                     │
│                                                                        │
│     Use for: Producer-consumer, thread pools                           │
│                                                                        │
│  5. SPINLOCK                                                           │
│  ────────────────────────────────────────────────────────────────────  │
│     Busy-wait instead of sleeping                                      │
│                                                                        │
│     while (!lock.try_acquire()):                                       │
│       // spin (waste CPU)                                              │
│     // critical section                                                │
│     lock.release()                                                     │
│                                                                        │
│     Use for: Very short critical sections (kernel code)                │
│     Avoid for: User-space, long waits                                  │
│                                                                        │
│  6. ATOMIC OPERATIONS                                                  │
│  ────────────────────────────────────────────────────────────────────  │
│     Hardware-supported indivisible operations                          │
│                                                                        │
│     atomic_int counter = 0;                                            │
│     counter.fetch_add(1);  // Atomic increment                         │
│     counter.compare_exchange(expected, desired);  // CAS               │
│                                                                        │
│     Use for: Counters, lock-free data structures                       │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
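
A runnable version of primitive 4's producer-consumer exchange, using
Python's threading.Condition (a condition variable bundled with its mutex):

import threading
from collections import deque

queue = deque()
cond = threading.Condition()         # condition variable + underlying mutex

def producer():
    for item in range(5):
        with cond:                   # lock
            queue.append(item)
            cond.notify()            # signal: wake one waiting consumer

def consumer():
    for _ in range(5):
        with cond:
            while not queue:         # loop guards against spurious wakeups
                cond.wait()          # atomically releases lock and sleeps
            print("consumed", queue.popleft())

workers = [threading.Thread(target=consumer), threading.Thread(target=producer)]
for w in workers: w.start()
for w in workers: w.join()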

2.6 Lock-Free Programming

COMPARE-AND-SWAP (CAS)

Atomic operation that:
1. Compares value to expected
2. If equal, swaps with new value
3. Returns whether swap happened

bool CAS(address, expected, new_value):
    // This is atomic (hardware instruction)
    if *address == expected:
        *address = new_value
        return true
    return false

LOCK-FREE COUNTER EXAMPLE:

class LockFreeCounter:
    def __init__(self):
        self.value = AtomicInt(0)
    
    def increment(self):
        while True:
            old = self.value.get()
            new = old + 1
            if self.value.compare_and_swap(old, new):
                return new
            # CAS failed, another thread changed it, retry
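
AtomicInt above is hypothetical: Python exposes no user-level CAS instruction.
A sketch that emulates it so the counter runs (the internal lock stands in for
the hardware atomicity guarantee, which defeats the purpose in real code):

import threading

class AtomicInt:
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()    # emulation only; real CAS is one instruction

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:                 # compare and swap as one indivisible step
            if self._value == expected:
                self._value = new
                return True
            return False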

WHY LOCK-FREE?
  - No deadlocks
  - No priority inversion
  - Progress guarantee (someone always makes progress)
  
DOWNSIDES:
  - Complex to implement correctly
  - Can have contention (many retries)
  - Hard to reason about

Chapter 3: Context Switching

3.1 What Happens in a Context Switch

CONTEXT SWITCH PROCESS

When OS switches from Process A to Process B:

1. SAVE STATE OF PROCESS A
   ─────────────────────────────────────────────────────────────────────
   - All CPU registers (general purpose, program counter, stack pointer)
   - Memory management info (page table pointer)
   - Floating point state
   → Saved to Process A's Process Control Block (PCB)

2. UPDATE SCHEDULER
   ─────────────────────────────────────────────────────────────────────
   - Mark Process A as READY (or WAITING)
   - Select Process B to run
   - Mark Process B as RUNNING

3. RESTORE STATE OF PROCESS B
   ─────────────────────────────────────────────────────────────────────
   - Load registers from Process B's PCB
   - Load page table pointer (TLB flush!)
   - Jump to Process B's program counter

TIME COSTS:

  Direct costs:
    - Register save/restore: ~100-1000 cycles
    - Page table switch: ~1000+ cycles
    
  Indirect costs (often bigger!):
    - TLB flush: TLB misses until translations are re-cached
    - CPU cache pollution: Process B's data not in L1/L2
    - Pipeline flush: Branch prediction invalid

  Total: 1-10 μs for threads, 10-100 μs for processes

3.2 Minimizing Context Switches

STRATEGIES TO REDUCE CONTEXT SWITCHES

1. USE ASYNC I/O (Event Loop)
   ─────────────────────────────────────────────────────────────────────
   One thread handles many connections
   No blocking = no forced context switch
   
   Examples: Node.js, nginx, Redis

2. THREAD POOLS
   ─────────────────────────────────────────────────────────────────────
   Fixed number of threads, reused
   Avoid thread creation/destruction overhead
   
   pool = ThreadPool(num_cores * 2)
   pool.submit(task)

3. CPU AFFINITY
   ─────────────────────────────────────────────────────────────────────
   Pin threads to specific CPU cores
   Better cache utilization
   
   taskset -c 0,1 ./myprogram

4. BATCH PROCESSING
   ─────────────────────────────────────────────────────────────────────
   Process multiple items per wake-up
   Amortize context switch cost
   
   while items = queue.get_batch(100):
       process(items)

5. COOPERATIVE SCHEDULING
   ─────────────────────────────────────────────────────────────────────
   Coroutines yield voluntarily
   No preemptive switches
   
   Examples: Python asyncio, Go goroutines
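
Strategy 5 in Python: asyncio coroutines switch only at await points, so
there is no preemption and no kernel context switch between tasks. A sketch:

import asyncio

async def handle(name):
    await asyncio.sleep(0.1)         # yields to the loop; other coroutines run
    return name

async def main():
    # Tens of thousands of these are far cheaper than the same number of threads
    results = await asyncio.gather(*(handle(f"task-{i}") for i in range(1000)))
    print(len(results), "tasks done on one thread")

asyncio.run(main())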

Part II: Memory Management

Chapter 4: Virtual Memory

4.1 Why Virtual Memory?

VIRTUAL MEMORY BENEFITS

1. ISOLATION
   Each process thinks it has entire address space
   Can't accidentally (or maliciously) access other process's memory
   
2. SIMPLIFICATION
   Programmers use simple linear addresses
   OS handles mapping to physical memory
   
3. OVERCOMMIT
   Can allocate more virtual memory than physical RAM
   Pages stored on disk when not in use (swap)
   
4. SHARED MEMORY
   Multiple processes can map same physical page
   Used for: shared libraries, IPC

VIRTUAL ADDRESS SPACE (64-bit Linux):

  0xFFFFFFFFFFFFFFFF ┌──────────────────────┐
                     │    Kernel Space      │  Top 128 TB
                     │  (not accessible     │
                     │   from user mode)    │
  0xFFFF800000000000 ├──────────────────────┤
                     │                      │
                     │    (huge gap)        │  Non-canonical addresses
                     │                      │
  0x00007FFFFFFFFFFF ├──────────────────────┤
                     │    Stack (grows ↓)   │
                     │    ...               │  User Space
                     │    Heap (grows ↑)    │  Bottom 128 TB
                     │    Data              │
                     │    Text              │
  0x0000000000000000 └──────────────────────┘
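
On Linux you can see this layout for a live process in /proc; a sketch that
prints a few regions of the current interpreter (Linux-only; the exact
regions you see will differ):

# Each line of /proc/self/maps is: address-range perms offset dev inode path
with open("/proc/self/maps") as maps:
    for line in maps:
        if any(tag in line for tag in ("[heap]", "[stack]", "[vdso]")):
            print(line.rstrip())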

4.2 Paging

PAGING: VIRTUAL TO PHYSICAL ADDRESS TRANSLATION

Virtual Address                    Physical Address
┌───────────────────────────┐      ┌───────────────────────────┐
│ Page Number │ Offset      │      │ Frame Number │ Offset     │
│ (VPN)       │             │      │ (PFN)        │            │
└──────┬──────┴─────────────┘      └──────────────┴────────────┘
       │                                  ▲
       │                                  │
       │    ┌────────────────────┐        │
       └───▶│    Page Table      │────────┘
            │                    │
            │  VPN 0 → PFN 3     │
            │  VPN 1 → PFN 7     │
            │  VPN 2 → [disk]    │  ← Page fault!
            │  VPN 3 → PFN 1     │
            │  ...               │
            └────────────────────┘

PAGE SIZE:
  - Typical: 4 KB
  - Huge pages: 2 MB or 1 GB (for databases, VMs)
  
  4 KB pages for 4 GB RAM = 1 million page table entries!
  Solution: Multi-level page tables

MULTI-LEVEL PAGE TABLE (x86-64):

  Virtual Address (48 bits used):
  ┌────────┬────────┬────────┬────────┬──────────────┐
  │ PML4   │ PDPT   │  PD    │  PT    │   Offset     │
  │ 9 bits │ 9 bits │ 9 bits │ 9 bits │   12 bits    │
  └───┬────┴───┬────┴───┬────┴───┬────┴──────────────┘
      │        │        │        │
      ▼        ▼        ▼        ▼
   Level 4 → Level 3 → Level 2 → Level 1 → Physical Frame
   
  Sparse tables: Only allocate entries that are used

4.3 Translation Lookaside Buffer (TLB)

TLB: PAGE TABLE CACHE

Page table lookups are slow (multiple memory accesses)
TLB caches recent translations

┌────────────────────────────────────────────────────────────────────────┐
│                         TLB                                            │
│                                                                        │
│  CPU ──▶ Virtual Address                                               │
│              │                                                         │
│              ▼                                                         │
│         ┌─────────┐    HIT     ┌─────────────────────┐                 │
│         │   TLB   │──────────▶│ Physical Address    │──▶ Memory        │
│         │ (Cache) │            └─────────────────────┘                 │
│         └────┬────┘                                                    │
│              │ MISS                                                    │
│              ▼                                                         │
│         ┌───────────┐                                                  │
│         │Page Table │  (slow: multiple memory reads)                   │
│         │  Walk     │                                                  │
│         └─────┬─────┘                                                  │
│               │                                                        │
│               ▼                                                        │
│         Update TLB, return Physical Address                            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

TLB CHARACTERISTICS:
  - Size: 64-1024 entries (tiny!)
  - Hit rate: 99%+ for well-behaved programs
  - Miss penalty: 10-100 cycles
  
TLB FLUSH:
  - Happens on process switch (page table changes)
  - Expensive! All cached translations lost
  - PCID (Process Context ID) can help avoid full flush

4.4 Page Faults

PAGE FAULT TYPES

1. MINOR PAGE FAULT (Soft)
   ─────────────────────────────────────────────────────────────────────
   Page is in memory but not mapped in page table
   
   Causes:
   - First access to newly allocated memory
   - Copy-on-write after fork()
   
   Resolution: Update page table, no disk I/O
   Cost: ~1-10 μs

2. MAJOR PAGE FAULT (Hard)
   ─────────────────────────────────────────────────────────────────────
   Page is not in memory, must read from disk
   
   Causes:
   - Page was swapped out
   - Memory-mapped file access
   
   Resolution: Read page from disk
   Cost: ~1-10 ms (1000x slower than minor!)

3. INVALID PAGE FAULT
   ─────────────────────────────────────────────────────────────────────
   Access to unmapped or protected memory
   
   Causes:
   - Null pointer dereference
   - Stack overflow
   - Writing to read-only memory
   
   Resolution: SIGSEGV (crash)
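
Fault counts per process are visible through getrusage; a sketch using
Python's resource module (Unix-like systems; counts depend on allocator
behavior):

import resource

def faults():
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt     # (minor, major) so far

minor0, major0 = faults()
buf = bytearray(50 * 1024 * 1024)    # allocate and zero 50 MB: first-touch minor faults
minor1, major1 = faults()
print("minor:", minor1 - minor0, "major:", major1 - major0)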

4.5 Memory Allocation

MEMORY ALLOCATION STRATEGIES

USER SPACE (malloc/free):

  ┌────────────────────────────────────────────────────────────────────┐
  │                         HEAP                                       │
  │                                                                    │
  │  ┌────┬────┬────────┬────┬──────────┬────┬─────────────────────┐   │
  │  │Used│Free│  Used  │Free│   Used   │Free│    Unallocated      │   │
  │  │32B │64B │  128B  │32B │   256B   │128B│                     │   │
  │  └────┴────┴────────┴────┴──────────┴────┴─────────────────────┘   │
  │                                                                    │
  │  Allocators: glibc malloc, jemalloc, tcmalloc                      │
  │                                                                    │
  └────────────────────────────────────────────────────────────────────┘

ALLOCATION STRATEGIES:

  1. First Fit:  Use first free block that fits
  2. Best Fit:   Use smallest free block that fits
  3. Buddy System: Powers of 2, split/merge
  
FRAGMENTATION:
  
  External: Free memory scattered in small pieces
  
  ┌────┬────┬────┬────┬────┬────┬────┬────┐
  │Used│Free│Used│Free│Used│Free│Used│Free│
  │    │64B │    │32B │    │64B │    │32B │  Total free: 192B
  └────┴────┴────┴────┴────┴────┴────┴────┘  But can't allocate 100B!
  
  Internal: Allocated block larger than requested
  
  Requested: 100 bytes
  Allocated: 128 bytes (next power of 2)
  Wasted: 28 bytes

Chapter 5: Caching and Memory Hierarchy

5.1 Memory Hierarchy

MEMORY HIERARCHY (Latency & Size)

                ┌─────────────┐
                │  Registers  │  < 1 ns, bytes
                └──────┬──────┘
                       │
                ┌──────┴──────┐
                │  L1 Cache   │  ~1 ns, 32-64 KB (per core)
                └──────┬──────┘
                       │
                ┌──────┴──────┐
                │  L2 Cache   │  ~4 ns, 256 KB - 1 MB (per core)
                └──────┬──────┘
                       │
                ┌──────┴──────┐
                │  L3 Cache   │  ~10-20 ns, 8-64 MB (shared)
                └──────┬──────┘
                       │
                ┌──────┴──────┐
                │    RAM      │  ~100 ns, 16-512 GB
                └──────┬──────┘
                       │
                ┌──────┴──────┐
                │    SSD      │  ~100 μs, 256 GB - 8 TB
                └──────┬──────┘
                       │
                ┌──────┴──────┐
                │    HDD      │  ~10 ms, 1-20 TB
                └─────────────┘

LATENCY REFERENCE (Approximate):
  L1 cache reference:                   1 ns
  L2 cache reference:                   4 ns
  L3 cache reference:                  20 ns
  RAM reference:                      100 ns
  SSD random read:                100,000 ns  (100 μs)
  HDD seek:                    10,000,000 ns  (10 ms)
  
  L1 is ~100x faster than RAM
  RAM is ~1000x faster than SSD
  SSD is ~100x faster than HDD

5.2 CPU Cache Behavior

CPU CACHE CONCEPTS

CACHE LINE:
  - Unit of data transfer between cache levels
  - Typically 64 bytes
  - Even if you read 1 byte, 64 bytes are fetched
  
  Memory: [........][........][........][........]
             │           │
             └───────────┴── Each block is 64 bytes

CACHE ASSOCIATIVITY:
  
  Direct Mapped (1-way):
    Each memory address maps to exactly one cache location
    Fast but conflicts
    
  N-way Set Associative:
    Address can be in N different locations
    Better hit rate, slightly slower
    
  Fully Associative:
    Can be anywhere in cache
    Best hit rate, slowest lookup

CACHE COHERENCE (Multi-core):
  
  Problem: Multiple cores have copies of same data
  
  Core 1 cache: [x = 5]
  Core 2 cache: [x = 5]
  
  Core 1: x = 10  (updates its cache)
  Core 2: reads x  (sees stale value 5!)
  
  Solution: MESI Protocol
    Modified:  Only this cache has it, dirty
    Exclusive: Only this cache has it, clean
    Shared:    Multiple caches have it, clean
    Invalid:   Cache line is invalid

5.3 Cache-Friendly Code

CACHE-FRIENDLY PROGRAMMING

1. SEQUENTIAL ACCESS (Good)
   ─────────────────────────────────────────────────────────────────────
   // Array stored: [0][1][2][3][4][5][6][7]
   
   for (int i = 0; i < N; i++)
       sum += array[i];  // Sequential, cache-friendly
   
   Each cache line (64B) loads ~16 integers
   Next iteration hits cache (spatial locality)

2. STRIDED ACCESS (Bad)
   ─────────────────────────────────────────────────────────────────────
   // 2D array stored row-major: [0,0][0,1][0,2]...[1,0][1,1]...
   
   // BAD: Column-major traversal
   for (int j = 0; j < cols; j++)
       for (int i = 0; i < rows; i++)
           sum += matrix[i][j];  // Jumping around, cache misses!
   
   // GOOD: Row-major traversal
   for (int i = 0; i < rows; i++)
       for (int j = 0; j < cols; j++)
           sum += matrix[i][j];  // Sequential, cache-friendly

3. STRUCTURE OF ARRAYS (SoA) vs ARRAY OF STRUCTURES (AoS)
   ─────────────────────────────────────────────────────────────────────
   // AoS (Array of Structures) - common but often slower
   struct Entity { float x, y, z; int health; bool active; };
   Entity entities[1000];
   
   // To sum all x values: loads entire struct but only uses x
    for (int i = 0; i < 1000; i++) sum += entities[i].x;  // Cache pollution!
   
   // SoA (Structure of Arrays) - cache-friendly
   struct Entities {
       float x[1000], y[1000], z[1000];
       int health[1000];
       bool active[1000];
   };
   
   // To sum all x values: only loads x array
    for (int i = 0; i < 1000; i++) sum += entities.x[i];  // Cache-friendly!
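
The same effect is measurable from Python with numpy (a third-party package):
summing every 16th float touches a new 64-byte cache line per element, while
the contiguous slice streams through cache. A sketch:

import time
import numpy as np

data = np.ones(64_000_000, dtype=np.float32)     # ~256 MB buffer

def timed(label, view):
    start = time.perf_counter()
    view.sum()                                   # reads 4M floats either way
    print(f"{label}: {time.perf_counter() - start:.3f}s")

timed("sequential (4M contiguous)", data[:4_000_000])
timed("strided    (4M, stride 16)", data[::16])  # one cache line per element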

5.4 NUMA (Non-Uniform Memory Access)

NUMA ARCHITECTURE

In multi-socket systems, memory access time depends on location:

  ┌───────────────────────────────────────────────────────────────────┐
  │                                                                   │
  │   Socket 0                              Socket 1                  │
  │  ┌────────────────────┐              ┌────────────────────┐       │
  │  │ CPU 0    CPU 1     │              │ CPU 2    CPU 3     │       │
  │  │  ↓        ↓        │              │  ↓        ↓        │       │
  │  │ ┌──────────────┐   │   QPI/UPI    │ ┌──────────────┐   │       │
  │  │ │   Memory     │   │◀───────────▶ │ │   Memory     │   │       │
  │  │ │ Controller   │   │   (slow)     │ │ Controller   │   │       │
  │  │ └──────┬───────┘   │              │ └──────┬───────┘   │       │
  │  │        │           │              │        │           │       │
  │  │  ┌─────┴─────┐     │              │  ┌─────┴─────┐     │       │
  │  │  │ RAM Node 0│     │              │  │ RAM Node 1│     │       │
  │  │  │   64 GB   │     │              │  │   64 GB   │     │       │
  │  │  └───────────┘     │              │  └───────────┘     │       │
  │  └────────────────────┘              └────────────────────┘       │
  │                                                                   │
  │  Local access (Node 0 → RAM 0):  ~100 ns                          │
  │  Remote access (Node 0 → RAM 1): ~150-200 ns  (1.5-2x slower!)    │
  │                                                                   │
  └───────────────────────────────────────────────────────────────────┘

NUMA-AWARE PROGRAMMING:

  1. Allocate memory on local node
     numactl --membind=0 ./myprogram
     
  2. Bind threads to cores on same node as their data
     numactl --cpunodebind=0 ./myprogram
     
  3. Database optimization
     - Partition data by NUMA node
     - Pin worker threads to nodes

Part III: I/O Models

Chapter 6: I/O Fundamentals

6.1 Blocking vs Non-Blocking I/O

BLOCKING I/O

  Thread                                    Kernel
     │                                        │
     │  read(fd, buf, size)                   │
     │ ──────────────────────────────────▶    │
     │                                        │
     │        [Thread blocked/sleeping]       │
     │                                        │  Waiting for
     │        [Cannot do other work]          │  data from
     │                                        │  disk/network
     │                                        │
     │  ◀─────────────────────────────────    │
     │         Data returned                  │
     │                                        │
     │  [Thread continues]                    │

Problem: Thread is idle while waiting
Solution: More threads (but threads are expensive)


NON-BLOCKING I/O

  Thread                                    Kernel
     │                                        │
     │  read(fd, buf, size)                   │
     │ ──────────────────────────────────▶    │
     │                                        │
     │  ◀── EAGAIN (would block) ──────────   │  No data yet
     │                                        │
     │  [Do other work]                       │
     │                                        │
     │  read(fd, buf, size)                   │
     │ ──────────────────────────────────▶    │
     │                                        │
     │  ◀── EAGAIN ─────────────────────────  │  Still no data
     │                                        │
     │  [Do other work]                       │
     │                                        │
     │  read(fd, buf, size)                   │
     │ ──────────────────────────────────▶    │
     │                                        │
     │  ◀── Data ───────────────────────────  │  Data ready!

Problem: Busy polling wastes CPU
Solution: Event notification (select/poll/epoll)

6.2 I/O Multiplexing

I/O MULTIPLEXING: MONITORING MANY FILE DESCRIPTORS

Instead of: One thread per connection
Use: One thread monitors many connections

┌────────────────────────────────────────────────────────────────────────┐
│                    I/O MULTIPLEXING EVOLUTION                          │
│                                                                        │
│  1. SELECT (1983)                                                      │
│  ────────────────────────────────────────────────────────────────────  │
│     ready = select(fds, read_set, write_set, error_set, timeout)       │
│                                                                        │
│     Limitations:                                                       │
│     - Max 1024 file descriptors (FD_SETSIZE)                           │
│     - O(n) kernel scan of all fds                                      │
│     - Must rebuild fd sets after each call                             │
│                                                                        │
│  2. POLL (1986)                                                        │
│  ────────────────────────────────────────────────────────────────────  │
│     ready = poll(fds[], nfds, timeout)                                 │
│                                                                        │
│     Improvements over select:                                          │
│     - No fd limit                                                      │
│     - Separate input/output events                                     │
│                                                                        │
│     Still has:                                                         │
│     - O(n) kernel scan                                                 │
│     - Copy entire array to/from kernel                                 │
│                                                                        │
│  3. EPOLL (Linux 2002)                                                 │
│  ────────────────────────────────────────────────────────────────────  │
│     epfd = epoll_create()                                              │
│     epoll_ctl(epfd, ADD, fd, events)  // Register once                 │
│     ready = epoll_wait(epfd, events, max, timeout)                     │
│                                                                        │
│     Advantages:                                                        │
│     - O(1) for monitoring (kernel maintains state)                     │
│     - Only returns ready fds (not all fds)                             │
│     - Edge-triggered mode for high performance                         │
│                                                                        │
│  4. KQUEUE (BSD/macOS)                                                 │
│  ────────────────────────────────────────────────────────────────────  │
│     Similar to epoll, used on BSD/macOS                                │
│                                                                        │
│  5. IOCP (Windows)                                                     │
│  ────────────────────────────────────────────────────────────────────  │
│     I/O Completion Ports                                               │
│     True async I/O (completion-based, not readiness-based)             │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
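
Python's selectors module picks the best mechanism available (epoll on Linux,
kqueue on BSD/macOS, else poll/select); a minimal single-threaded echo server
sketch:

import selectors
import socket

sel = selectors.DefaultSelector()            # epoll/kqueue/poll, whichever exists

listener = socket.socket()
listener.bind(("127.0.0.1", 9000))
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)

while True:
    for key, _ in sel.select():              # block until at least one fd is ready
        sock = key.fileobj
        if sock is listener:
            conn, _ = listener.accept()      # new connection: watch it too
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = sock.recv(4096)
            if data:
                sock.sendall(data)           # echo back
            else:                            # empty read: client closed
                sel.unregister(sock)
                sock.close()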

6.3 Event-Driven Architecture

EVENT LOOP (Single-Threaded Async)

  ┌────────────────────────────────────────────────────────────────────┐
  │                        EVENT LOOP                                  │
  │                                                                    │
  │   while (running):                                                 │
  │       events = epoll_wait(epfd, timeout)   ◀── Block until ready   │
  │                                                                    │
  │       for event in events:                                         │
  │           if event.type == READABLE:                               │
  │               data = read(event.fd)                                │
  │               process(data)                                        │
  │                                                                    │
  │           if event.type == WRITABLE:                               │
  │               write(event.fd, pending_data)                        │
  │                                                                    │
  │       run_timers()                                                 │
  │       run_callbacks()                                              │
  │                                                                    │
  └────────────────────────────────────────────────────────────────────┘

REACTOR PATTERN:

  ┌────────────┐      ┌────────────────┐      ┌────────────────┐
  │  Clients   │      │    Reactor     │      │   Handlers     │
  │            │      │  (Event Loop)  │      │                │
  │  Client 1 ─┼─────▶│                │─────▶│  HTTP Handler  │
  │  Client 2 ─┼─────▶│  Demultiplexer │─────▶│  WS Handler    │
  │  Client 3 ─┼─────▶│  (epoll)       │─────▶│  File Handler  │
  │            │      │                │      │                │
  └────────────┘      └────────────────┘      └────────────────┘

USED BY:
  - Node.js (libuv)
  - nginx
  - Redis
  - Python asyncio

6.4 Async I/O Patterns

I/O PATTERNS COMPARISON

1. THREAD-PER-CONNECTION
   ─────────────────────────────────────────────────────────────────────
   
   Main Thread: accept() → spawn new thread
   
   Thread 1: handle(client1) [blocking I/O]
   Thread 2: handle(client2) [blocking I/O]
   Thread 3: handle(client3) [blocking I/O]
   ...
   
   Pros: Simple programming model
   Cons: Threads expensive, doesn't scale past ~10K connections
   Used: Traditional web servers (e.g., Apache's worker MPM)

2. THREAD POOL
   ─────────────────────────────────────────────────────────────────────
   
   Main Thread: accept() → queue task
   Worker Pool: threads pull from queue
   
   ┌──────────┐     ┌─────────┐     ┌──────────┐
   │ Acceptor │────▶│  Queue  │────▶│ Workers  │
   └──────────┘     └─────────┘     │ Thread 1 │
                                    │ Thread 2 │
                                    │ Thread 3 │
                                    └──────────┘
   
   Pros: Bounded threads, task queuing
   Cons: Still blocking I/O, limited concurrency
   Used: Java servlet containers

3. EVENT-DRIVEN (Single Thread)
   ─────────────────────────────────────────────────────────────────────
   
   Single Thread: epoll_wait → handle ready events → repeat
   
   Pros: Massive concurrency, no thread overhead
   Cons: Can't use multiple cores, callback hell
   Used: Node.js, Redis

4. HYBRID (Event Loop + Thread Pool)
   ─────────────────────────────────────────────────────────────────────
   
   Event Loop: handles I/O, delegates CPU work
   Thread Pool: handles CPU-intensive tasks
   
   ┌──────────────┐     ┌─────────────┐
   │  Event Loop  │────▶│ Thread Pool │
   │  (I/O bound) │◀────│ (CPU bound) │
   └──────────────┘     └─────────────┘
   
   Pros: Best of both worlds
   Cons: More complex
   Used: Node.js (libuv workers), nginx

5. MULTI-THREADED EVENT LOOPS
   ─────────────────────────────────────────────────────────────────────
   
   Multiple event loops, one per core
   
   Core 0: Event Loop 0 (handles connections 0, 4, 8, ...)
   Core 1: Event Loop 1 (handles connections 1, 5, 9, ...)
   Core 2: Event Loop 2 (handles connections 2, 6, 10, ...)
   Core 3: Event Loop 3 (handles connections 3, 7, 11, ...)
   
   Pros: Scales to multiple cores
   Cons: Cross-thread communication needed
   Used: nginx (worker processes), Go runtime

6.5 io_uring (Linux)

IO_URING: NEXT-GENERATION LINUX ASYNC I/O

Traditional async I/O problems:
  - epoll: Readiness-based (still need syscall to read)
  - aio: Limited (files only, not network)
  
io_uring (Linux 5.1+):
  - True async I/O for files AND network
  - Submission and completion queues in shared memory
  - Batching: Submit many operations with one syscall

┌────────────────────────────────────────────────────────────────────────┐
│                         IO_URING                                       │
│                                                                        │
│   User Space                              Kernel Space                 │
│                                                                        │
│   ┌───────────────────┐                  ┌───────────────────┐         │
│   │  Submission Queue │    Shared        │                   │         │
│   │  (SQ)             │◀──Memory────────▶│    Kernel         │         │
│   │                   │                  │    Processing     │         │
│   │  [op1][op2][op3]  │                  │                   │         │
│   └───────────────────┘                  └───────────────────┘         │
│                                                                        │
│   ┌───────────────────┐                                                │
│   │  Completion Queue │    User polls for completions                  │
│   │  (CQ)             │    No syscall needed!                          │
│   │                   │                                                │
│   │  [result1][res2]  │                                                │
│   └───────────────────┘                                                │
│                                                                        │
│   Benefits:                                                            │
│   - Batch submissions (one syscall for many ops)                       │
│   - Zero-copy where possible                                           │
│   - Poll mode: no syscalls at all                                      │
│   - Supports: read, write, accept, connect, send, recv, etc.           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Chapter 7: File Systems

7.1 File System Basics

FILE SYSTEM STRUCTURE

                    ┌─────────────────────────────────────────┐
                    │              USER SPACE                 │
                    │                                         │
                    │  Application: open("/data/file.txt")    │
                    │                    │                    │
                    └────────────────────┼────────────────────┘
                                         │ System Call
                    ┌────────────────────┼────────────────────┐
                    │              KERNEL SPACE               │
                    │                    ▼                    │
                    │  ┌─────────────────────────────────┐    │
                    │  │    Virtual File System (VFS)    │    │
                    │  │    Common interface for all FS  │    │
                    │  └──────────────┬──────────────────┘    │
                    │                 │                       │
                    │    ┌────────────┼────────────┐          │
                    │    ▼            ▼            ▼          │
                    │  ┌────┐     ┌──────┐     ┌──────┐       │
                    │  │ext4│     │ XFS  │     │ NFS  │       │
                    │  └──┬─┘     └──┬───┘     └──┬───┘       │
                    │     │          │            │           │
                    │  ┌──┴──────────┴────────────┴──┐        │
                    │  │       Block Layer           │        │
                    │  │    (I/O scheduling, cache)  │        │
                    │  └──────────────┬──────────────┘        │
                    │                 │                       │
                    └─────────────────┼───────────────────────┘
                                      ▼
                    ┌─────────────────────────────────────────┐
                    │              HARDWARE                   │
                    │         (HDD, SSD, NVMe)                │
                    └─────────────────────────────────────────┘

7.2 Inodes and Directory Structure

INODE: FILE METADATA

┌────────────────────────────────────────────────────────────────────────┐
│                           INODE                                        │
│                                                                        │
│   inode number: 12345                                                  │
│                                                                        │
│   ├── File type (regular, directory, symlink, etc.)                    │
│   ├── Permissions (rwxr-xr-x)                                          │
│   ├── Owner (UID)                                                      │
│   ├── Group (GID)                                                      │
│   ├── Size (bytes)                                                     │
│   ├── Timestamps (atime, mtime, ctime)                                 │
│   ├── Link count                                                       │
│   └── Data block pointers:                                             │
│       ├── Direct blocks (12): point directly to data                   │
│       ├── Indirect block: points to block of pointers                  │
│       ├── Double indirect: pointer → pointers → pointers → data        │
│       └── Triple indirect: one more level                              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

NOTE: Filename is NOT in inode!
      Directory maps filename → inode number

DIRECTORY STRUCTURE:

  Directory "mydir" (inode 100):
  ┌────────────────────────────────┐
  │  .        →  100  (self)       │
  │  ..       →  50   (parent)     │
  │  file.txt →  12345             │
  │  subdir   →  200               │
  │  link.txt →  12345  (hardlink) │
  └────────────────────────────────┘
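
The name-to-inode mapping and hard-link behavior are easy to verify; a sketch
using a throwaway file (Unix-like filesystems; the ".link" name is arbitrary):

import os, tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello"); tmp.close()

link = tmp.name + ".link"
os.link(tmp.name, link)              # hard link: a second name for the same inode

print(os.stat(tmp.name).st_ino == os.stat(link).st_ino)   # True: one inode
print(os.stat(tmp.name).st_nlink)                         # link count is now 2
os.remove(link); os.remove(tmp.name)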

7.3 Page Cache (Buffer Cache)

PAGE CACHE: OS-LEVEL FILE CACHING

┌────────────────────────────────────────────────────────────────────────┐
│                         PAGE CACHE                                     │
│                                                                        │
│   Application                                                          │
│       │                                                                │
│       │  read(fd, buf, 4096)                                           │
│       ▼                                                                │
│   ┌────────────────────────────────────────────────────────────────┐   │
│   │                      PAGE CACHE (RAM)                          │   │
│   │                                                                │   │
│   │   ┌───────┬───────┬───────┬───────┬───────┬───────┐            │   │
│   │   │Page 0 │Page 1 │Page 2 │Page 3 │  ...  │Page N │            │   │
│   │   │(file1)│(file1)│(file2)│(file3)│       │       │            │   │
│   │   └───────┴───────┴───────┴───────┴───────┴───────┘            │   │
│   │                                                                │   │
│   │   Cache HIT: Return from RAM (~100ns)                          │   │
│   │   Cache MISS: Read from disk, cache, return (~10ms for HDD)    │   │
│   │                                                                │   │
│   └────────────────────────────────────────────────────────────────┘   │
│                                                                        │
│   WRITE BEHAVIOR:                                                      │
│                                                                        │
│   1. Write-back (default):                                             │
│      Write to cache → return immediately → flush later                 │
│      Fast but data loss risk on crash                                  │
│                                                                        │
│   2. Write-through:                                                    │
│      Write to cache AND disk → return                                  │
│      Slower but safer                                                  │
│                                                                        │
│   3. O_DIRECT:                                                         │
│      Bypass cache entirely                                             │
│      Used by databases that manage own cache                           │
│                                                                        │
│   4. fsync():                                                          │
│      Force flush to disk                                               │
│      Ensures durability                                                │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

SYSTEM DESIGN IMPACT:

  Reading same file twice?
    → Second read from cache (fast)
    
  Too many dirty pages piling up in RAM before write-back?
    → Tune vm.dirty_ratio, vm.dirty_background_ratio
    
  Database doing own caching?
    → Use O_DIRECT to avoid double caching
    
  Need durability?
    → fsync() after writes (databases do this)
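
To make the O_DIRECT point concrete, here's a minimal sketch (the filename is
made up). Note the alignment requirement: buffer, offset, and length must be
block-aligned, which is why databases that use O_DIRECT maintain their own
aligned buffer pools:

  /* Sketch: bypassing the page cache with O_DIRECT (a Linux-specific flag) */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("data.bin", O_RDONLY | O_DIRECT);
      if (fd < 0) { perror("open"); return 1; }

      void *buf;
      if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  /* aligned buffer */

      ssize_t n = read(fd, buf, 4096);   /* straight from disk, cache untouched */
      printf("read %zd bytes without touching the page cache\n", n);

      free(buf);
      close(fd);
      return 0;
  }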

7.4 Write-Ahead Logging (WAL)

WRITE-AHEAD LOGGING

Problem: 
  Updating data in place is not atomic
  Crash mid-write = corrupted data
  
Solution:
  Write to log FIRST, then apply changes
  On crash recovery, replay log

┌────────────────────────────────────────────────────────────────────────┐
│                    WRITE-AHEAD LOG (WAL)                               │
│                                                                        │
│   1. Write to WAL (append-only, sequential)                            │
│   2. fsync() WAL                                                       │
│   3. Apply changes to data files                                       │
│   4. Eventually checkpoint and truncate WAL                            │
│                                                                        │
│   Normal Operation:                                                    │
│   ────────────────────────────────────────────────────                 │
│   Transaction → WAL → fsync → Data Files                               │
│                                                                        │
│   Crash Recovery:                                                      │
│   ────────────────────────────────────────────────────                 │
│   Read WAL → Replay logged changes not yet applied → Consistent state  │
│                                                                        │
│   Why Sequential Writes?                                               │
│   ────────────────────────────────────────────────────                 │
│   HDD sequential: ~100 MB/s                                            │
│   HDD random:     ~1 MB/s                                              │
│   WAL is append-only → sequential → fast!                              │
│                                                                        │
│   Used By:                                                             │
│   - PostgreSQL (WAL)                                                   │
│   - MySQL InnoDB (redo log)                                            │
│   - SQLite (WAL mode)                                                  │
│   - Kafka (append-only log)                                            │
│   - LSM-tree databases (RocksDB, LevelDB)                              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
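
Steps 1-3 as a toy C sketch (the file name and record format are illustrative,
not any real database's):

  /* Sketch: log first, make it durable, only then touch the data files */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int wal_append(int wal_fd, const char *record) {
      if (write(wal_fd, record, strlen(record)) < 0) return -1;  /* step 1 */
      return fsync(wal_fd);                                      /* step 2 */
  }

  int main(void) {
      int wal_fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
      if (wal_fd < 0) { perror("open"); return 1; }

      if (wal_append(wal_fd, "SET balance=100\n") == 0) {
          /* step 3: safe to update the data file in place now;
             a crash before this point replays the record from the log */
      }
      close(wal_fd);
      return 0;
  }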

Part IV: System Limits and Tuning

Chapter 8: File Descriptors

8.1 Understanding File Descriptors

FILE DESCRIPTORS

Everything in Unix is a file (or file-like):
  - Regular files
  - Directories
  - Sockets (network connections)
  - Pipes
  - Devices

File Descriptor = integer handle to open file/resource

┌────────────────────────────────────────────────────────────────────────┐
│                    PROCESS FILE DESCRIPTOR TABLE                       │
│                                                                        │
│   fd │ Description        │ Points to                                  │
│   ───┼────────────────────┼────────────────────────────────────────    │
│   0  │ stdin              │ Terminal input                             │
│   1  │ stdout             │ Terminal output                            │
│   2  │ stderr             │ Terminal error                             │
│   3  │ log file           │ /var/log/app.log                           │
│   4  │ socket             │ TCP connection to database                 │
│   5  │ socket             │ TCP connection to client                   │
│   6  │ socket             │ TCP connection to client                   │
│   ...│                    │                                            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

LIMITS:

  Per-process limit:
    ulimit -n         # Show current limit (often 1024 default)
    ulimit -n 65535   # Increase for this shell
    
  System-wide limit:
    cat /proc/sys/fs/file-max   # System maximum
    
  For servers handling many connections:
    Each TCP connection = 1 file descriptor
    10,000 connections = need at least 10,000 fd limit
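
A server can also raise its own soft limit, up to the hard cap, at startup.
A minimal sketch with getrlimit/setrlimit:

  /* Sketch: bump this process's fd limit to the hard maximum */
  #include <stdio.h>
  #include <sys/resource.h>

  int main(void) {
      struct rlimit rl;
      getrlimit(RLIMIT_NOFILE, &rl);
      printf("soft=%llu hard=%llu\n",
             (unsigned long long)rl.rlim_cur,
             (unsigned long long)rl.rlim_max);

      rl.rlim_cur = rl.rlim_max;       /* soft limit may rise to the hard limit */
      if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
          perror("setrlimit");
      return 0;
  }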

8.2 Common Limits to Tune

┌────────────────────────────────────────────────────────────────────────┐
│                    SYSTEM LIMITS TO TUNE                               │
│                                                                        │
│  LIMIT                  │ DEFAULT    │ TUNED VALUE │ WHY               │
│  ───────────────────────┼────────────┼─────────────┼─────────────────  │
│                                                                        │
│  FILE DESCRIPTORS                                                      │
│  ulimit -n              │ 1,024      │ 65,535+     │ Many connections  │
│  /etc/security/limits.conf                                             │
│                                                                        │
│  PROCESSES/THREADS                                                     │
│  ulimit -u              │ ~30,000    │ 100,000+    │ Thread pool size  │
│  kernel.pid_max                                                        │
│                                                                        │
│  TCP TUNING                                                            │
│  net.core.somaxconn     │ 128        │ 65,535      │ Listen backlog    │
│  net.ipv4.tcp_max_syn_backlog │ 128  │ 65,535      │ SYN queue size    │
│  net.ipv4.ip_local_port_range │ 32768-60999 │ 1024-65535 │ More ports  │
│  net.ipv4.tcp_tw_reuse  │ 0          │ 1           │ Reuse TIME_WAIT   │
│                                                                        │
│  MEMORY                                                                │
│  vm.swappiness          │ 60         │ 10          │ Reduce swap usage │
│  vm.dirty_ratio         │ 20         │ 60          │ Write-back tuning │
│  vm.overcommit_memory   │ 0          │ 1           │ Redis requirement │
│                                                                        │
│  NETWORK BUFFERS                                                       │
│  net.core.rmem_max      │ 212992     │ 16777216    │ TCP receive buf   │
│  net.core.wmem_max      │ 212992     │ 16777216    │ TCP send buffer   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

CHECKING CURRENT VALUES:

  sysctl -a | grep <parameter>
  cat /proc/sys/net/core/somaxconn
  
SETTING VALUES:

  Temporary (until reboot):
    sysctl -w net.core.somaxconn=65535
    
  Permanent:
    echo "net.core.somaxconn = 65535" >> /etc/sysctl.conf
    sysctl -p

Chapter 9: Containerization and Namespaces

9.1 Linux Namespaces

NAMESPACES: ISOLATION FOR CONTAINERS

Namespaces provide isolated views of system resources:

┌────────────────────────────────────────────────────────────────────────┐
│                         LINUX NAMESPACES                               │
│                                                                        │
│  NAMESPACE      │ ISOLATES                    │ EXAMPLE                │
│  ───────────────┼─────────────────────────────┼──────────────────────  │
│  PID            │ Process IDs                 │ Container sees PID 1   │
│  NET            │ Network stack               │ Own IP, ports, routes  │
│  MNT            │ Mount points                │ Own filesystem view    │
│  UTS            │ Hostname, domain            │ Container hostname     │
│  IPC            │ IPC resources               │ Own message queues     │
│  USER           │ User/group IDs              │ Root in container      │
│  CGROUP         │ Cgroup visibility           │ Own cgroup root        │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

CONTAINER = Process with all namespaces isolated

  Host                           Container
  ────────────────────────────   ────────────────────────────
  PID 1: systemd                 PID 1: nginx (actually 25463 on host)
  PID 2: ...                     PID 2: nginx worker
  ...                            
  PID 25463: nginx               
  
  Network: eth0 (10.0.0.5)       Network: eth0 (172.17.0.2)
  
  Filesystem: /                  Filesystem: / (overlay on host)
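
A namespace is ultimately just a flag to clone(2). A minimal sketch (needs
root or CAP_SYS_ADMIN) that drops a child into fresh UTS and PID namespaces,
where it sees itself as PID 1 with its own hostname:

  /* Sketch: a bare-bones "container" via clone() with namespace flags */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <signal.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static char stack[1024 * 1024];          /* child stack; clone takes the top */

  static int child(void *arg) {
      sethostname("container", 9);         /* private hostname (UTS namespace) */
      printf("pid inside namespace: %d\n", getpid());   /* prints 1 */
      return 0;
  }

  int main(void) {
      pid_t pid = clone(child, stack + sizeof(stack),
                        CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
      if (pid < 0) { perror("clone (root required?)"); return 1; }
      waitpid(pid, NULL, 0);
      return 0;
  }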

9.2 Control Groups (cgroups)

CGROUPS: RESOURCE LIMITS FOR CONTAINERS

┌────────────────────────────────────────────────────────────────────────┐
│                           CGROUPS                                      │
│                                                                        │
│  Resource limits and accounting for process groups:                    │
│                                                                        │
│  CPU:                                                                  │
│    cpu.shares          Relative CPU weight                             │
│    cpu.cfs_quota_us    Hard CPU limit (μs per period)                  │
│    cpu.cfs_period_us   Period length (default 100ms)                   │
│                                                                        │
│    Example: 50% CPU = quota 50000, period 100000                       │
│                                                                        │
│  MEMORY:                                                               │
│    memory.limit_in_bytes    Hard memory limit                          │
│    memory.soft_limit_in_bytes    Soft limit (reclaim target)           │
│    memory.oom_control       OOM killer behavior                        │
│                                                                        │
│    Container exceeds limit → OOM killed                                │
│                                                                        │
│  I/O:                                                                  │
│    blkio.throttle.read_bps_device    Read bandwidth limit              │
│    blkio.throttle.write_bps_device   Write bandwidth limit             │
│                                                                        │
│  NETWORK:                                                              │
│    (Handled by tc/iptables, not cgroups directly)                      │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
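
Under the hood these knobs are plain files. A sketch of the 50% CPU example
above, assuming a mounted cgroup v1 cpu controller; "mygroup" and the PID are
placeholders, and cgroup v2 expresses the same limit as a single cpu.max file:

  /* Sketch: cap a group at 50% of one CPU by writing cgroup v1 files */
  #include <stdio.h>

  static int write_file(const char *path, const char *val) {
      FILE *f = fopen(path, "w");
      if (!f) { perror(path); return -1; }
      fputs(val, f);
      return fclose(f);
  }

  int main(void) {
      /* 50,000 us of CPU per 100,000 us period = 50% of one core */
      write_file("/sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us", "100000");
      write_file("/sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us", "50000");
      /* move a process in; it is throttled from then on */
      write_file("/sys/fs/cgroup/cpu/mygroup/cgroup.procs", "12345");
      return 0;
  }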

KUBERNETES RESOURCE REQUESTS/LIMITS:

  apiVersion: v1
  kind: Pod
  spec:
    containers:
    - name: app
      resources:
        requests:           # Scheduling guarantee
          memory: "256Mi"
          cpu: "250m"       # 0.25 cores
        limits:             # Hard ceiling
          memory: "512Mi"   # OOM killed if exceeded
          cpu: "500m"       # Throttled if exceeded

Part V: Interview Questions and Answers

Chapter 10: Common Interview Questions

10.1 Process and Thread Questions

Q: "What's the difference between a process and a thread?"

GREAT ANSWER:
"A process is an independent execution unit with its own memory space,
file descriptors, and system resources. Processes are isolated from each
other — one crashing doesn't affect others.

A thread is a lightweight execution unit within a process. Multiple threads
share the same memory space, heap, and file descriptors, but each has its
own stack and registers.

Key differences:
- Memory: Processes isolated, threads share
- Creation cost: Process ~10ms, thread ~10μs  
- Communication: Processes need IPC, threads share memory directly
- Crash impact: Process crash is isolated, thread crash kills whole process

I'd use multiple processes when I need isolation (like running untrusted
code) or want to leverage multiple machines. I'd use threads when I need
shared state and low-latency communication within a single machine.

Modern systems often combine both — like nginx using multiple worker
processes, each with async I/O handling many connections."
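
A sketch that makes the memory difference concrete: a forked child writes to
its own copy-on-write copy of the variable, while a thread writes to the one
copy the whole process shares:

  /* Sketch: process isolation vs. thread sharing (build: cc demo.c -pthread) */
  #include <pthread.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int shared = 0;      /* threads see one copy; a forked child gets its own */

  void *thread_fn(void *arg) {
      shared = 42;                              /* same address space */
      return NULL;
  }

  int main(void) {
      pid_t pid = fork();                       /* copy-on-write duplicate */
      if (pid == 0) { shared = 7; _exit(0); }   /* invisible to the parent */
      waitpid(pid, NULL, 0);
      printf("after fork:   shared=%d\n", shared);   /* still 0 */

      pthread_t t;
      pthread_create(&t, NULL, thread_fn, NULL);
      pthread_join(t, NULL);
      printf("after thread: shared=%d\n", shared);   /* now 42 */
      return 0;
  }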

────────────────────────────────────────────────────────────────────────────

Q: "How would you handle 1 million concurrent connections?"

GREAT ANSWER:
"This is the C10M problem. The key is minimizing overhead per connection.

First, avoid thread-per-connection. At 1M connections, that's 1M threads,
each using memory for stack (~1MB default). That's 1TB just for stacks!

Instead, I'd use:

1. Event-driven architecture: epoll/kqueue with non-blocking I/O. One thread
   can handle thousands of connections by only processing when data is ready.

2. Multiple event loops: One per CPU core. With 32 cores, each handles ~30K
   connections.

3. System tuning:
   - File descriptor limits: at least 1M
   - Ephemeral port range expanded
   - TCP buffer sizes optimized
   - somaxconn increased for accept queue

4. Memory efficiency:
   - Connection state < 1KB per connection
   - Zero-copy I/O where possible
   - Object pooling to avoid allocation overhead

5. Consider io_uring on modern Linux for even better performance — it
   reduces syscall overhead significantly.

Companies like Discord and WhatsApp handle millions of connections using
these techniques."

10.2 Memory Questions

Q: "Explain virtual memory and why it's important."

GREAT ANSWER:
"Virtual memory gives each process the illusion of having its own
contiguous address space, even though physical memory is shared and
fragmented.

How it works:
- Each process has a page table mapping virtual to physical addresses
- CPU's MMU translates addresses on every memory access
- TLB caches recent translations for performance

Key benefits:

1. Isolation: Process A can't access Process B's memory. Even if both use
   virtual address 0x1000, they map to different physical frames.

2. Simplification: Programmer doesn't worry about where memory physically
   is. malloc() returns an address that just works.

3. Overcommit: Can allocate more virtual memory than physical RAM. Pages
   not actively used can be swapped to disk.

4. Shared memory: Same physical page can appear in multiple address spaces.
   Used for shared libraries (libc loaded once, shared by all).

5. Memory-mapped files: Map files into address space, let OS handle I/O.

The cost is the page table walk on TLB miss (~100 cycles). That's why TLB
hit rate matters, and why huge pages (2MB instead of 4KB) can help —
fewer TLB entries needed to cover same memory."
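
For the memory-mapped-file point, a minimal sketch (data.bin is a
placeholder). Pages are loaded lazily: the first touch of each page faults it
in, and the OS handles all the I/O:

  /* Sketch: letting the OS page a file in on demand with mmap */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("data.bin", O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }
      struct stat st;
      fstat(fd, &st);

      char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }

      long sum = 0;
      for (off_t i = 0; i < st.st_size; i++)
          sum += p[i];                 /* each new page: one fault, then cached */
      printf("checksum: %ld\n", sum);

      munmap(p, st.st_size);
      close(fd);
      return 0;
  }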

────────────────────────────────────────────────────────────────────────────

Q: "What happens when a process accesses memory that's not in RAM?"

GREAT ANSWER:
"That triggers a page fault. There are three types:

1. Minor page fault: The page exists in memory but isn't mapped in the
   page table. Common after fork() with copy-on-write, or first access
   to newly allocated memory. OS just updates the page table. Cost: ~1μs.

2. Major page fault: The page isn't in memory at all. Either it was
   swapped out, or it's a memory-mapped file being accessed. OS must
   read from disk. Cost: ~10ms for HDD, ~100μs for SSD.

3. Invalid fault: The address isn't valid — null pointer, stack overflow,
   or permission violation. Results in SIGSEGV (segmentation fault).

For high-performance systems, you want to minimize major page faults:
- Disable swap for databases (or use fast NVMe for swap)
- Pre-fault memory on startup (touch all pages)
- Use mlock() to prevent critical pages from swapping
- Monitor page fault rates (vmstat, perf)

Databases like Redis prefer to crash rather than swap — OOM is better
than unpredictable latency from page faults."
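
The pre-fault plus mlock() pattern from the answer, as a minimal sketch (the
size is arbitrary; mlock needs a sufficient RLIMIT_MEMLOCK):

  /* Sketch: pay the page faults up front, then pin the pages in RAM */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      size_t len = 64 * 1024 * 1024;           /* 64 MiB working set */
      char *buf = malloc(len);
      if (!buf) return 1;

      memset(buf, 0, len);          /* touch every page: faults happen now */
      if (mlock(buf, len) != 0)     /* pin: these pages can't be swapped out */
          perror("mlock");

      /* ... latency-critical work on buf, free of major faults ... */

      munlock(buf, len);
      free(buf);
      return 0;
  }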

10.3 I/O Questions

Q: "Explain the different I/O models in Linux."

GREAT ANSWER:
"There are five main I/O models, each with different tradeoffs:

1. Blocking I/O:
   read() blocks until data available. Simple but wastes thread while
   waiting. One thread per connection doesn't scale.

2. Non-blocking I/O:
   read() returns immediately with EAGAIN if no data. Application must
   poll repeatedly. Wastes CPU cycles polling.

3. I/O Multiplexing (select/poll/epoll):
   One syscall monitors many file descriptors. epoll_wait() blocks until
   ANY descriptor is ready, then returns which ones. Efficient for many
   connections. This is what nginx, Node.js, Redis use.

4. Signal-driven I/O:
   Kernel sends signal when I/O is ready. Application handles in signal
   handler. Rarely used — signals are tricky.

5. Async I/O (aio, io_uring):
   Submit I/O operations, kernel completes them in background. io_uring
   on modern Linux is the most efficient — submissions and completions
   via shared memory rings, minimal syscalls.

For a high-performance server, I'd use epoll (or kqueue on BSD) for
network I/O. For disk I/O where async matters (like databases), I'd
consider io_uring on Linux 5.1+.

The key insight: match I/O model to workload. CPU-bound work needs
threads. I/O-bound work with many connections needs event-driven."

────────────────────────────────────────────────────────────────────────────

Q: "How does epoll differ from select?"

GREAT ANSWER:
"Both monitor multiple file descriptors, but epoll is far more efficient
at scale.

select() problems:
- Limited to 1024 fds (FD_SETSIZE)
- O(n) kernel scan of ALL fds each call
- Must rebuild fd_set after each call
- Returns all fds, you check which are ready

epoll() solutions:
- No fd limit (just system resources)
- O(1) monitoring — kernel maintains interest list
- Register once with epoll_ctl(), monitor forever
- Only returns READY fds, not all fds

Performance comparison with 10,000 connections:
- select/poll: Scan all 10,000 fds per call, even if only 1 is ready
  (select alone can't even exceed 1,024 fds without recompiling)
- epoll: Return immediately with just the 1 ready fd

epoll also supports edge-triggered mode: only notifies on state CHANGE,
not while state IS ready. More efficient but requires draining the buffer
completely each notification.

For any serious server handling more than a few hundred connections,
epoll (or kqueue on macOS/BSD) is essential."
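
The skeleton of an epoll loop, as a sketch; creating and binding listen_fd is
assumed to have happened already:

  /* Sketch: register once, then loop over only the READY fds */
  #include <sys/epoll.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #define MAX_EVENTS 64

  void event_loop(int listen_fd) {
      int epfd = epoll_create1(0);
      struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
      epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);    /* register once */

      struct epoll_event events[MAX_EVENTS];
      for (;;) {
          /* blocks until at least one registered fd is ready */
          int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
          for (int i = 0; i < n; i++) {
              if (events[i].data.fd == listen_fd) {      /* new connection */
                  int client = accept(listen_fd, NULL, NULL);
                  struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                  epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
              } else {                                   /* data from a client */
                  char buf[4096];
                  ssize_t r = read(events[i].data.fd, buf, sizeof buf);
                  if (r <= 0) { close(events[i].data.fd); continue; }
                  /* ... handle r bytes ... */
              }
          }
      }
  }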

10.4 Concurrency Questions

Q: "What is a deadlock and how do you prevent it?"

GREAT ANSWER:
"A deadlock is when two or more threads are waiting for each other to
release resources, creating a cycle where none can proceed.

Four conditions must ALL be true for deadlock:
1. Mutual exclusion: Resources can't be shared
2. Hold and wait: Thread holds one resource while waiting for another
3. No preemption: Resources can't be forcibly taken
4. Circular wait: A cycle exists in the wait graph

Prevention strategies break one condition:

1. Lock ordering: Always acquire locks in the same order globally.
   If everyone acquires A before B, no cycle is possible.

2. Lock timeout: Don't wait forever. Try to acquire, timeout, release
   all locks, retry with backoff. Breaks 'hold and wait'.

3. Single lock: Use one coarse-grained lock instead of many fine-grained
   ones. Simpler but reduces parallelism.

4. Lock-free algorithms: Use atomic operations (CAS) instead of locks.
   No locks = no deadlock. But complex to implement correctly.

5. Deadlock detection: Build wait-for graph, detect cycles, abort one
   transaction. Used by databases.

In practice, I use a combination:
- Lock ordering for related locks
- Timeouts for external resources
- Careful design to minimize lock scope"
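
Lock ordering in miniature: ordering by address is one common convention that
gives every thread the same global order for any pair of locks (a sketch;
real code would bury this inside the locking layer):

  /* Sketch: acquire any two mutexes in a single global order
     (build: cc demo.c -pthread) */
  #include <pthread.h>

  void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
      if (a < b) { pthread_mutex_lock(a); pthread_mutex_lock(b); }
      else       { pthread_mutex_lock(b); pthread_mutex_lock(a); }
  }

  void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
      pthread_mutex_unlock(a);
      pthread_mutex_unlock(b);
  }

  /* transfer(A, B) and transfer(B, A) now lock in the same order,
     so the circular-wait condition can never form */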

────────────────────────────────────────────────────────────────────────────

Q: "What's the difference between a mutex and a semaphore?"

GREAT ANSWER:
"Both are synchronization primitives, but with different semantics:

Mutex (Mutual Exclusion):
- Binary: locked or unlocked
- Ownership: Thread that locks must unlock
- Purpose: Protect critical sections (one thread at a time)
- Use case: Protecting shared data structure

Semaphore:
- Counting: Can allow N threads (not just 1)
- No ownership: Any thread can signal (increment)
- Purpose: Control access to limited resources
- Use case: Connection pool with max 10 connections

Example differences:

Mutex for protecting a map:
  mutex.lock()
  map[key] = value  // Only one thread here
  mutex.unlock()

Semaphore for connection pool:
  sem = Semaphore(10)     // Allow 10 concurrent
  sem.acquire()           // Blocks if 10 are already held
  conn = pool.get()
  // use connection
  pool.release(conn)
  sem.release()           // Allow another thread in

Binary semaphore (count=1) looks like a mutex but lacks ownership —
any thread can release it. This makes it suitable for signaling between
threads (producer signals consumer) but not for protecting critical sections.

I prefer mutex for protection, semaphore for resource counting."
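
The same two patterns with POSIX primitives, as a sketch (the map and the
connection pool themselves are elided):

  /* Sketch: mutex for a critical section, semaphore for counted resources
     (build: cc demo.c -pthread) */
  #include <pthread.h>
  #include <semaphore.h>

  pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;
  sem_t pool_slots;                        /* counting semaphore */

  void put(const char *key, int value) {
      pthread_mutex_lock(&map_lock);       /* one thread at a time */
      /* ... map[key] = value ... */
      pthread_mutex_unlock(&map_lock);     /* the locker unlocks */
  }

  void use_connection(void) {
      sem_wait(&pool_slots);               /* blocks once all 10 are held */
      /* ... take a connection, use it, return it ... */
      sem_post(&pool_slots);               /* any thread may post */
  }

  int main(void) {
      sem_init(&pool_slots, 0, 10);        /* allow 10 concurrent holders */
      return 0;
  }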

Summary

┌────────────────────────────────────────────────────────────────────────┐
│                    OS CONCEPTS KEY TAKEAWAYS                           │
│                                                                        │
│  PROCESSES & THREADS:                                                  │
│  • Process: isolated memory, heavy, multi-machine scalable             │
│  • Thread: shared memory, light, single-machine only                   │
│  • Modern: hybrid (Go goroutines, event loops + thread pools)          │
│                                                                        │
│  SYNCHRONIZATION:                                                      │
│  • Mutex: one thread at a time, has ownership                          │
│  • Semaphore: N threads, counting, no ownership                        │
│  • RWLock: many readers OR one writer                                  │
│  • Atomic: lock-free, hardware-supported                               │
│                                                                        │
│  MEMORY:                                                               │
│  • Virtual memory: isolation, simplification, overcommit               │
│  • Page fault: minor (fast) vs major (disk I/O, slow)                  │
│  • Cache hierarchy: L1 (1ns) → L2 (4ns) → L3 (20ns) → RAM (100ns)      │
│  • Cache-friendly code: sequential access, SoA over AoS                │
│                                                                        │
│  I/O MODELS:                                                           │
│  • Blocking: simple but doesn't scale                                  │
│  • Non-blocking + epoll: handles many connections efficiently          │
│  • io_uring: next-gen async I/O on Linux                               │
│  • Event loop: single thread handles thousands of connections          │
│                                                                        │
│  FILE SYSTEMS:                                                         │
│  • Page cache: OS caches file data in RAM                              │
│  • fsync(): ensures durability (flush to disk)                         │
│  • WAL: write-ahead log for crash recovery                             │
│                                                                        │
│  SYSTEM TUNING:                                                        │
│  • File descriptors: increase ulimit for many connections              │
│  • TCP: somaxconn, port range, buffer sizes                            │
│  • Memory: swappiness, dirty ratios                                    │
│                                                                        │
│  CONTAINERS:                                                           │
│  • Namespaces: isolation (PID, NET, MNT, etc.)                         │
│  • Cgroups: resource limits (CPU, memory, I/O)                         │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

📚 Further Reading

Books

  • Operating Systems: Three Easy Pieces (free online)
  • Linux Kernel Development by Robert Love
  • Understanding the Linux Kernel by Bovet & Cesati
  • Systems Performance by Brendan Gregg

Tools & Observability

  • perf: Linux profiling
  • strace: System call tracing
  • vmstat: Virtual memory statistics
  • iostat: I/O statistics
  • htop/top: Process monitoring
  • bpftrace: Dynamic tracing


End of Week 0 — Part 5: Operating System Concepts

Week 0 Complete! You now have the foundation to dive into system design patterns in Week 1 — Foundations of Scale.