Bonus Problem 8: AWS S3
The World's Most Durable Storage System: 11 Nines That Changed Everything
How Do You Promise Never to Lose a Single File Out of 500 Trillion?
Imagine this challenge: you're storing 500 trillion objects, hundreds of exabytes of data, for millions of customers worldwide. Hard drives fail constantly. Data centers lose power. Natural disasters strike. And yet you promise to lose, on average, no more than 1 object in 10 million... every 10,000 years.
This is Amazon S3, the storage system that delivers both strong consistency and 99.999999999% durability at planetary scale. It's not just cloud storage. It's the engineering benchmark that other storage services are measured against.
THE AWS S3 SCALE (2025)

  STORAGE
    Objects stored:            500+ trillion
    Data stored:               hundreds of exabytes
    Single object max size:    5 TB (up to 50 TB in 2025)

  REQUESTS
    Requests per second:       200+ million (peak)
    GET requests per prefix:   5,500/second
    PUT requests per prefix:   3,500/second

  DURABILITY
    Durability:                99.999999999% (11 nines)
    Availability (Standard):   99.99%
    Availability Zones (min):  3 per region

  ARCHITECTURE
    Microservices:             350+
    AWS Regions:               30+
    Storage classes:           9 tiers

  HISTORY
    Launched:                  March 14, 2006
    First AWS service:         Yes (alongside SQS)
    Years of operation:        19
This is the system we'll design today - and along the way we'll see how AWS achieved what long seemed impossible: strong consistency at 11 nines of durability.
The Interview Begins
You walk into the interview room. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to design a distributed object storage system like S3. I'm interested in how you think about durability, consistency, and scale. Please think out loudβthis is collaborative."
They write on the whiteboard:
  Design a Distributed Object Storage System

  Requirements:
  - Store objects up to 5 TB each
  - 11 nines (99.999999999%) durability
  - Strong read-after-write consistency
  - Handle millions of requests per second
  - Multiple storage tiers for cost optimization
  - Global availability across multiple regions
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions.
Your Questions
You: "Before I start designing, let me clarify some requirements. What's the target scaleβhow many objects and how much data?"
Interviewer: "Think planetary scale. Hundreds of trillions of objects, exabytes of data."
You: "For durability, you mentioned 11 nines. What does that mean in practice?"
Interviewer: "If you store 10 million objects, you might lose 1 object every 10,000 years. Durability is non-negotiableβdata loss is catastrophic."
You: "What about consistency? Traditional distributed systems trade consistency for availability."
Interviewer: "We need strong read-after-write consistency. After a write succeeds, any subsequent read must return the latest data. No stale reads."
You: "What are the latency requirements for reads and writes?"
Interviewer: "First byte latency around 100-200ms for small objects. Large objects can be slower but must support parallel downloads."
You: "Perfect. Let me summarize the requirements."
Functional Requirements
1. OBJECT OPERATIONS
- PUT: Upload objects up to 5 TB
- GET: Download objects (full or byte-range)
- DELETE: Remove objects
- LIST: Enumerate objects with prefix filtering
- HEAD: Get object metadata without content
- Multipart upload for large objects
2. ORGANIZATION
- Buckets: Containers for objects
- Keys: Unique identifiers within buckets
- Prefixes: Logical grouping (like folders)
- Metadata: Custom key-value pairs per object
3. DATA MANAGEMENT
- Versioning: Keep all versions of objects
- Lifecycle policies: Auto-transition storage tiers
- Replication: Cross-region copying
- Encryption: At rest and in transit
4. ACCESS CONTROL
- IAM policies: User-level permissions
- Bucket policies: Resource-level permissions
- ACLs: Fine-grained object access
- Presigned URLs: Temporary access
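Taken together, these operations imply a compact API surface. Below is a minimal sketch of that interface in Python; the ObjectStore protocol and its method names are illustrative assumptions, not the actual S3 SDK.

# api/object_store.py (illustrative interface sketch, not the real S3 SDK)
from typing import Iterator, Optional, Protocol


class ObjectStore(Protocol):
    """Minimal object-store surface implied by the functional requirements."""

    def put_object(self, bucket: str, key: str, data: bytes,
                   metadata: Optional[dict] = None) -> str:
        """Upload an object (large objects go through multipart upload); returns an ETag."""
        ...

    def get_object(self, bucket: str, key: str,
                   byte_range: Optional[tuple] = None) -> bytes:
        """Download an object, or just a byte range of it."""
        ...

    def head_object(self, bucket: str, key: str) -> dict:
        """Return object metadata without downloading the content."""
        ...

    def delete_object(self, bucket: str, key: str) -> None:
        """Remove an object (or add a delete marker when versioning is enabled)."""
        ...

    def list_objects(self, bucket: str, prefix: str = "") -> Iterator[str]:
        """Enumerate keys under a prefix, in lexicographic order."""
        ...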
Non-Functional Requirements
1. DURABILITY
- 99.999999999% (11 nines)
- 1 object lost per 10 million over 10,000 years
- Survive AZ failures, region failures
2. AVAILABILITY
- 99.99% for Standard tier
- 99.9% for infrequent access tiers
- Graceful degradation under load
3. CONSISTENCY
- Strong read-after-write for all operations
- Strong consistency for LIST operations
- No stale reads after successful writes
4. PERFORMANCE
- First byte latency: 100-200ms
- 3,500 PUT/second per prefix
- 5,500 GET/second per prefix
- Virtually unlimited aggregate throughput
Phase 2: Back of the Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale."
Storage Estimation
OBJECT STORAGE

  Scale:
    Objects stored:           500 trillion
    Average object size:      ~500 KB (highly variable)
    Total logical data:       ~250 exabytes

  Storage overhead:
    Erasure coding overhead:  ~1.6x (vs 3x for replication)
    Metadata per object:      ~1 KB
    Metadata storage:         500 trillion × 1 KB = 500 PB

  With durability overhead:
    Raw storage needed:       250 EB × 1.6 = 400 EB
    Across 3+ AZs:            distributed globally
Request Estimation
REQUEST VOLUME

  Peak requests:
    Total requests/second:    200+ million

  Per-prefix limits:
    GET/HEAD per prefix:      5,500/second
    PUT/DELETE per prefix:    3,500/second

  Daily requests:
    200M × 86,400 seconds ≈ 17 trillion requests/day
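These estimates are easy to sanity-check in a few lines; the constants below are simply the figures from this section, not measured values.

# estimation/back_of_envelope.py (sanity check of the figures above)
OBJECTS = 500e12                # 500 trillion objects
AVG_OBJECT_SIZE = 500e3         # ~500 KB average
EC_OVERHEAD = 1.6               # 10 data + 6 parity shards
METADATA_PER_OBJECT = 1e3       # ~1 KB of metadata per object

logical_bytes = OBJECTS * AVG_OBJECT_SIZE        # ~2.5e20 B  (~250 EB)
raw_bytes = logical_bytes * EC_OVERHEAD          # ~4.0e20 B  (~400 EB)
metadata_bytes = OBJECTS * METADATA_PER_OBJECT   # ~5.0e17 B  (~500 PB)

PEAK_RPS = 200e6
daily_requests = PEAK_RPS * 86_400               # ~1.7e13 (~17 trillion/day)

# 11 nines durability = annual per-object loss probability of about 1e-11
expected_annual_loss = OBJECTS * 1e-11           # ~5,000 objects/year (design bound)

print(f"logical ~{logical_bytes / 1e18:.0f} EB, raw ~{raw_bytes / 1e18:.0f} EB, "
      f"metadata ~{metadata_bytes / 1e15:.0f} PB")
print(f"requests ~{daily_requests / 1e12:.1f} trillion/day, "
      f"expected loss ~{expected_annual_loss:.0f} objects/year")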
Key Metrics Summary
ESTIMATION SUMMARY

  STORAGE
    Objects:                  500 trillion
    Raw data:                 ~250 exabytes
    With redundancy (1.6x):   ~400 exabytes

  REQUESTS
    Peak:                     200+ million/second
    Daily:                    ~17 trillion
    Per-prefix GET limit:     5,500/second

  INFRASTRUCTURE
    Availability Zones:       3+ per region
    Regions:                  30+
    Hard drives:              millions

  DURABILITY MATH
    Annual per-object loss rate:  ~0.000000001% (10^-11)
    Objects at risk:              ~1 in 100 billion per year
    Expected loss bound:          ~5,000 objects/year out of 500 trillion
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture."
System Architecture
S3 HIGH-LEVEL ARCHITECTURE

  CLIENTS
    AWS SDK  |  REST API  |  AWS CLI  |  Console
                        |
                        v
  +------------------------- AWS REGION --------------------------+
  |                                                                |
  |  FRONT END LAYER                                               |
  |    DNS -> Load Balancer -> Authentication -> Request Routing   |
  |                        |                                       |
  |                        v                                       |
  |  INDEX/METADATA LAYER                                          |
  |    Partition Engine | Global Metadata Store | Lifecycle Mgr    |
  |                        |                                       |
  |                        v                                       |
  |  STORAGE LAYER                                                 |
  |    Storage Nodes | Erasure Coding | Repair Service             |
  |                                                                |
  |    AZ-1 (disks)        AZ-2 (disks)        AZ-3 (disks)        |
  +----------------------------------------------------------------+
Component Breakdown
1. Front End Layer
Purpose: Handle all incoming requests, authentication, routing
Key components:
- DNS Service: Route requests to correct region/endpoint
- Load Balancers: Distribute traffic across API servers
- Authentication: Validate IAM credentials, bucket policies
- Request Router: Direct to correct storage nodes
2. Index/Metadata Layer
Purpose: Track object locations without storing data
Key components:
- Partition Engine: Distribute keys across index partitions
- Global Metadata Store: Map keys to physical storage locations
- Lifecycle Manager: Handle versioning, retention, expiration
3. Storage Layer
Purpose: Physical storage with durability guarantees
Key components:
- Storage Nodes: Thousands of servers with HDDs/SSDs
- Erasure Coding: Split data into shards with parity
- Repair Service: Detect and recover from failures
- Cross-AZ Distribution: Spread across availability zones
Data Flow: PUT Object
PUT OBJECT FLOW

1. CLIENT REQUEST
   PUT /bucket/key HTTP/1.1
   Content-Length: 1048576
   x-amz-content-sha256: <checksum>

2. FRONT END
   - DNS resolves to the regional endpoint
   - Load balancer routes to an API server
   - Authenticate the request (SigV4)
   - Validate the bucket policy

3. INDEX LAYER
   - Hash the key to determine its partition
   - Acquire a write lock on the key
   - Generate the storage location

4. STORAGE LAYER
   - Erasure code the data (e.g., 10 data + 6 parity shards)
   - Write shards to storage nodes across AZs
   - Calculate and store checksums
   - Bracketing: verify the data can be reconstructed

5. COMMIT
   - Update the metadata index
   - Release the write lock
   - Return 200 OK to the client

DURABILITY: the object survives as long as ANY 10 of its 16 shards survive.
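Condensed into code, the happy path looks roughly like the sketch below. The auth, index, and storage objects are hypothetical stand-ins for the three layers; none of this is actual S3 source.

# flows/put_object.py (hypothetical orchestration of the PUT path above)
import hashlib


async def put_object(request, auth, index, storage) -> dict:
    """Happy-path PUT: authenticate, lock, erasure-code, commit, acknowledge."""
    # 2. Front end: authenticate (SigV4) and authorize against the bucket policy
    principal = await auth.verify_sigv4(request)
    await auth.check_bucket_policy(principal, request.bucket, "s3:PutObject")

    # 3. Index layer: hash the key to a partition and take a write lock
    partition = index.partition_for(request.bucket, request.key)
    async with index.write_lock(partition, request.key):
        # 4. Storage layer: erasure-code across AZs and verify (bracketing)
        shards = await storage.encode_and_store(request.key, request.body)

        # 5. Commit: the metadata update is what makes the object visible
        etag = hashlib.md5(request.body).hexdigest()
        await index.commit(request.bucket, request.key, shards, etag)

    return {"status": 200, "etag": etag}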
Phase 4: Deep Dives (20 minutes)
Interviewer: "Let's dive into how you achieve 11 nines of durability."
Deep Dive 1: Erasure Coding for Durability (Week 1 Concepts)
The Problem
DURABILITY CHALLENGE
Hard drive failure rates:
- Annual failure rate: 1-5% per drive
- At S3 scale: Thousands of drives fail daily
- Must lose ZERO customer data
Simple replication (3x):
- Store 3 copies across AZs
- Problem: 3x storage overhead
- Problem: All 3 could fail (rare but possible)
Need: High durability with lower storage overhead
The Solution: Erasure Coding
ERASURE CODING ARCHITECTURE

  ORIGINAL DATA: a 1 MB object
        |
        v
  ERASURE ENCODE (Reed-Solomon 10+6)
    Split into 10 data shards, compute 6 parity shards
        |
        v
  16 SHARDS (~100 KB each)
    D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 | P1 P2 P3 P4 P5 P6
        |
        v
  DISTRIBUTE ACROSS AZs AND DRIVES
    AZ-1 (drives A-E): D1 D4 D7 D10 P1 P4   (6 shards)
    AZ-2 (drives F-J): D2 D5 D8 P2 P5       (5 shards)
    AZ-3 (drives K-P): D3 D6 D9 P3 P6       (5 shards)

  RECOVERY: reconstruct from ANY 10 of the 16 shards;
            up to 6 shards (or an entire AZ) can be lost and the object still recovers

  OVERHEAD: 1.6x storage (vs 3x for triple replication)
Implementation
# storage/erasure_coding.py
"""
Erasure Coding for 11 Nines Durability
Uses Reed-Solomon coding to achieve extreme durability
with only 1.6x storage overhead (vs 3x for replication).
"""
from dataclasses import dataclass
from typing import List, Tuple
import hashlib


class DurabilityError(Exception):
    """Raised when data cannot be durably stored or reconstructed."""
@dataclass
class Shard:
"""A single shard of erasure-coded data."""
shard_id: int
data: bytes
checksum: str
storage_node: str
availability_zone: str
@dataclass
class ErasureConfig:
"""Configuration for erasure coding."""
data_shards: int = 10 # K: minimum shards to reconstruct
parity_shards: int = 6 # M: redundant shards
# Total = 16 shards, can lose up to 6
class ErasureCodingService:
"""
Erasure coding service for S3 durability.
Uses Reed-Solomon (10, 6) coding:
- Split data into 10 data shards
- Compute 6 parity shards
- Can reconstruct from any 10 of 16 shards
Applies concepts:
- Week 1: Replication and partitioning
- Week 1: Data durability strategies
"""
def __init__(self, config: ErasureConfig, storage_nodes: List):
self.config = config
self.storage_nodes = storage_nodes
self.total_shards = config.data_shards + config.parity_shards
async def encode_and_store(
self,
object_key: str,
data: bytes
) -> List[Shard]:
"""
Encode data and distribute shards across AZs.
Steps:
1. Calculate checksum of original data
2. Split into data shards
3. Compute parity shards
        4. Select storage nodes (AZ-aware placement)
        5. Write shards to storage
        6. Verify (bracketing) before confirming
        """
"""
# Step 1: Original checksum
original_checksum = self._compute_checksum(data)
# Step 2: Split into data shards
data_shards = self._split_data(data, self.config.data_shards)
# Step 3: Compute parity shards
parity_shards = self._compute_parity(data_shards)
# Step 4: Select storage nodes (AZ-aware)
all_shards = data_shards + parity_shards
shard_placements = self._select_nodes(len(all_shards))
# Step 5: Write shards to storage
stored_shards = []
for i, (shard_data, (node, az)) in enumerate(
zip(all_shards, shard_placements)
):
checksum = self._compute_checksum(shard_data)
shard = Shard(
shard_id=i,
data=shard_data,
checksum=checksum,
storage_node=node,
availability_zone=az
)
await self._write_shard(shard)
stored_shards.append(shard)
# Step 6: BRACKETING - verify we can reconstruct
if not await self._verify_reconstruction(
stored_shards, original_checksum
):
raise DurabilityError("Bracketing verification failed")
return stored_shards
async def read_and_decode(
self,
shards: List[Shard]
) -> bytes:
"""
Read shards and reconstruct original data.
Can reconstruct from any K (10) of N (16) shards.
"""
# Read available shards
available = []
for shard in shards:
try:
data = await self._read_shard(shard)
if self._verify_checksum(data, shard.checksum):
available.append((shard.shard_id, data))
except Exception:
continue # Shard unavailable
# Need at least K shards
if len(available) < self.config.data_shards:
raise DurabilityError(
f"Only {len(available)} shards available, "
f"need {self.config.data_shards}"
)
# Reconstruct using Reed-Solomon
return self._reconstruct(available)
def _select_nodes(self, count: int) -> List[Tuple[str, str]]:
"""
Select storage nodes ensuring AZ diversity.
Critical: No more than ceil(count/3) shards per AZ
This ensures we can lose an entire AZ and still reconstruct.
"""
placements = []
az_counts = {}
max_per_az = (count + 2) // 3 # Ceiling division
for node in self.storage_nodes:
az = node.availability_zone
if az_counts.get(az, 0) < max_per_az:
placements.append((node.id, az))
az_counts[az] = az_counts.get(az, 0) + 1
if len(placements) == count:
break
return placements
async def _verify_reconstruction(
self,
shards: List[Shard],
expected_checksum: str
) -> bool:
"""
Bracketing: Verify data can be reconstructed.
This is critical - we don't confirm write success
until we've proven the data is recoverable.
"""
try:
reconstructed = await self.read_and_decode(shards)
actual_checksum = self._compute_checksum(reconstructed)
return actual_checksum == expected_checksum
except Exception:
return False
def _compute_checksum(self, data: bytes) -> str:
return hashlib.sha256(data).hexdigest()
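To see why 10+6 coding earns so many nines, here is a small combinatorial estimate. The per-shard loss probability is an illustrative assumption; a real durability model also has to account for repair speed and correlated failures.

# analysis/durability_math.py (independent-failure model, illustrative only)
from math import comb


def loss_probability(n: int, k: int, p_shard: float) -> float:
    """Probability that more than n-k of n shards are lost (object unrecoverable),
    assuming each shard fails independently with probability p_shard."""
    max_tolerable = n - k
    return sum(
        comb(n, i) * p_shard**i * (1 - p_shard)**(n - i)
        for i in range(max_tolerable + 1, n + 1)
    )


# 16 shards, any 10 reconstruct the object; assume a 1% chance that a given
# shard is lost before the repair service replaces it (assumed, not an AWS figure).
p_loss = loss_probability(n=16, k=10, p_shard=0.01)
print(f"P(object unrecoverable) ~ {p_loss:.1e}")   # ~1e-10 under these assumptions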
Deep Dive 2: Strong Consistency (Week 5 Concepts)
Interviewer: "S3 moved from eventual to strong consistency in 2020. How?"
The Problem
EVENTUAL CONSISTENCY ISSUES (pre-2020)

  Scenario 1: Read-after-write
    PUT /bucket/key        -> 200 OK
    GET /bucket/key        -> 404 Not Found (stale!)

  Scenario 2: Read-after-update
    PUT /bucket/key (v1)   -> 200 OK
    PUT /bucket/key (v2)   -> 200 OK
    GET /bucket/key        -> returns v1 (stale!)

  Why it happened:
  - Metadata was replicated asynchronously
  - Different nodes could serve different versions
  - LIST operations were even worse

  Impact:
  - Data lakes: missing files in queries
  - Analytics: incorrect results
  - Workarounds required: sleeps, external consistency databases
The Solution: Strong Consistency
STRONG CONSISTENCY ARCHITECTURE

  KEY INSIGHT: synchronous metadata replication

  PUT FLOW (with strong consistency)
    1. Client sends PUT
    2. Write the data to the storage layer
    3. Update metadata SYNCHRONOUSLY
       - Write to the primary metadata node
       - Replicate to all metadata replicas
       - Wait for acknowledgment from a majority (write quorum)
    4. Return 200 OK only after the metadata write is durable on a quorum

  GET FLOW (after strong consistency)
    1. Reads consult a read quorum (or the leader) for the key's partition
       - Every read quorum overlaps every write quorum (R + W > N)
       - Every acknowledged write is visible: no more stale reads

  TRADE-OFFS
    Before: higher throughput, eventual consistency
    After:  the same throughput(!), strong consistency

  AWS achieved this through:
  - An optimized network between AZs
  - Pipelining metadata updates
  - Reducing partition risk to near zero
Implementation
# metadata/consistency.py
"""
Strong Consistency for S3 Metadata
Ensures read-after-write consistency for all operations.
"""
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime
import asyncio
import time
import uuid


class ConsistencyError(Exception):
    """Raised when a read or write quorum cannot be reached."""
@dataclass
class MetadataEntry:
"""Object metadata entry."""
bucket: str
key: str
version_id: str
etag: str
size: int
storage_class: str
last_modified: datetime
shard_locations: List[str]
class StrongConsistencyMetadataStore:
"""
Metadata store with strong read-after-write consistency.
Key mechanism: Synchronous replication to all replicas
before acknowledging write success.
Applies concepts:
- Week 5: Strong vs eventual consistency
- Week 5: Consensus protocols
"""
def __init__(self, replicas: List, quorum_size: int):
self.replicas = replicas
self.quorum_size = quorum_size # Majority
async def put_metadata(
self,
entry: MetadataEntry
) -> bool:
"""
Store metadata with strong consistency.
Only returns success after majority of replicas
have durably stored the metadata.
"""
# Generate monotonic version
entry.version_id = self._generate_version_id()
# Write to all replicas in parallel
write_tasks = [
self._write_to_replica(replica, entry)
for replica in self.replicas
]
# Wait for quorum (majority)
results = await asyncio.gather(
*write_tasks,
return_exceptions=True
)
successes = sum(1 for r in results if r is True)
if successes >= self.quorum_size:
# Quorum achieved - write is durable and consistent
return True
else:
# Rollback on failure
await self._rollback(entry)
return False
    async def get_metadata(
        self,
        bucket: str,
        key: str
    ) -> Optional[MetadataEntry]:
        """
        Read metadata - guaranteed to return the latest acknowledged version.

        Writes are acknowledged by a majority (W), so reading a majority (R)
        gives R + W > N: at least one replica in any read quorum holds the
        latest version. Return the entry with the highest version ID.
        """
        read_tasks = [
            self._read_from_replica(replica, bucket, key)
            for replica in self.replicas
        ]
        results = await asyncio.gather(*read_tasks, return_exceptions=True)

        responses = sum(1 for r in results if not isinstance(r, Exception))
        if responses < self.quorum_size:
            raise ConsistencyError("Could not reach a read quorum")

        entries = [r for r in results if isinstance(r, MetadataEntry)]
        if not entries:
            return None  # a quorum agrees the key does not exist
        return max(entries, key=lambda e: e.version_id)
async def list_objects(
self,
bucket: str,
prefix: str
) -> List[MetadataEntry]:
"""
List objects - strongly consistent.
After a PUT, the object is immediately visible in LIST.
"""
replica = self._select_nearest_replica()
# Read from partition responsible for this prefix
partition = self._get_partition(bucket, prefix)
return await self._list_from_partition(replica, partition, prefix)
    def _generate_version_id(self) -> str:
        """Generate a monotonically increasing, lexicographically ordered version ID."""
        # A zero-padded nanosecond timestamp makes string order match time order;
        # the random suffix breaks ties between concurrent writers.
        return f"{time.time_ns():020d}-{uuid.uuid4().hex[:8]}"
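The read-after-write argument above reduces to quorum intersection: with N replicas, a write quorum of W and a read quorum of R must overlap whenever R + W > N. This tiny check enumerates every pair of quorums for N=5, W=R=3 and confirms the overlap.

# analysis/quorum_overlap.py (why R + W > N guarantees read-after-write)
from itertools import combinations

N, W, R = 5, 3, 3
replicas = range(N)

overlaps = all(
    set(write_set) & set(read_set)
    for write_set in combinations(replicas, W)
    for read_set in combinations(replicas, R)
)
print(f"R + W > N: {R + W > N}, every read quorum sees the write: {overlaps}")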
Deep Dive 3: Index Partitioning (Week 1 Concepts)
Interviewer: "How do you index 500 trillion objects?"
The Problem
INDEXING 500 TRILLION OBJECTS
Challenge:
- Cannot fit in single index (way too large)
- Cannot scan linearly (too slow)
- Must support prefix queries
- Must handle hot prefixes
Requirements:
- O(1) lookups by key
- O(n) prefix scans (where n = matching keys)
- Automatic scaling
- No hotspots
The Solution: Key Partitioning
INDEX PARTITIONING ARCHITECTURE

  OBJECT KEY: s3://my-bucket/2024/logs/server1/app.log

  PARTITIONING STRATEGY
    1. Hash the bucket name -> bucket partition
       hash("my-bucket") -> partition 42
    2. Within the bucket, partition by key prefix
       "2024/logs" -> sub-partition 7
    3. Keys are sorted within each partition for efficient prefix scans

  PARTITION STRUCTURE
    BUCKET PARTITION 42
      Sub-partition 1 (prefix "")      key1 -> location, key2 -> location, ...
      Sub-partition 2 (prefix "a")     a/b -> location,  a/c -> location,  ...
      Sub-partition 7 (prefix "2024")  2024/logs/...,    2024/data/...,    ...

  DYNAMIC PARTITIONING
  - S3 monitors request rates per partition
  - A partition that exceeds ~5,500 GET/sec is split
  - Splits happen automatically and transparently
  - There is no limit on the number of partitions

  HOT PREFIX HANDLING
    Problem:  every key starts with "2024/" -> one hot partition
    Solution: add a randomized prefix

    Instead of:  2024/log-001.txt
    Use:         a7b3/2024/log-001.txt

    This spreads load across many partitions.
Implementation
# index/partition_manager.py
"""
Index Partitioning for S3
Dynamically partitions index to handle 500+ trillion objects.
"""
from dataclasses import dataclass
from typing import List, Optional, Dict
import hashlib
@dataclass
class Partition:
"""An index partition."""
partition_id: str
prefix_range: tuple # (start_prefix, end_prefix)
request_rate: float
node_assignment: str
class PartitionManager:
"""
Manages index partitions for S3.
Key features:
- Dynamic splitting based on load
- Prefix-based range partitioning
- Automatic rebalancing
Applies concepts:
- Week 1: Hash vs range partitioning
- Week 1: Hot key handling
"""
# Thresholds for partition operations
SPLIT_THRESHOLD_RPS = 5500 # GET requests/second
MIN_PARTITION_SIZE = 1000 # Objects
def __init__(self, partition_store):
self.partition_store = partition_store
def get_partition_for_key(
self,
bucket: str,
key: str
) -> Partition:
"""
Find the partition responsible for a key.
Uses hierarchical lookup:
1. Hash bucket to bucket partition
2. Binary search within bucket for key prefix
"""
# Level 1: Bucket partition
bucket_hash = self._hash_bucket(bucket)
bucket_partition = self._get_bucket_partition(bucket_hash)
# Level 2: Key prefix partition
key_partition = self._find_key_partition(bucket_partition, key)
return key_partition
async def check_and_split(
self,
partition: Partition
) -> List[Partition]:
"""
Check if partition needs splitting due to high load.
Triggered when request rate exceeds threshold.
"""
if partition.request_rate < self.SPLIT_THRESHOLD_RPS:
return [partition] # No split needed
# Split partition in half by prefix
start, end = partition.prefix_range
mid = self._find_split_point(partition)
new_partitions = [
Partition(
partition_id=f"{partition.partition_id}-a",
prefix_range=(start, mid),
request_rate=partition.request_rate / 2,
node_assignment=self._select_node()
),
Partition(
partition_id=f"{partition.partition_id}-b",
prefix_range=(mid, end),
request_rate=partition.request_rate / 2,
node_assignment=self._select_node()
)
]
# Atomic partition update
await self.partition_store.atomic_split(partition, new_partitions)
return new_partitions
def _hash_bucket(self, bucket: str) -> int:
"""Hash bucket name to partition number."""
return int(hashlib.md5(bucket.encode()).hexdigest(), 16) % 10000
def _find_split_point(self, partition: Partition) -> str:
"""
Find optimal split point for partition.
Analyzes key distribution to split evenly.
"""
# Sample keys in partition
keys = self.partition_store.sample_keys(partition, n=1000)
# Find median prefix
keys.sort()
median_key = keys[len(keys) // 2]
# Extract common prefix
return self._extract_prefix(median_key)
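The hot-prefix advice is purely a client-side naming convention. A minimal sketch: derive the random-looking prefix from a hash of the key itself so it stays deterministic and the object can still be located later. This helper is an assumption for illustration, not part of any S3 API.

# client/key_naming.py (client-side hot-prefix mitigation, sketch)
import hashlib


def spread_key(natural_key: str) -> str:
    """Prepend a short, deterministic hash prefix so sequential keys
    (e.g. '2024/log-001.txt', '2024/log-002.txt') spread across partitions."""
    prefix = hashlib.sha256(natural_key.encode()).hexdigest()[:3]   # 4096 buckets
    return f"{prefix}/{natural_key}"


print(spread_key("2024/log-001.txt"))   # e.g. '<3 hex chars>/2024/log-001.txt'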
Deep Dive 4: Storage Classes and Lifecycle (Week 4 Concepts)
Interviewer: "How do storage classes work for cost optimization?"
Storage Classes
S3 STORAGE CLASSES

  Storage Class            Durability       Availability     Use Case
  ----------------------   --------------   --------------   -----------------------
  S3 Standard              99.999999999%    99.99%           Frequent access
  S3 Express One Zone      99.999999999%    99.95% (1 AZ)    Single-digit ms latency
  S3 Standard-IA           99.999999999%    99.9%            Infrequent access
  S3 One Zone-IA           99.999999999%*   99.5% (1 AZ)     Recreatable data
  S3 Intelligent-Tiering   99.999999999%    99.9%            Unknown access patterns
  S3 Glacier Instant       99.999999999%    99.99%           Archive (ms access)
  S3 Glacier Flexible      99.999999999%    99.99%           Archive (minutes-hours)
  S3 Glacier Deep Archive  99.999999999%    99.99%           Archive (12+ hours)

  * within a single AZ

  COST COMPARISON (per GB/month, us-east-1, 2025)
    S3 Standard:         $0.023
    S3 Standard-IA:      $0.0125
    S3 Glacier Instant:  $0.004
    S3 Glacier Deep:     $0.00099

  Glacier Deep Archive is roughly 23x cheaper than Standard.
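Using the per-GB prices listed above, a quick comparison for a hypothetical 100 TB dataset (retrieval fees, request charges, and minimum storage durations are ignored here):

# costs/storage_classes.py (monthly cost sketch using the prices listed above)
PRICES_PER_GB_MONTH = {
    "S3 Standard":        0.023,
    "S3 Standard-IA":     0.0125,
    "S3 Glacier Instant": 0.004,
    "S3 Glacier Deep":    0.00099,
}

DATASET_GB = 100 * 1024   # 100 TB in GB

for storage_class, price in PRICES_PER_GB_MONTH.items():
    print(f"{storage_class:<20} ${DATASET_GB * price:>9,.2f}/month")

# S3 Standard ~ $2,355/month vs Glacier Deep ~ $101/month for the same 100 TB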
Lifecycle Policies
# lifecycle/policy_engine.py
"""
S3 Lifecycle Policy Engine
Automatically transitions objects between storage classes
based on age and access patterns.
"""
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, timedelta
from enum import Enum
class StorageClass(Enum):
STANDARD = "STANDARD"
STANDARD_IA = "STANDARD_IA"
INTELLIGENT_TIERING = "INTELLIGENT_TIERING"
GLACIER_INSTANT = "GLACIER_IR"
GLACIER_FLEXIBLE = "GLACIER"
GLACIER_DEEP = "DEEP_ARCHIVE"
@dataclass
class LifecycleRule:
"""A lifecycle transition rule."""
rule_id: str
prefix_filter: Optional[str]
tag_filter: Optional[dict]
transitions: List[dict] # [{days: 30, storage_class: "STANDARD_IA"}]
expiration_days: Optional[int]
class LifecyclePolicyEngine:
"""
Evaluates and applies lifecycle policies to objects.
Common pattern:
1. Standard for 30 days (frequent access)
2. Standard-IA for 60 days (occasional access)
3. Glacier for 1 year (rare access)
4. Delete after 7 years (compliance)
Applies concepts:
- Week 4: Cache tiering strategies
- Week 8: Data lifecycle management
"""
def __init__(self, metadata_store, storage_service):
self.metadata = metadata_store
self.storage = storage_service
async def evaluate_object(
self,
bucket: str,
key: str,
rules: List[LifecycleRule]
) -> Optional[StorageClass]:
"""
Determine if object should transition.
"""
obj_metadata = await self.metadata.get(bucket, key)
for rule in rules:
if not self._matches_filter(obj_metadata, rule):
continue
age_days = (datetime.utcnow() - obj_metadata.last_modified).days
# Check expiration
if rule.expiration_days and age_days >= rule.expiration_days:
return "DELETE"
# Check transitions (in reverse order - most aggressive first)
for transition in sorted(
rule.transitions,
key=lambda t: t['days'],
reverse=True
):
if age_days >= transition['days']:
target_class = StorageClass(transition['storage_class'])
if target_class != obj_metadata.storage_class:
return target_class
return None # No transition needed
async def apply_transition(
self,
bucket: str,
key: str,
target_class: StorageClass
) -> bool:
"""
Transition object to new storage class.
For Glacier classes, this involves:
1. Reading object from current storage
2. Re-encoding for archive (more aggressive erasure coding)
3. Writing to archive storage
4. Updating metadata
"""
if target_class == "DELETE":
return await self._delete_object(bucket, key)
# Get current object location
current_shards = await self.metadata.get_shard_locations(bucket, key)
# Transition to new storage class
new_shards = await self.storage.transition(
current_shards,
target_class
)
# Update metadata atomically
await self.metadata.update_storage_class(
bucket, key, target_class, new_shards
)
return True
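As a usage sketch, the "common pattern" from the docstring maps onto a single rule like the one below; the commented evaluation call assumes the hypothetical metadata and storage dependencies wired in above.

# Example rule for the common pattern: Standard for 30 days, Standard-IA next,
# Glacier after ~90 days, delete after 7 years (sketch)
log_retention_rule = LifecycleRule(
    rule_id="logs-retention",
    prefix_filter="logs/",
    tag_filter=None,
    transitions=[
        {"days": 30, "storage_class": "STANDARD_IA"},
        {"days": 90, "storage_class": "GLACIER"},
    ],
    expiration_days=7 * 365,
)

# engine = LifecyclePolicyEngine(metadata_store, storage_service)  # hypothetical deps
# action = await engine.evaluate_object("my-bucket", "logs/2024/app.log",
#                                        [log_retention_rule])
# if action is not None:
#     await engine.apply_transition("my-bucket", "logs/2024/app.log", action)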
Phase 5: Scaling and Edge Cases
Scaling Strategies
SCALING STRATEGIES

  AUTOMATIC PARTITION SCALING
  - S3 monitors request rates per prefix
  - Hot partitions are split automatically
  - No manual intervention needed
  - Transparent to applications

  WORKLOAD DECORRELATION
  At S3's scale, individual workload spikes average out:
  - Millions of customers
  - Trillions of objects
  - Most data is idle most of the time
  - Hot spots become statistically rare

  BEST PRACTICES FOR CLIENTS
  1. Use randomized prefixes for high-throughput workloads
     Bad:  logs/2024-01-01/event-001.json
     Good: a7b3c1/logs/2024-01-01/event-001.json

  2. Implement exponential backoff for 503 errors (see the sketch below)
     - Start with a 100 ms delay
     - Double the delay on each retry
     - Cap at around 5 retries

  3. Use multipart upload for large objects
     - Upload parts in parallel
     - Retry individual parts on failure
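Best practice 2 in the list above is worth showing in code. This is a minimal sketch: do_request stands in for whatever issues the S3 call, SlowDownError is a hypothetical wrapper for a 503 response, and jitter is added because clients retrying in lockstep simply recreate the spike.

# client/backoff.py (exponential backoff with jitter for 503s, sketch)
import random
import time


class SlowDownError(Exception):
    """Hypothetical wrapper for an S3 503 Slow Down response."""


def with_backoff(do_request, max_retries: int = 5, base_delay: float = 0.1):
    """Call do_request(), retrying on 503 with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return do_request()
        except SlowDownError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)           # 100ms, 200ms, 400ms, ...
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter to decorrelate clients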
Edge Cases
EDGE CASE 1: Entire AZ Failure
Scenario: Entire availability zone loses power
Impact: ~33% of shards unavailable
Solution:
- Erasure coding allows reconstruction from 10 of 16 shards
- Even with 6 shards in failed AZ, data is recoverable
- Repair service immediately starts re-replicating
- No data loss, brief availability impact
EDGE CASE 2: Correlated Disk Failures
Scenario: Heat wave causes accelerated disk failures
Impact: Higher than normal failure rate
Solution:
- Continuous health monitoring
- Proactive migration from stressed disks
- Spare capacity on all disks for fast recovery
- Parallel re-replication across thousands of drives
EDGE CASE 3: Hot Object
Scenario: Single object gets millions of requests
Impact: Could overwhelm storage nodes
Solution:
- Caching at front end layer
- Object replicated to more nodes automatically
- CloudFront integration for CDN caching
- Request throttling with 503 errors + backoff
EDGE CASE 4: Concurrent Writes to Same Key
Scenario: Two clients PUT same key simultaneously
Impact: Which version wins?
Solution:
- Last-writer-wins semantics
- Timestamps determine winner
- Versioning available to keep all versions
- Conditional writes (If-Match) for optimistic locking
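Edge case 4 mentions conditional writes; the read-modify-write loop below sketches the optimistic-locking pattern built on ETags and If-Match semantics. The client object, its method names, and the if_match keyword are hypothetical stand-ins - consult your SDK's documentation for the exact parameters S3 conditional requests use.

# client/optimistic_write.py (optimistic locking with ETags, hypothetical client API)
class PreconditionFailed(Exception):
    """Raised when the server rejects a write because the ETag no longer matches."""


def update_object(client, bucket: str, key: str, transform) -> None:
    """Read-modify-write loop that only succeeds if nobody else changed the object."""
    while True:
        current = client.get_object(bucket, key)              # hypothetical call
        new_body = transform(current["body"])
        try:
            # If-Match semantics: the write succeeds only if the ETag is unchanged
            client.put_object(bucket, key, new_body,
                              if_match=current["etag"])       # hypothetical kwarg
            return
        except PreconditionFailed:
            continue   # someone else won the race; re-read and retry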
Phase 6: Monitoring and Operations
S3 MONITORING DASHBOARD

  DURABILITY HEALTH
    Shard integrity:           100%
    Objects with all shards:   99.9999999%
    Repair queue depth:        12,345
    Repair throughput:         1.2M objects/hour

  AVAILABILITY
    Request success rate:      99.99%
    503 rate:                  0.001%
    GET latency (p50):         45 ms
    PUT latency (p50):         85 ms

  STORAGE
    Total objects:             512T
    Total storage:             287 EB
    Daily growth:              +2.1 EB
    Disk utilization:          67%

  REQUEST PATTERNS
    GET requests/sec:          187M
    PUT requests/sec:          23M
    LIST requests/sec:         8M
    Hot partitions:            142
Alerting Strategy
CRITICAL (PagerDuty):
- Durability below 99.9999999%
- Any data loss detected
- Region-wide availability < 99.9%
- Repair queue > 1M objects
WARNING (Slack):
- 503 rate > 0.01%
- Disk utilization > 80%
- Hot partition count > 500
- Latency p99 > 500ms
INFO (Dashboard):
- Daily storage growth
- Traffic patterns
- Storage class distribution
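Those thresholds translate directly into simple rules. Below is a sketch of an evaluator over the metrics listed above; the function and metric names are assumptions, while the numeric thresholds are the ones in the alerting strategy.

# ops/alerts.py (threshold classification sketch for the rules above)
def classify(m: dict) -> str:
    """Return CRITICAL / WARNING / INFO for a snapshot of the listed metrics."""
    if (m["durability"] < 99.9999999 or m["data_loss_detected"]
            or m["region_availability"] < 99.9 or m["repair_queue"] > 1_000_000):
        return "CRITICAL"   # page the on-call
    if (m["rate_503_pct"] > 0.01 or m["disk_utilization_pct"] > 80
            or m["hot_partitions"] > 500 or m["latency_p99_ms"] > 500):
        return "WARNING"    # post to the team channel
    return "INFO"           # dashboards only


print(classify({
    "durability": 99.999999999, "data_loss_detected": False,
    "region_availability": 99.99, "repair_queue": 12_345,
    "rate_503_pct": 0.001, "disk_utilization_pct": 67,
    "hot_partitions": 142, "latency_p99_ms": 310,
}))   # -> INFO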
Interview Conclusion
Interviewer: "Excellent work. You've covered durability, consistency, and scaling really well. Any questions?"
You: "Thank you! I'm curious about the 2020 strong consistency change. How did AWS achieve it without sacrificing performance?"
Interviewer: "Great question. They invested heavily in their inter-AZ network to make synchronous replication nearly as fast as async. Combined with optimizations to their metadata layer, they essentially eliminated the trade-off."
Summary: Concepts Applied
CONCEPTS FROM THE 10-WEEK COURSE IN THE S3 DESIGN

  WEEK 1: DATA AT SCALE
  - Partitioning: key-based index partitioning
  - Replication: multi-AZ data distribution
  - Hot keys: automatic partition splitting

  WEEK 2: FAILURE-FIRST DESIGN
  - Erasure coding: survive up to 6 shard failures
  - Checksums: detect and repair corruption
  - Bracketing: verify before confirming a write

  WEEK 4: CACHING
  - Storage tiers: hot, warm, cold, frozen
  - Lifecycle: automatic tier transitions

  WEEK 5: CONSISTENCY
  - Strong consistency: read-after-write
  - Synchronous replication: a quorum of replicas before the ACK
  - Conflict resolution: last-writer-wins

  WEEK 10: PRODUCTION READINESS
  - 11 nines durability target
  - Continuous repair monitoring
  - Automatic scaling without intervention
WHY S3 IS AN ENGINEERING MARVEL

  DURABILITY
  - 11 nines (99.999999999%)
  - On average, at most 1 object lost per 10 million every 10,000 years
  - Achieved with only 1.6x storage overhead

  SCALE
  - 500+ trillion objects
  - Hundreds of exabytes
  - 200+ million requests/second

  CONSISTENCY
  - Strong read-after-write since 2020
  - No stale reads
  - No measurable performance penalty

  INNOVATION
  - The first cloud object storage service (2006)
  - Strong consistency at scale without the usual performance trade-off
  - 9 storage classes covering nearly every use case
  - The storage foundation of the AWS ecosystem
Sources
Official AWS:
- AWS S3 Documentation: https://docs.aws.amazon.com/s3/
- S3 Consistency: https://aws.amazon.com/s3/consistency/
- S3 Storage Classes: https://aws.amazon.com/s3/storage-classes/
- S3 Performance Guidelines: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
Architecture Deep Dives:
- ByteByteGo - How S3 Stores 350 Trillion Objects: https://blog.bytebytego.com/p/how-amazon-s3-stores-350-trillion
- High Scalability - Behind AWS S3's Massive Scale: https://highscalability.com/behind-aws-s3s-massive-scale/
- System Design One - S3 Durability: https://newsletter.systemdesign.one/p/amazon-s3-durability
Strong Consistency:
- AWS Blog - S3 Strong Consistency: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
- InfoQ - S3 Strong Consistency: https://www.infoq.com/news/2020/12/aws-s3-strong-consistency/
Recent Updates:
- Blocks and Files - S3 Updates 2025: https://blocksandfiles.com/2025/12/03/aws-s3/
Self-Assessment Checklist
After studying this case study, you should be able to:
- Explain erasure coding and why it beats replication on durability per unit of storage overhead
- Calculate durability math (11 nines)
- Design an index partitioning scheme for trillions of objects
- Explain how S3 achieved strong consistency without sacrificing performance
- Describe bracketing and why it's critical for durability
- Design lifecycle policies for cost optimization
- Handle hot prefixes through randomization
- Implement exponential backoff for 503 errors
- Explain the trade-offs between different storage classes
- Design for correlated failures (AZ failure, disk batch failure)
- Monitor durability and availability at scale
- Articulate why S3 is considered the "greatest cloud service"
This case study covers AWS S3, the world's most durable object storage system, demonstrating how to achieve 11 nines of durability with strong consistency at planetary scale.