Bonus Problem 8: AWS S3
The World's Most Durable Storage System: 11 Nines That Changed Everything
How Do You Promise Never to Lose a Single File Out of 500 Trillion?
Imagine this challenge: you're storing 500 trillion objects, hundreds of exabytes of data, for millions of customers worldwide. Hard drives fail constantly. Data centers lose power. Natural disasters strike. And yet you promise to lose, on average, no more than 1 object in 10 million... every 10,000 years.
This is Amazon S3, the storage system that delivers both strong consistency and 99.999999999% durability at planetary scale. It's not just cloud storage. It's the engineering benchmark that other storage services are measured against.
THE AWS S3 SCALE (2025)

  STORAGE
    Objects stored:            500+ trillion
    Data stored:               hundreds of exabytes
    Single object max size:    5 TB (up to 50 TB in 2025)

  REQUESTS
    Requests per second:       200+ million (peak)
    GET requests per prefix:   5,500/second
    PUT requests per prefix:   3,500/second

  DURABILITY
    Durability:                99.999999999% (11 nines)
    Availability (Standard):   99.99%
    Availability Zones (min):  3 per region

  ARCHITECTURE
    Microservices:             350+
    AWS Regions:               30+
    Storage classes:           9 tiers

  HISTORY
    Launched:                  March 14, 2006
    First AWS service:         Yes (alongside SQS)
    Years of operation:        19
This is the system we'll design today - and along the way we'll see how AWS achieved what long seemed impossible: strong consistency at 11 nines of durability.
The Interview Begins
You walk into the interview room. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to design a distributed object storage system like S3. I'm interested in how you think about durability, consistency, and scale. Please think out loudβthis is collaborative."
They write on the whiteboard:
  Design a Distributed Object Storage System

  Requirements:
  - Store objects up to 5 TB each
  - 11 nines (99.999999999%) durability
  - Strong read-after-write consistency
  - Handle millions of requests per second
  - Multiple storage tiers for cost optimization
  - Global availability across multiple regions
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions.
Your Questions
You: "Before I start designing, let me clarify some requirements. What's the target scaleβhow many objects and how much data?"
Interviewer: "Think planetary scale. Hundreds of trillions of objects, exabytes of data."
You: "For durability, you mentioned 11 nines. What does that mean in practice?"
Interviewer: "If you store 10 million objects, you might lose 1 object every 10,000 years. Durability is non-negotiableβdata loss is catastrophic."
You: "What about consistency? Traditional distributed systems trade consistency for availability."
Interviewer: "We need strong read-after-write consistency. After a write succeeds, any subsequent read must return the latest data. No stale reads."
You: "What are the latency requirements for reads and writes?"
Interviewer: "First byte latency around 100-200ms for small objects. Large objects can be slower but must support parallel downloads."
You: "Perfect. Let me summarize the requirements."
Functional Requirements
1. OBJECT OPERATIONS
- PUT: Upload objects up to 5 TB
- GET: Download objects (full or byte-range)
- DELETE: Remove objects
- LIST: Enumerate objects with prefix filtering
- HEAD: Get object metadata without content
- Multipart upload for large objects
2. ORGANIZATION
- Buckets: Containers for objects
- Keys: Unique identifiers within buckets
- Prefixes: Logical grouping (like folders)
- Metadata: Custom key-value pairs per object
3. DATA MANAGEMENT
- Versioning: Keep all versions of objects
- Lifecycle policies: Auto-transition storage tiers
- Replication: Cross-region copying
- Encryption: At rest and in transit
4. ACCESS CONTROL
- IAM policies: User-level permissions
- Bucket policies: Resource-level permissions
- ACLs: Fine-grained object access
- Presigned URLs: Temporary access
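Taken together, these operations imply a compact API surface. Below is a minimal sketch of that interface in Python; the ObjectStore protocol and its method names are illustrative assumptions, not the actual S3 SDK.

# api/object_store.py (illustrative interface sketch, not the real S3 SDK)
from typing import Iterator, Optional, Protocol


class ObjectStore(Protocol):
    """Minimal object-store surface implied by the functional requirements."""

    def put_object(self, bucket: str, key: str, data: bytes,
                   metadata: Optional[dict] = None) -> str:
        """Upload an object (large objects go through multipart upload); returns an ETag."""
        ...

    def get_object(self, bucket: str, key: str,
                   byte_range: Optional[tuple] = None) -> bytes:
        """Download an object, or just a byte range of it."""
        ...

    def head_object(self, bucket: str, key: str) -> dict:
        """Return object metadata without downloading the content."""
        ...

    def delete_object(self, bucket: str, key: str) -> None:
        """Remove an object (or add a delete marker when versioning is enabled)."""
        ...

    def list_objects(self, bucket: str, prefix: str = "") -> Iterator[str]:
        """Enumerate keys under a prefix, in lexicographic order."""
        ...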
Non-Functional Requirements
1. DURABILITY
- 99.999999999% (11 nines)
- 1 object lost per 10 million over 10,000 years
- Survive AZ failures, region failures
2. AVAILABILITY
- 99.99% for Standard tier
- 99.9% for infrequent access tiers
- Graceful degradation under load
3. CONSISTENCY
- Strong read-after-write for all operations
- Strong consistency for LIST operations
- No stale reads after successful writes
4. PERFORMANCE
- First byte latency: 100-200ms
- 3,500 PUT/second per prefix
- 5,500 GET/second per prefix
- Virtually unlimited aggregate throughput
Phase 2: Back of the Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale."
Storage Estimation
OBJECT STORAGE

  Scale:
    Objects stored:           500 trillion
    Average object size:      ~500 KB (highly variable)
    Total logical data:       ~250 exabytes

  Storage overhead:
    Erasure coding overhead:  ~1.6x (vs 3x for replication)
    Metadata per object:      ~1 KB
    Metadata storage:         500 trillion × 1 KB = 500 PB

  With durability overhead:
    Raw storage needed:       250 EB × 1.6 = 400 EB
    Across 3+ AZs:            distributed globally
Request Estimation
REQUEST VOLUME

  Peak requests:
    Total requests/second:    200+ million

  Per-prefix limits:
    GET/HEAD per prefix:      5,500/second
    PUT/DELETE per prefix:    3,500/second

  Daily requests:
    200M × 86,400 seconds ≈ 17 trillion requests/day
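These estimates are easy to sanity-check in a few lines; the constants below are simply the figures from this section, not measured values.

# estimation/back_of_envelope.py (sanity check of the figures above)
OBJECTS = 500e12                # 500 trillion objects
AVG_OBJECT_SIZE = 500e3         # ~500 KB average
EC_OVERHEAD = 1.6               # 10 data + 6 parity shards
METADATA_PER_OBJECT = 1e3       # ~1 KB of metadata per object

logical_bytes = OBJECTS * AVG_OBJECT_SIZE        # ~2.5e20 B  (~250 EB)
raw_bytes = logical_bytes * EC_OVERHEAD          # ~4.0e20 B  (~400 EB)
metadata_bytes = OBJECTS * METADATA_PER_OBJECT   # ~5.0e17 B  (~500 PB)

PEAK_RPS = 200e6
daily_requests = PEAK_RPS * 86_400               # ~1.7e13 (~17 trillion/day)

# 11 nines durability = annual per-object loss probability of about 1e-11
expected_annual_loss = OBJECTS * 1e-11           # ~5,000 objects/year (design bound)

print(f"logical ~{logical_bytes / 1e18:.0f} EB, raw ~{raw_bytes / 1e18:.0f} EB, "
      f"metadata ~{metadata_bytes / 1e15:.0f} PB")
print(f"requests ~{daily_requests / 1e12:.1f} trillion/day, "
      f"expected loss ~{expected_annual_loss:.0f} objects/year")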
Key Metrics Summary
ESTIMATION SUMMARY

  STORAGE
    Objects:                  500 trillion
    Raw data:                 ~250 exabytes
    With redundancy (1.6x):   ~400 exabytes

  REQUESTS
    Peak:                     200+ million/second
    Daily:                    ~17 trillion
    Per-prefix GET limit:     5,500/second

  INFRASTRUCTURE
    Availability Zones:       3+ per region
    Regions:                  30+
    Hard drives:              millions

  DURABILITY MATH
    Annual per-object loss rate:  ~0.000000001% (10^-11)
    Objects at risk:              ~1 in 100 billion per year
    Expected loss bound:          ~5,000 objects/year out of 500 trillion
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture."
System Architecture
S3 HIGH-LEVEL ARCHITECTURE

  CLIENTS
    AWS SDK  |  REST API  |  AWS CLI  |  Console
                        |
                        v
  +------------------------- AWS REGION --------------------------+
  |                                                                |
  |  FRONT END LAYER                                               |
  |    DNS -> Load Balancer -> Authentication -> Request Routing   |
  |                        |                                       |
  |                        v                                       |
  |  INDEX/METADATA LAYER                                          |
  |    Partition Engine | Global Metadata Store | Lifecycle Mgr    |
  |                        |                                       |
  |                        v                                       |
  |  STORAGE LAYER                                                 |
  |    Storage Nodes | Erasure Coding | Repair Service             |
  |                                                                |
  |    AZ-1 (disks)        AZ-2 (disks)        AZ-3 (disks)        |
  +----------------------------------------------------------------+
Component Breakdown
1. Front End Layer
Purpose: Handle all incoming requests, authentication, routing
Key components:
- DNS Service: Route requests to correct region/endpoint
- Load Balancers: Distribute traffic across API servers
- Authentication: Validate IAM credentials, bucket policies
- Request Router: Direct to correct storage nodes
2. Index/Metadata Layer
Purpose: Track object locations without storing data
Key components:
- Partition Engine: Distribute keys across index partitions
- Global Metadata Store: Map keys to physical storage locations
- Lifecycle Manager: Handle versioning, retention, expiration
3. Storage Layer
Purpose: Physical storage with durability guarantees
Key components:
- Storage Nodes: Thousands of servers with HDDs/SSDs
- Erasure Coding: Split data into shards with parity
- Repair Service: Detect and recover from failures
- Cross-AZ Distribution: Spread across availability zones
Data Flow: PUT Object
PUT OBJECT FLOW

1. CLIENT REQUEST
   PUT /bucket/key HTTP/1.1
   Content-Length: 1048576
   x-amz-content-sha256: <checksum>

2. FRONT END
   - DNS resolves to the regional endpoint
   - Load balancer routes to an API server
   - Authenticate the request (SigV4)
   - Validate the bucket policy

3. INDEX LAYER
   - Hash the key to determine its partition
   - Acquire a write lock on the key
   - Generate the storage location

4. STORAGE LAYER
   - Erasure code the data (e.g., 10 data + 6 parity shards)
   - Write shards to storage nodes across AZs
   - Calculate and store checksums
   - Bracketing: verify the data can be reconstructed

5. COMMIT
   - Update the metadata index
   - Release the write lock
   - Return 200 OK to the client

DURABILITY: the object survives as long as ANY 10 of its 16 shards survive.
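Condensed into code, the happy path looks roughly like the sketch below. The auth, index, and storage objects are hypothetical stand-ins for the three layers; none of this is actual S3 source.

# flows/put_object.py (hypothetical orchestration of the PUT path above)
import hashlib


async def put_object(request, auth, index, storage) -> dict:
    """Happy-path PUT: authenticate, lock, erasure-code, commit, acknowledge."""
    # 2. Front end: authenticate (SigV4) and authorize against the bucket policy
    principal = await auth.verify_sigv4(request)
    await auth.check_bucket_policy(principal, request.bucket, "s3:PutObject")

    # 3. Index layer: hash the key to a partition and take a write lock
    partition = index.partition_for(request.bucket, request.key)
    async with index.write_lock(partition, request.key):
        # 4. Storage layer: erasure-code across AZs and verify (bracketing)
        shards = await storage.encode_and_store(request.key, request.body)

        # 5. Commit: the metadata update is what makes the object visible
        etag = hashlib.md5(request.body).hexdigest()
        await index.commit(request.bucket, request.key, shards, etag)

    return {"status": 200, "etag": etag}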
Phase 4: Deep Dives (20 minutes)
Interviewer: "Let's dive into how you achieve 11 nines of durability."
Deep Dive 1: Erasure Coding for Durability (Week 1 Concepts)
The Problem
DURABILITY CHALLENGE
Hard drive failure rates:
- Annual failure rate: 1-5% per drive
- At S3 scale: Thousands of drives fail daily
- Must lose ZERO customer data
Simple replication (3x):
- Store 3 copies across AZs
- Problem: 3x storage overhead
- Problem: All 3 could fail (rare but possible)
Need: High durability with lower storage overhead
The Solution: Erasure Coding
ERASURE CODING ARCHITECTURE

  ORIGINAL DATA: a 1 MB object
        |
        v
  ERASURE ENCODE (Reed-Solomon 10+6)
    Split into 10 data shards, compute 6 parity shards
        |
        v
  16 SHARDS (~100 KB each)
    D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 | P1 P2 P3 P4 P5 P6
        |
        v
  DISTRIBUTE ACROSS AZs AND DRIVES
    AZ-1 (drives A-E): D1 D4 D7 D10 P1 P4   (6 shards)
    AZ-2 (drives F-J): D2 D5 D8 P2 P5       (5 shards)
    AZ-3 (drives K-P): D3 D6 D9 P3 P6       (5 shards)

  RECOVERY: reconstruct from ANY 10 of the 16 shards;
            up to 6 shards (or an entire AZ) can be lost and the object still recovers

  OVERHEAD: 1.6x storage (vs 3x for triple replication)
Implementation
# storage/erasure_coding.py
"""
Erasure Coding for 11 Nines Durability
Uses Reed-Solomon coding to achieve extreme durability
with only 1.6x storage overhead (vs 3x for replication).
"""
from dataclasses import dataclass
from typing import List, Tuple
import hashlib


class DurabilityError(Exception):
    """Raised when data cannot be durably stored or reconstructed."""
@dataclass
class Shard:
"""A single shard of erasure-coded data."""
shard_id: int
data: bytes
checksum: str
storage_node: str
availability_zone: str
@dataclass
class ErasureConfig:
"""Configuration for erasure coding."""
data_shards: int = 10 # K: minimum shards to reconstruct
parity_shards: int = 6 # M: redundant shards
# Total = 16 shards, can lose up to 6
class ErasureCodingService:
"""
Erasure coding service for S3 durability.
Uses Reed-Solomon (10, 6) coding:
- Split data into 10 data shards
- Compute 6 parity shards
- Can reconstruct from any 10 of 16 shards
Applies concepts:
- Week 1: Replication and partitioning
- Week 1: Data durability strategies
"""
def __init__(self, config: ErasureConfig, storage_nodes: List):
self.config = config
self.storage_nodes = storage_nodes
self.total_shards = config.data_shards + config.parity_shards
async def encode_and_store(
self,
object_key: str,
data: bytes
) -> List[Shard]:
"""
Encode data and distribute shards across AZs.
Steps:
1. Calculate checksum of original data
2. Split into data shards
3. Compute parity shards
        4. Select storage nodes (AZ-aware placement)
        5. Write shards to storage
        6. Verify (bracketing) before confirming
        """
"""
# Step 1: Original checksum
original_checksum = self._compute_checksum(data)
# Step 2: Split into data shards
data_shards = self._split_data(data, self.config.data_shards)
# Step 3: Compute parity shards
parity_shards = self._compute_parity(data_shards)
# Step 4: Select storage nodes (AZ-aware)
all_shards = data_shards + parity_shards
shard_placements = self._select_nodes(len(all_shards))
# Step 5: Write shards to storage
stored_shards = []
for i, (shard_data, (node, az)) in enumerate(
zip(all_shards, shard_placements)
):
checksum = self._compute_checksum(shard_data)
shard = Shard(
shard_id=i,
data=shard_data,
checksum=checksum,
storage_node=node,
availability_zone=az
)
await self._write_shard(shard)
stored_shards.append(shard)
# Step 6: BRACKETING - verify we can reconstruct
if not await self._verify_reconstruction(
stored_shards, original_checksum
):
raise DurabilityError("Bracketing verification failed")
return stored_shards
async def read_and_decode(
self,
shards: List[Shard]
) -> bytes:
"""
Read shards and reconstruct original data.
Can reconstruct from any K (10) of N (16) shards.
"""
# Read available shards
available = []
for shard in shards:
try:
data = await self._read_shard(shard)
if self._verify_checksum(data, shard.checksum):
available.append((shard.shard_id, data))
except Exception:
continue # Shard unavailable
# Need at least K shards
if len(available) < self.config.data_shards:
raise DurabilityError(
f"Only {len(available)} shards available, "
f"need {self.config.data_shards}"
)
# Reconstruct using Reed-Solomon
return self._reconstruct(available)
def _select_nodes(self, count: int) -> List[Tuple[str, str]]:
"""
Select storage nodes ensuring AZ diversity.
Critical: No more than ceil(count/3) shards per AZ
This ensures we can lose an entire AZ and still reconstruct.
"""
placements = []
az_counts = {}
max_per_az = (count + 2) // 3 # Ceiling division
for node in self.storage_nodes:
az = node.availability_zone
if az_counts.get(az, 0) < max_per_az:
placements.append((node.id, az))
az_counts[az] = az_counts.get(az, 0) + 1
if len(placements) == count:
break
return placements
async def _verify_reconstruction(
self,
shards: List[Shard],
expected_checksum: str
) -> bool:
"""
Bracketing: Verify data can be reconstructed.
This is critical - we don't confirm write success
until we've proven the data is recoverable.
"""
try:
reconstructed = await self.read_and_decode(shards)
actual_checksum = self._compute_checksum(reconstructed)
return actual_checksum == expected_checksum
except Exception:
return False
def _compute_checksum(self, data: bytes) -> str:
return hashlib.sha256(data).hexdigest()
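To see why 10+6 coding earns so many nines, here is a small combinatorial estimate. The per-shard loss probability is an illustrative assumption; a real durability model also has to account for repair speed and correlated failures.

# analysis/durability_math.py (independent-failure model, illustrative only)
from math import comb


def loss_probability(n: int, k: int, p_shard: float) -> float:
    """Probability that more than n-k of n shards are lost (object unrecoverable),
    assuming each shard fails independently with probability p_shard."""
    max_tolerable = n - k
    return sum(
        comb(n, i) * p_shard**i * (1 - p_shard)**(n - i)
        for i in range(max_tolerable + 1, n + 1)
    )


# 16 shards, any 10 reconstruct the object; assume a 1% chance that a given
# shard is lost before the repair service replaces it (assumed, not an AWS figure).
p_loss = loss_probability(n=16, k=10, p_shard=0.01)
print(f"P(object unrecoverable) ~ {p_loss:.1e}")   # ~1e-10 under these assumptions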
Deep Dive 2: Strong Consistency (Week 5 Concepts)
Interviewer: "S3 moved from eventual to strong consistency in 2020. How?"
The Problem
EVENTUAL CONSISTENCY ISSUES (pre-2020)

  Scenario 1: Read-after-write
    PUT /bucket/key        -> 200 OK
    GET /bucket/key        -> 404 Not Found (stale!)

  Scenario 2: Read-after-update
    PUT /bucket/key (v1)   -> 200 OK
    PUT /bucket/key (v2)   -> 200 OK
    GET /bucket/key        -> returns v1 (stale!)

  Why it happened:
  - Metadata was replicated asynchronously
  - Different nodes could serve different versions
  - LIST operations were even worse

  Impact:
  - Data lakes: missing files in queries
  - Analytics: incorrect results
  - Workarounds required: sleeps, external consistency databases
The Solution: Strong Consistency
STRONG CONSISTENCY ARCHITECTURE

  KEY INSIGHT: synchronous metadata replication

  PUT FLOW (with strong consistency)
    1. Client sends PUT
    2. Write the data to the storage layer
    3. Update metadata SYNCHRONOUSLY
       - Write to the primary metadata node
       - Replicate to all metadata replicas
       - Wait for acknowledgment from a majority (write quorum)
    4. Return 200 OK only after the metadata write is durable on a quorum

  GET FLOW (after strong consistency)
    1. Reads consult a read quorum (or the leader) for the key's partition
       - Every read quorum overlaps every write quorum (R + W > N)
       - Every acknowledged write is visible: no more stale reads

  TRADE-OFFS
    Before: higher throughput, eventual consistency
    After:  the same throughput(!), strong consistency

  AWS achieved this through:
  - An optimized network between AZs
  - Pipelining metadata updates
  - Reducing partition risk to near zero
Implementation
# metadata/consistency.py
"""
Strong Consistency for S3 Metadata
Ensures read-after-write consistency for all operations.
"""
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime
import asyncio
import time
import uuid


class ConsistencyError(Exception):
    """Raised when a read or write quorum cannot be reached."""
@dataclass
class MetadataEntry:
"""Object metadata entry."""
bucket: str
key: str
version_id: str
etag: str
size: int
storage_class: str
last_modified: datetime
shard_locations: List[str]
class StrongConsistencyMetadataStore:
"""
Metadata store with strong read-after-write consistency.
Key mechanism: Synchronous replication to all replicas
before acknowledging write success.
Applies concepts:
- Week 5: Strong vs eventual consistency
- Week 5: Consensus protocols
"""
def __init__(self, replicas: List, quorum_size: int):
self.replicas = replicas
self.quorum_size = quorum_size # Majority
async def put_metadata(
self,
entry: MetadataEntry
) -> bool:
"""
Store metadata with strong consistency.
Only returns success after majority of replicas
have durably stored the metadata.
"""
# Generate monotonic version
entry.version_id = self._generate_version_id()
# Write to all replicas in parallel
write_tasks = [
self._write_to_replica(replica, entry)
for replica in self.replicas
]
# Wait for quorum (majority)
results = await asyncio.gather(
*write_tasks,
return_exceptions=True
)
successes = sum(1 for r in results if r is True)
if successes >= self.quorum_size:
# Quorum achieved - write is durable and consistent
return True
else:
# Rollback on failure
await self._rollback(entry)
return False
    async def get_metadata(
        self,
        bucket: str,
        key: str
    ) -> Optional[MetadataEntry]:
        """
        Read metadata - guaranteed to return the latest acknowledged version.

        Writes are acknowledged by a majority (W), so reading a majority (R)
        gives R + W > N: at least one replica in any read quorum holds the
        latest version. Return the entry with the highest version ID.
        """
        read_tasks = [
            self._read_from_replica(replica, bucket, key)
            for replica in self.replicas
        ]
        results = await asyncio.gather(*read_tasks, return_exceptions=True)

        responses = sum(1 for r in results if not isinstance(r, Exception))
        if responses < self.quorum_size:
            raise ConsistencyError("Could not reach a read quorum")

        entries = [r for r in results if isinstance(r, MetadataEntry)]
        if not entries:
            return None  # a quorum agrees the key does not exist
        return max(entries, key=lambda e: e.version_id)
async def list_objects(
self,
bucket: str,
prefix: str
) -> List[MetadataEntry]:
"""
List objects - strongly consistent.
After a PUT, the object is immediately visible in LIST.
"""
replica = self._select_nearest_replica()
# Read from partition responsible for this prefix
partition = self._get_partition(bucket, prefix)
return await self._list_from_partition(replica, partition, prefix)
    def _generate_version_id(self) -> str:
        """Generate a monotonically increasing, lexicographically ordered version ID."""
        # A zero-padded nanosecond timestamp makes string order match time order;
        # the random suffix breaks ties between concurrent writers.
        return f"{time.time_ns():020d}-{uuid.uuid4().hex[:8]}"
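The read-after-write argument above reduces to quorum intersection: with N replicas, a write quorum of W and a read quorum of R must overlap whenever R + W > N. This tiny check enumerates every pair of quorums for N=5, W=R=3 and confirms the overlap.

# analysis/quorum_overlap.py (why R + W > N guarantees read-after-write)
from itertools import combinations

N, W, R = 5, 3, 3
replicas = range(N)

overlaps = all(
    set(write_set) & set(read_set)
    for write_set in combinations(replicas, W)
    for read_set in combinations(replicas, R)
)
print(f"R + W > N: {R + W > N}, every read quorum sees the write: {overlaps}")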
Deep Dive 3: Index Partitioning (Week 1 Concepts)
Interviewer: "How do you index 500 trillion objects?"
The Problem
INDEXING 500 TRILLION OBJECTS
Challenge:
- Cannot fit in single index (way too large)
- Cannot scan linearly (too slow)
- Must support prefix queries
- Must handle hot prefixes
Requirements:
- O(1) lookups by key
- O(n) prefix scans (where n = matching keys)
- Automatic scaling
- No hotspots
The Solution: Key Partitioning
INDEX PARTITIONING ARCHITECTURE

  OBJECT KEY: s3://my-bucket/2024/logs/server1/app.log

  PARTITIONING STRATEGY
    1. Hash the bucket name -> bucket partition
       hash("my-bucket") -> partition 42
    2. Within the bucket, partition by key prefix
       "2024/logs" -> sub-partition 7
    3. Keys are sorted within each partition for efficient prefix scans

  PARTITION STRUCTURE
    BUCKET PARTITION 42
      Sub-partition 1 (prefix "")      key1 -> location, key2 -> location, ...
      Sub-partition 2 (prefix "a")     a/b -> location,  a/c -> location,  ...
      Sub-partition 7 (prefix "2024")  2024/logs/...,    2024/data/...,    ...

  DYNAMIC PARTITIONING
  - S3 monitors request rates per partition
  - A partition that exceeds ~5,500 GET/sec is split
  - Splits happen automatically and transparently
  - There is no limit on the number of partitions

  HOT PREFIX HANDLING
    Problem:  every key starts with "2024/" -> one hot partition
    Solution: add a randomized prefix

    Instead of:  2024/log-001.txt
    Use:         a7b3/2024/log-001.txt

    This spreads load across many partitions.
Implementation
# index/partition_manager.py
"""
Index Partitioning for S3
Dynamically partitions index to handle 500+ trillion objects.
"""
from dataclasses import dataclass
from typing import List, Optional, Dict
import hashlib
@dataclass
class Partition:
"""An index partition."""
partition_id: str
prefix_range: tuple # (start_prefix, end_prefix)
request_rate: float
node_assignment: str
class PartitionManager:
"""
Manages index partitions for S3.
Key features:
- Dynamic splitting based on load
- Prefix-based range partitioning
- Automatic rebalancing
Applies concepts:
- Week 1: Hash vs range partitioning
- Week 1: Hot key handling
"""
# Thresholds for partition operations
SPLIT_THRESHOLD_RPS = 5500 # GET requests/second
MIN_PARTITION_SIZE = 1000 # Objects
def __init__(self, partition_store):
self.partition_store = partition_store
def get_partition_for_key(
self,
bucket: str,
key: str
) -> Partition:
"""
Find the partition responsible for a key.
Uses hierarchical lookup:
1. Hash bucket to bucket partition
2. Binary search within bucket for key prefix
"""
# Level 1: Bucket partition
bucket_hash = self._hash_bucket(bucket)
bucket_partition = self._get_bucket_partition(bucket_hash)
# Level 2: Key prefix partition
key_partition = self._find_key_partition(bucket_partition, key)
return key_partition
async def check_and_split(
self,
partition: Partition
) -> List[Partition]:
"""
Check if partition needs splitting due to high load.
Triggered when request rate exceeds threshold.
"""
if partition.request_rate < self.SPLIT_THRESHOLD_RPS:
return [partition] # No split needed
# Split partition in half by prefix
start, end = partition.prefix_range
mid = self._find_split_point(partition)
new_partitions = [
Partition(
partition_id=f"{partition.partition_id}-a",
prefix_range=(start, mid),
request_rate=partition.request_rate / 2,
node_assignment=self._select_node()
),
Partition(
partition_id=f"{partition.partition_id}-b",
prefix_range=(mid, end),
request_rate=partition.request_rate / 2,
node_assignment=self._select_node()
)
]
# Atomic partition update
await self.partition_store.atomic_split(partition, new_partitions)
return new_partitions
def _hash_bucket(self, bucket: str) -> int:
"""Hash bucket name to partition number."""
return int(hashlib.md5(bucket.encode()).hexdigest(), 16) % 10000
def _find_split_point(self, partition: Partition) -> str:
"""
Find optimal split point for partition.
Analyzes key distribution to split evenly.
"""
# Sample keys in partition
keys = self.partition_store.sample_keys(partition, n=1000)
# Find median prefix
keys.sort()
median_key = keys[len(keys) // 2]
# Extract common prefix
return self._extract_prefix(median_key)
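The hot-prefix advice is purely a client-side naming convention. A minimal sketch: derive the random-looking prefix from a hash of the key itself so it stays deterministic and the object can still be located later. This helper is an assumption for illustration, not part of any S3 API.

# client/key_naming.py (client-side hot-prefix mitigation, sketch)
import hashlib


def spread_key(natural_key: str) -> str:
    """Prepend a short, deterministic hash prefix so sequential keys
    (e.g. '2024/log-001.txt', '2024/log-002.txt') spread across partitions."""
    prefix = hashlib.sha256(natural_key.encode()).hexdigest()[:3]   # 4096 buckets
    return f"{prefix}/{natural_key}"


print(spread_key("2024/log-001.txt"))   # e.g. '<3 hex chars>/2024/log-001.txt'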
Deep Dive 4: Storage Classes and Lifecycle (Week 4 Concepts)
Interviewer: "How do storage classes work for cost optimization?"
Storage Classes
S3 STORAGE CLASSES

  Storage Class            Durability       Availability     Use Case
  ----------------------   --------------   --------------   -----------------------
  S3 Standard              99.999999999%    99.99%           Frequent access
  S3 Express One Zone      99.999999999%    99.95% (1 AZ)    Single-digit ms latency
  S3 Standard-IA           99.999999999%    99.9%            Infrequent access
  S3 One Zone-IA           99.999999999%*   99.5% (1 AZ)     Recreatable data
  S3 Intelligent-Tiering   99.999999999%    99.9%            Unknown access patterns
  S3 Glacier Instant       99.999999999%    99.99%           Archive (ms access)
  S3 Glacier Flexible      99.999999999%    99.99%           Archive (minutes-hours)
  S3 Glacier Deep Archive  99.999999999%    99.99%           Archive (12+ hours)

  * within a single AZ

  COST COMPARISON (per GB/month, us-east-1, 2025)
    S3 Standard:         $0.023
    S3 Standard-IA:      $0.0125
    S3 Glacier Instant:  $0.004
    S3 Glacier Deep:     $0.00099

  Glacier Deep Archive is roughly 23x cheaper than Standard.
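Using the per-GB prices listed above, a quick comparison for a hypothetical 100 TB dataset (retrieval fees, request charges, and minimum storage durations are ignored here):

# costs/storage_classes.py (monthly cost sketch using the prices listed above)
PRICES_PER_GB_MONTH = {
    "S3 Standard":        0.023,
    "S3 Standard-IA":     0.0125,
    "S3 Glacier Instant": 0.004,
    "S3 Glacier Deep":    0.00099,
}

DATASET_GB = 100 * 1024   # 100 TB in GB

for storage_class, price in PRICES_PER_GB_MONTH.items():
    print(f"{storage_class:<20} ${DATASET_GB * price:>9,.2f}/month")

# S3 Standard ~ $2,355/month vs Glacier Deep ~ $101/month for the same 100 TB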
Lifecycle Policies
# lifecycle/policy_engine.py
"""
S3 Lifecycle Policy Engine
Automatically transitions objects between storage classes
based on age and access patterns.
"""
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, timedelta
from enum import Enum
class StorageClass(Enum):
STANDARD = "STANDARD"
STANDARD_IA = "STANDARD_IA"
INTELLIGENT_TIERING = "INTELLIGENT_TIERING"
GLACIER_INSTANT = "GLACIER_IR"
GLACIER_FLEXIBLE = "GLACIER"
GLACIER_DEEP = "DEEP_ARCHIVE"
@dataclass
class LifecycleRule:
"""A lifecycle transition rule."""
rule_id: str
prefix_filter: Optional[str]
tag_filter: Optional[dict]
transitions: List[dict] # [{days: 30, storage_class: "STANDARD_IA"}]
expiration_days: Optional[int]
class LifecyclePolicyEngine:
"""
Evaluates and applies lifecycle policies to objects.
Common pattern:
1. Standard for 30 days (frequent access)
2. Standard-IA for 60 days (occasional access)
3. Glacier for 1 year (rare access)
4. Delete after 7 years (compliance)
Applies concepts:
- Week 4: Cache tiering strategies
- Week 8: Data lifecycle management
"""
def __init__(self, metadata_store, storage_service):
self.metadata = metadata_store
self.storage = storage_service
async def evaluate_object(
self,
bucket: str,
key: str,
rules: List[LifecycleRule]
) -> Optional[StorageClass]:
"""
Determine if object should transition.
"""
obj_metadata = await self.metadata.get(bucket, key)
for rule in rules:
if not self._matches_filter(obj_metadata, rule):
continue
age_days = (datetime.utcnow() - obj_metadata.last_modified).days
# Check expiration
if rule.expiration_days and age_days >= rule.expiration_days:
return "DELETE"
# Check transitions (in reverse order - most aggressive first)
for transition in sorted(
rule.transitions,
key=lambda t: t['days'],
reverse=True
):
if age_days >= transition['days']:
target_class = StorageClass(transition['storage_class'])
if target_class != obj_metadata.storage_class:
return target_class
return None # No transition needed
async def apply_transition(
self,
bucket: str,
key: str,
target_class: StorageClass
) -> bool:
"""
Transition object to new storage class.
For Glacier classes, this involves:
1. Reading object from current storage
2. Re-encoding for archive (more aggressive erasure coding)
3. Writing to archive storage
4. Updating metadata
"""
if target_class == "DELETE":
return await self._delete_object(bucket, key)
# Get current object location
current_shards = await self.metadata.get_shard_locations(bucket, key)
# Transition to new storage class
new_shards = await self.storage.transition(
current_shards,
target_class
)
# Update metadata atomically
await self.metadata.update_storage_class(
bucket, key, target_class, new_shards
)
return True
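As a usage sketch, the "common pattern" from the docstring maps onto a single rule like the one below; the commented evaluation call assumes the hypothetical metadata and storage dependencies wired in above.

# Example rule for the common pattern: Standard for 30 days, Standard-IA next,
# Glacier after ~90 days, delete after 7 years (sketch)
log_retention_rule = LifecycleRule(
    rule_id="logs-retention",
    prefix_filter="logs/",
    tag_filter=None,
    transitions=[
        {"days": 30, "storage_class": "STANDARD_IA"},
        {"days": 90, "storage_class": "GLACIER"},
    ],
    expiration_days=7 * 365,
)

# engine = LifecyclePolicyEngine(metadata_store, storage_service)  # hypothetical deps
# action = await engine.evaluate_object("my-bucket", "logs/2024/app.log",
#                                        [log_retention_rule])
# if action is not None:
#     await engine.apply_transition("my-bucket", "logs/2024/app.log", action)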
Phase 5: Scaling and Edge Cases
Scaling Strategies
SCALING STRATEGIES

  AUTOMATIC PARTITION SCALING
  - S3 monitors request rates per prefix
  - Hot partitions are split automatically
  - No manual intervention needed
  - Transparent to applications

  WORKLOAD DECORRELATION
  At S3's scale, individual workload spikes average out:
  - Millions of customers
  - Trillions of objects
  - Most data is idle most of the time
  - Hot spots become statistically rare

  BEST PRACTICES FOR CLIENTS
  1. Use randomized prefixes for high-throughput workloads
     Bad:  logs/2024-01-01/event-001.json
     Good: a7b3c1/logs/2024-01-01/event-001.json

  2. Implement exponential backoff for 503 errors (see the sketch below)
     - Start with a 100 ms delay
     - Double the delay on each retry
     - Cap at around 5 retries

  3. Use multipart upload for large objects
     - Upload parts in parallel
     - Retry individual parts on failure
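Best practice 2 in the list above is worth showing in code. This is a minimal sketch: do_request stands in for whatever issues the S3 call, SlowDownError is a hypothetical wrapper for a 503 response, and jitter is added because clients retrying in lockstep simply recreate the spike.

# client/backoff.py (exponential backoff with jitter for 503s, sketch)
import random
import time


class SlowDownError(Exception):
    """Hypothetical wrapper for an S3 503 Slow Down response."""


def with_backoff(do_request, max_retries: int = 5, base_delay: float = 0.1):
    """Call do_request(), retrying on 503 with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return do_request()
        except SlowDownError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)           # 100ms, 200ms, 400ms, ...
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter to decorrelate clients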
Edge Cases
EDGE CASE 1: Entire AZ Failure
Scenario: Entire availability zone loses power
Impact: ~33% of shards unavailable
Solution:
- Erasure coding allows reconstruction from 10 of 16 shards
- Even with 6 shards in failed AZ, data is recoverable
- Repair service immediately starts re-replicating
- No data loss, brief availability impact
EDGE CASE 2: Correlated Disk Failures
Scenario: Heat wave causes accelerated disk failures
Impact: Higher than normal failure rate
Solution:
- Continuous health monitoring
- Proactive migration from stressed disks
- Spare capacity on all disks for fast recovery
- Parallel re-replication across thousands of drives
EDGE CASE 3: Hot Object
Scenario: Single object gets millions of requests
Impact: Could overwhelm storage nodes
Solution:
- Caching at front end layer
- Object replicated to more nodes automatically
- CloudFront integration for CDN caching
- Request throttling with 503 errors + backoff
EDGE CASE 4: Concurrent Writes to Same Key
Scenario: Two clients PUT same key simultaneously
Impact: Which version wins?
Solution:
- Last-writer-wins semantics
- Timestamps determine winner
- Versioning available to keep all versions
- Conditional writes (If-Match) for optimistic locking
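Edge case 4 mentions conditional writes; the read-modify-write loop below sketches the optimistic-locking pattern built on ETags and If-Match semantics. The client object, its method names, and the if_match keyword are hypothetical stand-ins - consult your SDK's documentation for the exact parameters S3 conditional requests use.

# client/optimistic_write.py (optimistic locking with ETags, hypothetical client API)
class PreconditionFailed(Exception):
    """Raised when the server rejects a write because the ETag no longer matches."""


def update_object(client, bucket: str, key: str, transform) -> None:
    """Read-modify-write loop that only succeeds if nobody else changed the object."""
    while True:
        current = client.get_object(bucket, key)              # hypothetical call
        new_body = transform(current["body"])
        try:
            # If-Match semantics: the write succeeds only if the ETag is unchanged
            client.put_object(bucket, key, new_body,
                              if_match=current["etag"])       # hypothetical kwarg
            return
        except PreconditionFailed:
            continue   # someone else won the race; re-read and retry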
Phase 6: Monitoring and Operations
S3 MONITORING DASHBOARD

  DURABILITY HEALTH
    Shard integrity:           100%
    Objects with all shards:   99.9999999%
    Repair queue depth:        12,345
    Repair throughput:         1.2M objects/hour

  AVAILABILITY
    Request success rate:      99.99%
    503 rate:                  0.001%
    GET latency (p50):         45 ms
    PUT latency (p50):         85 ms

  STORAGE
    Total objects:             512T
    Total storage:             287 EB
    Daily growth:              +2.1 EB
    Disk utilization:          67%

  REQUEST PATTERNS
    GET requests/sec:          187M
    PUT requests/sec:          23M
    LIST requests/sec:         8M
    Hot partitions:            142
Alerting Strategy
CRITICAL (PagerDuty):
- Durability below 99.9999999%
- Any data loss detected
- Region-wide availability < 99.9%
- Repair queue > 1M objects
WARNING (Slack):
- 503 rate > 0.01%
- Disk utilization > 80%
- Hot partition count > 500
- Latency p99 > 500ms
INFO (Dashboard):
- Daily storage growth
- Traffic patterns
- Storage class distribution
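Those thresholds translate directly into simple rules. Below is a sketch of an evaluator over the metrics listed above; the function and metric names are assumptions, while the numeric thresholds are the ones in the alerting strategy.

# ops/alerts.py (threshold classification sketch for the rules above)
def classify(m: dict) -> str:
    """Return CRITICAL / WARNING / INFO for a snapshot of the listed metrics."""
    if (m["durability"] < 99.9999999 or m["data_loss_detected"]
            or m["region_availability"] < 99.9 or m["repair_queue"] > 1_000_000):
        return "CRITICAL"   # page the on-call
    if (m["rate_503_pct"] > 0.01 or m["disk_utilization_pct"] > 80
            or m["hot_partitions"] > 500 or m["latency_p99_ms"] > 500):
        return "WARNING"    # post to the team channel
    return "INFO"           # dashboards only


print(classify({
    "durability": 99.999999999, "data_loss_detected": False,
    "region_availability": 99.99, "repair_queue": 12_345,
    "rate_503_pct": 0.001, "disk_utilization_pct": 67,
    "hot_partitions": 142, "latency_p99_ms": 310,
}))   # -> INFO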
Interview Conclusion
Interviewer: "Excellent work. You've covered durability, consistency, and scaling really well. Any questions?"
You: "Thank you! I'm curious about the 2020 strong consistency change. How did AWS achieve it without sacrificing performance?"
Interviewer: "Great question. They invested heavily in their inter-AZ network to make synchronous replication nearly as fast as async. Combined with optimizations to their metadata layer, they essentially eliminated the trade-off."
Summary: Concepts Applied
CONCEPTS FROM THE 10-WEEK COURSE IN THE S3 DESIGN

  WEEK 1: DATA AT SCALE
  - Partitioning: key-based index partitioning
  - Replication: multi-AZ data distribution
  - Hot keys: automatic partition splitting

  WEEK 2: FAILURE-FIRST DESIGN
  - Erasure coding: survive up to 6 shard failures
  - Checksums: detect and repair corruption
  - Bracketing: verify before confirming a write

  WEEK 4: CACHING
  - Storage tiers: hot, warm, cold, frozen
  - Lifecycle: automatic tier transitions

  WEEK 5: CONSISTENCY
  - Strong consistency: read-after-write
  - Synchronous replication: a quorum of replicas before the ACK
  - Conflict resolution: last-writer-wins

  WEEK 10: PRODUCTION READINESS
  - 11 nines durability target
  - Continuous repair monitoring
  - Automatic scaling without intervention
WHY S3 IS AN ENGINEERING MARVEL

  DURABILITY
  - 11 nines (99.999999999%)
  - On average, at most 1 object lost per 10 million every 10,000 years
  - Achieved with only 1.6x storage overhead

  SCALE
  - 500+ trillion objects
  - Hundreds of exabytes
  - 200+ million requests/second

  CONSISTENCY
  - Strong read-after-write since 2020
  - No stale reads
  - No measurable performance penalty

  INNOVATION
  - The first cloud object storage service (2006)
  - Strong consistency at scale without the usual performance trade-off
  - 9 storage classes covering nearly every use case
  - The storage foundation of the AWS ecosystem
Sources
Official AWS:
- AWS S3 Documentation: https://docs.aws.amazon.com/s3/
- S3 Consistency: https://aws.amazon.com/s3/consistency/
- S3 Storage Classes: https://aws.amazon.com/s3/storage-classes/
- S3 Performance Guidelines: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
Architecture Deep Dives:
- ByteByteGo - How S3 Stores 350 Trillion Objects: https://blog.bytebytego.com/p/how-amazon-s3-stores-350-trillion
- High Scalability - Behind AWS S3's Massive Scale: https://highscalability.com/behind-aws-s3s-massive-scale/
- System Design One - S3 Durability: https://newsletter.systemdesign.one/p/amazon-s3-durability
Strong Consistency:
- AWS Blog - S3 Strong Consistency: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
- InfoQ - S3 Strong Consistency: https://www.infoq.com/news/2020/12/aws-s3-strong-consistency/
Recent Updates:
- Blocks and Files - S3 Updates 2025: https://blocksandfiles.com/2025/12/03/aws-s3/
Self-Assessment Checklist
After studying this case study, you should be able to:
- Explain erasure coding and why it beats replication on durability per unit of storage overhead
- Calculate durability math (11 nines)
- Design an index partitioning scheme for trillions of objects
- Explain how S3 achieved strong consistency without sacrificing performance
- Describe bracketing and why it's critical for durability
- Design lifecycle policies for cost optimization
- Handle hot prefixes through randomization
- Implement exponential backoff for 503 errors
- Explain the trade-offs between different storage classes
- Design for correlated failures (AZ failure, disk batch failure)
- Monitor durability and availability at scale
- Articulate why S3 is considered the "greatest cloud service"
This case study covers AWS S3, the world's most durable object storage system, demonstrating how to achieve 11 nines of durability with strong consistency at planetary scale.