Bonus Problem 2: WhatsApp Messaging
How 50 Engineers Built the World's Most Efficient Messaging Platform
📱 The Impossible Made Simple
When Facebook acquired WhatsApp for $19 billion in 2014, something didn't add up.
The messaging app was handling 50 billion messages daily, but with only 32 engineers on the backend.
How is that even possible?
THE WHATSAPP PARADOX (2014 Acquisition)
────────────────────────────────────────────────────────────────────────
  MESSAGES PER DAY:        50 billion
  ENGINEERS:               32 (backend), 50 (total)
  SERVERS:                 ~550
  COST PER USER:           Fraction of competitors
  CONNECTIONS PER SERVER:  2-3 million (industry: ~100K typical)
  TECHNOLOGY STACK:        Erlang + FreeBSD (not Java, not Linux)

  For comparison:
  • Facebook Messenger: 1000+ engineers
  • Most messaging apps: 100-500 engineers for similar scale

  WhatsApp achieved 20x efficiency. How?
────────────────────────────────────────────────────────────────────────
TODAY'S SCALE (2025)
────────────────────────────────────────────────────────────────────────
  MONTHLY ACTIVE USERS:    2.5+ billion
  DAILY MESSAGES:          100+ billion
  PEAK MESSAGES/SECOND:    3+ million
  COUNTRIES:               180+
  STATUS UPDATES/DAY:      500+ million
  VOICE/VIDEO CALLS:       2+ billion minutes/day
  ENCRYPTION:              End-to-end (Signal Protocol); WhatsApp can't read messages
  UPTIME:                  99.99%+
────────────────────────────────────────────────────────────────────────
This is the system we'll design today, and we'll see why WhatsApp's technology choices matter.
The Interview Begins
You're interviewing at a startup that wants to build the next great messaging app. The CTO draws on the whiteboard:
Interviewer: "WhatsApp proved that a small team can build a messaging platform for billions. I want you to design a messaging system that can scale to that level. Walk me through how you'd approach it."
────────────────────────────────────────────────────────────────────────
  Design a Global Messaging Platform

  Build a messaging system that can handle billions of users with:

  Requirements:
  • Support 2 billion users
  • Handle 100+ billion messages per day
  • Real-time message delivery (< 500ms for online users)
  • Offline message storage and delivery
  • End-to-end encryption (server cannot read messages)
  • Group messaging (up to 1000 members)
  • Media sharing (images, video, documents)
  • Presence (online/offline/typing indicators)
  • Message status (sent, delivered, read)
  • 99.99% availability

  Constraint: Build it with a small team (< 100 engineers)
────────────────────────────────────────────────────────────────────────
Interviewer: "The team size constraint is intentional. I want to see if you understand that architecture choices determine how many engineers you need. WhatsApp proved you can do more with less."
Phase 1: Requirements Clarification
You: "Let me make sure I understand the constraints and can make the right trade-offs."
Your Questions
You: "First, what's the message delivery guarantee? Is it okay if a message is delayed, or must it be truly real-time?"
Interviewer: "For online users, we want sub-second delivery. For offline users, messages should be queued and delivered when they come online. Messages should never be lost."
You: "What about message ordering? If I send two messages quickly, must they arrive in order?"
Interviewer: "Within a single conversation, yes: messages must be ordered. But I don't care if Message A to Bob arrives before Message B to Alice, even if B was sent first."
You: "For end-to-end encryption, does the server ever need to read message content?"
Interviewer: "Never. The server should only route encrypted blobs. This is non-negotiable for user trust."
You: "What's the read/write ratio? How often do users check for messages vs send them?"
Interviewer: "Heavy read. Users open the app frequently to check for messages. For every message sent, there are probably 10 message reads/checks. Also, presence updates are extremely chatty: typing indicators, online status."
You: "What about media? Video messages can be large."
Interviewer: "Media follows a different path. Upload to storage, send a reference in the message. The encrypted media is stored separately from chat servers."
Requirements Summary
Functional Requirements:
1. MESSAGING
  • One-to-one messaging
  • Group messaging (up to 1000 members, matching the stated requirement)
  • Message types: text, image, video, audio, document, location
  • Message status: sent → delivered → read
  • Reply, forward, delete (for everyone)
2. PRESENCE
  • Online/offline status
  • Last seen timestamp
  • Typing indicators
  • Privacy controls (hide last seen)
3. MEDIA HANDLING
  • Upload and download media
  • Automatic compression
  • Progressive loading (blurry → clear)
  • Media forwarding (reuse uploaded media)
4. SECURITY
  • End-to-end encryption for all messages
  • Device verification (QR code scan)
  • Two-factor authentication
5. SYNC
  • Multi-device support
  • Message sync across devices
  • Local backup (encrypted)
Non-Functional Requirements:
SCALE
  • 2 billion users
  • 500 million daily active users
  • 100 billion messages/day
  • 3 million messages/second (peak)
LATENCY
  • Online delivery: < 500ms
  • Offline delivery: < 5s after coming online
  • Typing indicators: < 200ms
AVAILABILITY
  • 99.99% uptime
  • Graceful degradation (messages queue if delays)
STORAGE
  • Messages stored until delivered
  • Delivered messages deleted from server
  • Media retained for 30 days
EFFICIENCY
  • Minimal bandwidth (works on 2G)
  • Minimal battery drain
  • Small engineering team
Phase 2: Back of the Envelope Estimation
You: "Let me work through the numbers to understand infrastructure needs."
Traffic Calculations
MESSAGE VOLUME
Daily messages: 100,000,000,000 (100 billion)
Seconds per day: 86,400
Average MPS: ~1.16 million messages/second
Peak multiplier: 3x (evening hours globally)
Peak MPS: ~3.5 million messages/second
Each message triggers:
├── Sender acknowledgment: 1 message
├── Recipient delivery: 1 message
├── Read receipt: 1 message
└── Total events/message: ~3-4x
Effective events/second: ~10-15 million at peak
Connection Calculations
CONCURRENT CONNECTIONS
Daily active users: 500 million
Peak concurrent: ~10% = 50 million connections
If each server handles 2 million connections:
├── Servers needed: 25 servers (just for connections!)
└── With redundancy: 50-75 servers
WhatsApp achieved 2-3 million connections per server.
Most companies: 50,000-100,000 per server.
This 20-60x efficiency gap (2M vs 100K at the low end, 3M vs 50K at the high end) is the key insight.
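The traffic and connection arithmetic above is easy to re-run with different inputs. The following is a minimal sketch using only the figures quoted in this section; the function names and the 3.5x events-per-message midpoint are illustrative assumptions.

```python
import math

SECONDS_PER_DAY = 86_400


def traffic_estimate(daily_messages, peak_multiplier, events_per_message):
    """Return (average msg/s, peak msg/s, peak protocol events/s)."""
    avg_mps = daily_messages / SECONDS_PER_DAY
    peak_mps = avg_mps * peak_multiplier
    return avg_mps, peak_mps, peak_mps * events_per_message


def servers_needed(concurrent_connections, connections_per_server):
    """Minimum number of chat servers needed just to hold the connections."""
    return math.ceil(concurrent_connections / connections_per_server)


# Inputs from the text: 100B messages/day, 3x peak, ~3-4 events per message
# (3.5 is an assumed midpoint), 500M DAU with ~10% concurrently connected.
avg_mps, peak_mps, peak_events = traffic_estimate(100_000_000_000, 3, 3.5)
peak_concurrent = 500_000_000 // 10

print(f"average: {avg_mps / 1e6:.2f}M msg/s, peak: {peak_mps / 1e6:.2f}M msg/s")
print(f"peak events: {peak_events / 1e6:.1f}M/s")
print("servers at 2M connections each:", servers_needed(peak_concurrent, 2_000_000))
print("servers at 100K connections each:", servers_needed(peak_concurrent, 100_000))
```

Holding the same 50 million connections at 100K per server takes 500 machines instead of 25, which is where the team-size argument comes from.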
Storage Calculations
MESSAGE STORAGE
Average message size: ~500 bytes (encrypted)
Daily messages: 100 billion
Daily storage: 50 TB/day (if we stored everything)
BUT: WhatsApp deletes messages after delivery!
├── Undelivered at any time: ~5% of daily messages
├── Actual storage: ~2.5 TB active
└── Plus offline queues: ~10 TB
Media storage is separate:
├── Daily uploads: ~10 billion media files
├── Average size: 100 KB (compressed)
├── Daily media: ~1 PB/day
├── 30-day retention: ~30 PB
└── With deduplication: ~15 PB
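The storage figures can be checked the same way. This sketch assumes decimal units (1 KB = 1000 bytes) and uses hypothetical helper names; the 5% undelivered fraction and 30-day retention come from the estimate above.

```python
KB, TB, PB = 10**3, 10**12, 10**15


def message_storage_tb(daily_messages, avg_bytes, undelivered_fraction):
    """Return (TB/day if everything were kept, TB actually held undelivered)."""
    daily_bytes = daily_messages * avg_bytes
    return daily_bytes / TB, daily_bytes * undelivered_fraction / TB


def media_storage_pb(daily_files, avg_bytes, retention_days):
    """Return (PB uploaded per day, PB held over the retention window)."""
    daily_bytes = daily_files * avg_bytes
    return daily_bytes / PB, daily_bytes * retention_days / PB


daily_tb, active_tb = message_storage_tb(100_000_000_000, 500, 0.05)
media_daily_pb, media_retained_pb = media_storage_pb(10_000_000_000, 100 * KB, 30)

print(f"messages: {daily_tb:.0f} TB/day if stored, ~{active_tb:.1f} TB undelivered")
print(f"media: {media_daily_pb:.0f} PB/day, {media_retained_pb:.0f} PB over 30 days")
```

The contrast is the point: deleting on delivery keeps message storage in the terabytes, while media dominates at tens of petabytes.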
Phase 3: High-Level Architecture
You: "WhatsApp's architecture is remarkably simple. That's the secret: simplicity enables a small team."
The WhatsApp Philosophy
WHATSAPP'S ENGINEERING PRINCIPLES
────────────────────────────────────────────────────────────────────────
1. FOCUS
   One thing, done exceptionally well: messaging.
   No social feed, no stories (initially), no ads.
   Feature creep is the enemy of reliability.

2. RELIABILITY OVER FEATURES
   A message that doesn't deliver is worse than no feature.
   Every new feature adds complexity and potential failures.
   Say no to most feature requests.

3. RIGHT TOOL FOR THE JOB
   Erlang for concurrency (designed for telecom switches).
   FreeBSD for networking performance.
   Mnesia for in-memory distributed data.
   Unconventional, but perfect fit.

4. SMALL TEAM = BETTER COMMUNICATION
   32 engineers means everyone knows the whole system.
   Every line of code reviewed by founders (initially).
   No coordination overhead between 50 microservice teams.

5. LET IT CRASH
   Erlang philosophy: processes crash, supervisors restart.
   Don't try to handle every error; let it fail and recover.
   Simple, predictable failure recovery.
────────────────────────────────────────────────────────────────────────
System Architecture
WHATSAPP ARCHITECTURE
────────────────────────────────────────────────────────────────────────
CLIENT LAYER
   Android App  |  iOS App  |  WhatsApp Web
        │
        │  (Persistent TCP + TLS)
        │  (Custom binary protocol)
        ▼
EDGE / GATEWAY
   LOAD BALANCERS
   (GeoDNS routes to nearest data center)
        │
        ▼
   CHAT SERVERS (Erlang + FreeBSD)
   Chat 1, Chat 2, Chat 3, ... Chat N   (2-3M connections each)
   Each connection = 1 Erlang process (~2KB memory)
   Process handles: auth, presence, message routing
        │
        ▼
BACKEND SERVICES
   MESSAGE ROUTING    Direct P2P between chat servers
   OFFLINE STORE      Queue for offline users
   MEDIA STORAGE      S3-like blob store for media
   USER REGISTRY      User → Server mapping
   GROUP METADATA     Group membership
   KEYS (for E2E)     Public key distribution

   All backed by: Mnesia (Erlang distributed DB) + ETS (in-memory)
────────────────────────────────────────────────────────────────────────
Message Flow
MESSAGE FLOW: ALICE SENDS "HELLO" TO BOB
Participants: Alice's Phone, WhatsApp Servers, Bob's Phone

1. ENCRYPT (on Alice's phone)
   Message encrypted with Bob's public key (Signal Protocol)
2. SEND (Alice → Server)
   Encrypted blob + recipient: bob@wa
3. ACK: SENT ✓ (Server → Alice)
   Server received message (single tick)
4. LOOKUP (Server)
   Where is Bob connected? User registry → Server 7

IF BOB IS ONLINE
5. ROUTE (Server → Bob)
   Forward to Bob's process
6. DELIVERED ACK (Bob → Server)
   Bob's device got it
7. DELIVERED ✓✓ (Server → Alice)
   Double tick

BOB READS THE MESSAGE
8. READ RECEIPT (Bob → Server)
   Bob opened the chat
9. READ ✓✓ blue (Server → Alice)
   Blue double tick

KEY INSIGHT: Server NEVER decrypts the message.
It only routes the encrypted blob and tracks delivery status.
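The tick progression in the flow above (sent, delivered, read) behaves like a small state machine that only ever moves forward. Here is a minimal sketch of that idea; `MessageStatus` and `StatusTracker` are hypothetical names for illustration, not WhatsApp's actual types, and the rule of ignoring stale receipts is an assumption about how out-of-order acknowledgements might be handled.

```python
from enum import IntEnum
from typing import Dict


class MessageStatus(IntEnum):
    PENDING = 0    # not yet acknowledged by the server
    SENT = 1       # single tick: server received the message
    DELIVERED = 2  # double tick: recipient's device got it
    READ = 3       # blue double tick: recipient opened the chat


class StatusTracker:
    """Tracks per-message status on the sender's side; status never regresses."""

    def __init__(self) -> None:
        self._status: Dict[str, MessageStatus] = {}

    def status(self, message_id: str) -> MessageStatus:
        return self._status.get(message_id, MessageStatus.PENDING)

    def advance(self, message_id: str, new_status: MessageStatus) -> bool:
        """Apply a receipt. Returns False for duplicates or stale receipts."""
        if new_status <= self.status(message_id):
            return False
        self._status[message_id] = new_status
        return True


tracker = StatusTracker()
tracker.advance("msg-1", MessageStatus.SENT)
tracker.advance("msg-1", MessageStatus.READ)       # read receipt arrives first
tracker.advance("msg-1", MessageStatus.DELIVERED)  # late receipt is ignored
print(tracker.status("msg-1").name)
```

Because receipts travel as separate events, making the update monotonic keeps the UI from flickering backward when they arrive late or twice.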
Phase 4: Deep Dives
Deep Dive 1: Why Erlang?
Week 2 concepts: Failure handling, resilience. Week 10 concepts: Operational simplicity.
You: "WhatsApp's secret weapon is Erlang, a language designed in the 1980s for telephone switches. It's a perfect fit for messaging."
The Challenge:
THE CONCURRENCY PROBLEM
Traditional thread-per-connection approach (classic Java or Python servers):
├── Each connection = 1 OS thread
├── OS thread = ~1MB memory overhead
├── Context switching is expensive
├── 1 million connections = 1 million threads = ~1TB RAM!
└── Practical limit: ~50,000-100,000 connections/server
WhatsApp needs:
├── 2-3 million connections per server
├── Each connection needs state (session, queue)
├── Must handle failures gracefully
└── Hot code reloading (no restarts for deploys)
Why Erlang Solves This:
ERLANG'S SUPERPOWERS
────────────────────────────────────────────────────────────────────────
1. LIGHTWEIGHT PROCESSES

   Erlang processes are NOT OS threads.
   They're managed by the BEAM virtual machine.

   OS Thread:                Erlang Process:
   • ~1MB memory             • ~2KB memory
   • Expensive switch        • Cheap switch
   • Limited count           • Millions possible

   WhatsApp runs 1 Erlang process per user connection.
   2 million connections = 2 million processes = ~4GB RAM
   (vs 2TB with OS threads)

2. "LET IT CRASH" PHILOSOPHY

   Traditional: Try to handle every error (complex, error-prone)
   Erlang: Let processes crash, supervisors restart them

                 ┌────────────┐
                 │ SUPERVISOR │  (monitors children)
                 └─────┬──────┘
          ┌────────────┼────────────┐
          ▼            ▼            ▼
      ┌───────┐    ┌───────┐    ┌───────┐
      │User A │    │User B │    │User C │
      │Process│    │Process│    │Process│
      └───────┘    └───┬───┘    └───────┘
                       │ CRASH!
                       ▼
                   ┌───────┐
                   │User B │  ← Supervisor restarts
                   │Process│    in milliseconds
                   └───────┘

   User A and C never notice. User B reconnects automatically.

3. HOT CODE RELOADING

   Deploy new code WITHOUT restarting the server.
   Connections stay alive during deployments.
   No maintenance windows needed.

4. DISTRIBUTED BY DESIGN

   Erlang processes can communicate across machines seamlessly.
   Send a message to a process on another server = same syntax.
   Built-in distribution, no external message queue needed.
────────────────────────────────────────────────────────────────────────
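The "let it crash" idea can be approximated outside Erlang. The sketch below is a rough Python `asyncio` analogy, not how BEAM supervisors are implemented: each worker runs under a small supervising loop that starts a fresh instance when it raises. `supervise` and `flaky_worker` are hypothetical names, and the restart cap loosely mirrors a supervisor's restart-intensity limit.

```python
import asyncio


async def supervise(start_worker, max_restarts=5):
    """Run a worker; if it raises, start a fresh one. Returns the restart count."""
    restarts = 0
    while True:
        try:
            await start_worker()
            return restarts
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up instead of restarting forever


def flaky_worker(crashes_before_success):
    """Hypothetical worker factory: crashes a fixed number of times, then finishes."""
    attempts = 0

    async def run():
        nonlocal attempts
        attempts += 1
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        if attempts <= crashes_before_success:
            raise RuntimeError("simulated crash")

    return run


async def main():
    # Each connection gets its own supervised worker, so one worker
    # crashing repeatedly does not disturb the other.
    return await asyncio.gather(
        supervise(flaky_worker(0)),
        supervise(flaky_worker(2)),
    )


def run_demo():
    return asyncio.run(main())


if __name__ == "__main__":
    print(run_demo())
```

The worker code contains no error handling at all; recovery lives in one place, which is the property the section attributes to Erlang supervisors.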
WhatsApp's Erlang Customizations:
# Pseudocode: How WhatsApp structures connection handling
# (Actual WhatsApp code is Erlang; this Python-style sketch shows the concept.
# UserRegistry, OfflineStore, MessageStatus, send, now, and close stand in
# for runtime facilities and are not defined here.)
"""
WhatsApp Connection Handler

Each user connection runs as a single Erlang process.
This process:
1. Owns the TCP socket
2. Maintains user session state
3. Routes messages to/from the user
4. Dies cleanly when user disconnects
"""


class UserConnectionProcess:
    """
    Conceptual model of WhatsApp's per-user Erlang process.
    In Erlang, this is implemented with gen_server behavior.

    Memory: ~2KB per process
    Lifecycle: Lives as long as user is connected
    """

    def __init__(self, socket, user_id):
        self.socket = socket
        self.user_id = user_id
        self.session_key = None
        self.message_queue = []  # For messages while processing
        self.last_seen = now()
        # Register this process in the global user registry.
        # self.pid is the process identifier assigned by the runtime.
        UserRegistry.register(user_id, self.pid)

    def handle_message(self, msg):
        """
        Main message handler, called for every incoming message.

        Erlang processes have a mailbox: messages queue up
        and are processed one at a time.
        """
        match msg:
            case AuthRequest(credentials):
                self.handle_auth(credentials)
            case OutgoingMessage(recipient, encrypted_content):
                self.route_to_recipient(recipient, encrypted_content)
            case IncomingMessage(sender, encrypted_content):
                self.deliver_to_socket(sender, encrypted_content)
            case PresenceUpdate(status):
                self.broadcast_presence(status)
            case Disconnect():
                self.cleanup_and_die()

    def route_to_recipient(self, recipient_id, content):
        """
        Route message to another user.

        This is the core of WhatsApp's routing:
        1. Look up recipient's process (which server?)
        2. Send message directly to that process
        """
        # Find recipient's process (could be on any server)
        recipient_process = UserRegistry.lookup(recipient_id)
        if recipient_process:
            # Direct process-to-process communication
            # Works across servers in Erlang!
            send(recipient_process, IncomingMessage(
                sender=self.user_id,
                content=content
            ))
        else:
            # User offline: store for later
            OfflineStore.queue(recipient_id, content)
        # Either way the server now holds the message, so the sender gets
        # the single tick. DELIVERED is relayed later, only after the
        # recipient's device acknowledges receipt (see the message flow).
        self.send_ack(MessageStatus.SENT)

    def cleanup_and_die(self):
        """
        Clean shutdown when user disconnects.

        In Erlang, when a process dies, its private heap is reclaimed
        automatically, so per-connection state cannot leak.
        """
        UserRegistry.unregister(self.user_id)
        close(self.socket)
        # Process simply terminates; Erlang handles cleanup
The "Island" Architecture:
WHATSAPP'S ISLAND ARCHITECTURE
To limit blast radius of failures, WhatsApp partitions
servers into independent "islands."
────────────────────────────────────────────────────────────────────────
                  ISLAND 1         ISLAND 2         ISLAND 3
  Chat Servers    1, 2, 3, 4       5, 6, 7, 8       9, 10, 11
  Users           A-H              I-P              Q-Z
  Replica         1a               2a               3a
                     │                │                │
                     └────────────────┼────────────────┘
                                      ▼
                            Cross-Island Routing
                            (for A→Z messages)

  BENEFITS:
  • Island 1 can fail without affecting Island 2
  • Upgrades can be rolled out island by island
  • Each island has its own replica for failover
  • Limits "blast radius" of any failure
────────────────────────────────────────────────────────────────────────
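To make the partitioning concrete, here is a minimal sketch of island lookup using the illustrative A-H / I-P / Q-Z split from the diagram. The alphabetical ranges, island names, and function names are assumptions taken from the picture for illustration; a production system would more plausibly partition by a hash of the user ID.

```python
ISLAND_RANGES = {
    "island-1": ("a", "h"),
    "island-2": ("i", "p"),
    "island-3": ("q", "z"),
}


def island_for(user_id: str) -> str:
    """Map a user to an island by the first letter of the ID (diagram's scheme)."""
    first = user_id[:1].lower()
    for island, (low, high) in ISLAND_RANGES.items():
        if low <= first <= high:
            return island
    raise ValueError(f"no island for user {user_id!r}")


def plan_route(sender: str, recipient: str) -> dict:
    """Decide whether a message stays inside one island or crosses islands."""
    src, dst = island_for(sender), island_for(recipient)
    return {"from": src, "to": dst, "cross_island": src != dst}


print(plan_route("alice", "bob"))  # both in island-1: stays local
print(plan_route("alice", "zoe"))  # island-1 to island-3: cross-island routing
```

Only the second message touches the cross-island path, so a failure confined to island 3 leaves traffic between Alice and Bob unaffected.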
Deep Dive 2: End-to-End Encryption (Signal Protocol)
Week 9 concepts: Security, encryption.
You: "WhatsApp uses the Signal Protocol for end-to-end encryption. This is critical: even WhatsApp can't read your messages."
END-TO-END ENCRYPTION OVERVIEW
────────────────────────────────────────────────────────────────────────
  WITHOUT E2EE:

  Alice ──"Hi"──▶ Server ──"Hi"──▶ Bob

  At the server:
  • Server can read "Hi"
  • Hackers who breach server can read "Hi"
  • Government subpoenas can access "Hi"

  WITH E2EE:

  Alice ──[encrypted blob]──▶ Server ──[encrypted blob]──▶ Bob

  At the server:
  • Server sees random bytes
  • Only Alice and Bob can decrypt
  • Even WhatsApp can't read it
────────────────────────────────────────────────────────────────────────
Signal Protocol Components:
SIGNAL PROTOCOL: THE BUILDING BLOCKS
────────────────────────────────────────────────────────────────────────
1. KEY PAIRS

   Each user has multiple key pairs:

   Identity Key (long-term):
   • Generated once when you install WhatsApp
   • Never changes (unless you reinstall)
   • Public part uploaded to WhatsApp servers

   Signed Pre-Key (medium-term):
   • Rotates periodically
   • Signed by Identity Key
   • Uploaded to servers

   One-Time Pre-Keys (ephemeral):
   • Batch of ~100 keys uploaded
   • Each used once then discarded
   • Provides forward secrecy

2. X3DH KEY AGREEMENT

   When Alice messages Bob for the first time:

   1. Alice → Server: "Give me Bob's keys"
   2. Server → Alice: Bob's Identity Key, Signed Pre-Key, One-Time Pre-Key
   3. Alice computes shared secret using:
      • Her Identity Key + Bob's Signed Pre-Key
      • Her Ephemeral Key + Bob's Identity Key
      • Her Ephemeral Key + Bob's Signed Pre-Key
      • Her Ephemeral Key + Bob's One-Time Pre-Key
      → Shared Secret (only Alice and Bob can compute)

3. DOUBLE RATCHET

   After initial key agreement, keys evolve with EVERY message:

   Message 1: Encrypt with Key_1
   Message 2: Encrypt with Key_2 (derived from Key_1)
   Message 3: Encrypt with Key_3 (derived from Key_2)
   ...

   Benefits:
   • Forward Secrecy: Compromise of Key_3 doesn't reveal Msg 1-2
   • Post-Compromise Security: If key leaked, future keys are safe
   • Break-in Recovery: System "heals" itself over time
────────────────────────────────────────────────────────────────────────
# encryption/signal_protocol.py
"""
Simplified illustration of Signal Protocol concepts.

The actual implementation uses:
- Curve25519 for key exchange
- AES-256-CBC for encryption
- HMAC-SHA256 for authentication

The cryptographic helpers (_generate_keypair, _dh, _kdf, _derive_keys,
_aes_encrypt, _aes_decrypt, _get_message_key, and the sender-key helpers)
are intentionally left out; this listing shows structure, not a usable
cryptographic implementation.
"""
from dataclasses import dataclass
from typing import Optional, Tuple
import hashlib
import hmac


@dataclass
class KeyPair:
    """A public/private key pair."""
    public_key: bytes
    private_key: bytes


@dataclass
class SessionState:
    """State for an encrypted session with another user."""
    their_identity_key: bytes
    their_signed_prekey: bytes
    # Ratchet state
    root_key: bytes
    chain_key: bytes
    message_number: int = 0


class SignalSession:
    """
    Simplified Signal Protocol session.

    This illustrates the key concepts; the actual implementation
    is more complex and uses proper cryptographic primitives.
    """

    def __init__(
        self,
        my_identity: KeyPair,
        their_identity_public: bytes,
        their_signed_prekey: bytes,
        their_one_time_prekey: Optional[bytes] = None
    ):
        self.my_identity = my_identity
        # Perform X3DH key agreement
        self.session = self._x3dh_key_agreement(
            their_identity_public,
            their_signed_prekey,
            their_one_time_prekey
        )

    def _x3dh_key_agreement(
        self,
        their_identity: bytes,
        their_signed_prekey: bytes,
        their_one_time_prekey: Optional[bytes]
    ) -> SessionState:
        """
        X3DH: Extended Triple Diffie-Hellman.

        Creates a shared secret from multiple DH exchanges.
        """
        # Generate ephemeral key for this session
        ephemeral = self._generate_keypair()

        # Perform multiple DH exchanges
        dh1 = self._dh(self.my_identity.private_key, their_signed_prekey)
        dh2 = self._dh(ephemeral.private_key, their_identity)
        dh3 = self._dh(ephemeral.private_key, their_signed_prekey)
        if their_one_time_prekey:
            dh4 = self._dh(ephemeral.private_key, their_one_time_prekey)
            shared_secret = self._kdf(dh1 + dh2 + dh3 + dh4)
        else:
            shared_secret = self._kdf(dh1 + dh2 + dh3)

        # Derive initial root and chain keys
        root_key, chain_key = self._derive_keys(shared_secret)
        return SessionState(
            their_identity_key=their_identity,
            their_signed_prekey=their_signed_prekey,
            root_key=root_key,
            chain_key=chain_key
        )

    def encrypt(self, plaintext: bytes) -> Tuple[bytes, int]:
        """
        Encrypt a message using current chain key.

        Each message gets a unique key (Double Ratchet).
        """
        # Derive message key from chain key
        message_key, new_chain_key = self._ratchet_chain(
            self.session.chain_key
        )
        # Encrypt with message key
        ciphertext = self._aes_encrypt(message_key, plaintext)
        # Update chain key (ratchet forward)
        message_number = self.session.message_number
        self.session.chain_key = new_chain_key
        self.session.message_number += 1
        return ciphertext, message_number

    def decrypt(self, ciphertext: bytes, message_number: int) -> bytes:
        """
        Decrypt a message.

        May need to skip ahead if messages arrive out of order.
        """
        # Derive the message key for this message number
        message_key = self._get_message_key(message_number)
        # Decrypt
        plaintext = self._aes_decrypt(message_key, ciphertext)
        return plaintext

    def _ratchet_chain(self, chain_key: bytes) -> Tuple[bytes, bytes]:
        """
        Symmetric ratchet: derive next keys from current.

        This is the "forward secrecy" mechanism: old keys
        can't be derived from new keys.
        """
        # Message key for current message
        message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
        # New chain key for next message
        new_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
        return message_key, new_chain_key


class GroupEncryption:
    """
    Group messaging uses a different approach.

    Instead of pairwise encryption, sender distributes
    a "sender key" to all group members.
    """

    def __init__(self, group_id: str, my_user_id: str):
        self.group_id = group_id
        self.my_user_id = my_user_id
        # Generate my sender key for this group
        self.my_sender_key = self._generate_sender_key()
        # Store other members' sender keys
        self.member_sender_keys = {}

    def distribute_sender_key(self, member_sessions: dict):
        """
        Send my sender key to all group members.

        Uses pairwise Signal sessions for secure distribution.
        """
        for member_id, session in member_sessions.items():
            # Encrypt sender key using pairwise session
            encrypted_key = session.encrypt(self.my_sender_key)
            # Send to member via normal message path
            yield member_id, encrypted_key

    def encrypt_for_group(self, plaintext: bytes) -> bytes:
        """
        Encrypt message using my sender key.

        All group members who have my sender key can decrypt.
        Much more efficient than pairwise encryption.
        """
        # Use sender key with chain ratchet
        message_key = self._derive_message_key()
        ciphertext = self._aes_encrypt(message_key, plaintext)
        # Ratchet forward
        self._ratchet_sender_key()
        return ciphertext

    def handle_member_left(self, member_id: str):
        """
        When someone leaves, all sender keys must be regenerated.

        This ensures the departed member can't read future messages.
        """
        # Generate new sender key
        self.my_sender_key = self._generate_sender_key()
        # Redistribute to remaining members
        # (handled by caller)
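The listing above leaves `_get_message_key` undefined, which is where out-of-order delivery has to be handled. The following runnable sketch shows one way the receiving side of the symmetric ratchet could work, using the same HMAC-SHA256 derivation as `_ratchet_chain`. `ReceivingChain`, `max_skip`, and the demo secret are illustrative assumptions, and actual encryption is omitted.

```python
import hashlib
import hmac
from typing import Dict, Tuple


def kdf_chain(chain_key: bytes) -> Tuple[bytes, bytes]:
    """One ratchet step: returns (message_key, next_chain_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


class ReceivingChain:
    """Derives per-message keys, caching keys for messages that were skipped."""

    def __init__(self, chain_key: bytes, max_skip: int = 1000) -> None:
        self.chain_key = chain_key
        self.next_number = 0
        self.skipped: Dict[int, bytes] = {}
        self.max_skip = max_skip  # bounds the work a malicious sender can force

    def message_key(self, number: int) -> bytes:
        if number < self.next_number:
            # Already ratcheted past this message: its key is cached, use it once.
            try:
                return self.skipped.pop(number)
            except KeyError:
                raise ValueError("message key already used or unknown") from None
        if number - self.next_number > self.max_skip:
            raise ValueError("too many skipped messages")
        # Ratchet forward, remembering the keys of messages not yet received.
        while self.next_number < number:
            key, self.chain_key = kdf_chain(self.chain_key)
            self.skipped[self.next_number] = key
            self.next_number += 1
        key, self.chain_key = kdf_chain(self.chain_key)
        self.next_number += 1
        return key


# Demo: the sender derives keys 0, 1, 2 in order; messages arrive as 2, 0, 1.
initial_chain_key = hashlib.sha256(b"demo shared secret").digest()
sender_chain, sender_keys = initial_chain_key, []
for _ in range(3):
    key, sender_chain = kdf_chain(sender_chain)
    sender_keys.append(key)

receiver = ReceivingChain(initial_chain_key)
received = [receiver.message_key(n) for n in (2, 0, 1)]
print(received == [sender_keys[2], sender_keys[0], sender_keys[1]])
```

Each cached key is removed as soon as it is used, so a replayed message number fails instead of decrypting twice.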
What WhatsApp CAN See (Metadata):
ENCRYPTION SCOPE
────────────────────────────────────────────────────────────────────────
  ENCRYPTED (WhatsApp cannot read):
  • Message content
  • Media files (photos, videos, documents)
  • Voice notes
  • Voice and video call content
  • Status updates

  NOT ENCRYPTED (WhatsApp can see):
  • Who messaged whom (sender/recipient)
  • When messages were sent
  • Message size
  • Device information
  • Phone numbers
  • Profile photos
  • Group membership
  • Online/offline status

  This metadata is necessary for routing but reveals communication
  patterns even if content is secret.
────────────────────────────────────────────────────────────────────────
Deep Dive 3: Offline Message Handling
Week 3 concepts: Message queuing, delivery guarantees.
You: "Most users aren't online all the time. Offline message storage is critical."
OFFLINE MESSAGE FLOW
────────────────────────────────────────────────────────────────────────
  Alice sends message to offline Bob

  Participants: Alice, WhatsApp, Bob (offline)

  1. Send (encrypted) (Alice → WhatsApp)
  2. ACK (single ✓) (WhatsApp → Alice)
     "Message received by server"
  3. Store in Offline Queue (WhatsApp)
     Bob's Queue: Msg 1, Msg 2, ...

  ... TIME PASSES (Bob offline) ...

  4. Bob comes online (Bob → WhatsApp)
     TCP connection
  5. Drain queue (WhatsApp → Bob)
     Send all pending msgs
  6. Delivery ACK (Bob → WhatsApp)
  7. Delivered (✓✓) (WhatsApp → Alice)
  8. Delete from queue (delivered) (WhatsApp)
────────────────────────────────────────────────────────────────────────
# offline/message_store.py
"""
Offline Message Storage
WhatsApp's approach:
- Messages stored only until delivered
- After delivery confirmation, deleted from server
- This minimizes server storage and privacy exposure
"""
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, timedelta
@dataclass
class QueuedMessage:
"""A message waiting for delivery."""
message_id: str
sender_id: str
recipient_id: str
encrypted_content: bytes
timestamp: datetime
expiry: datetime # Messages expire after 30 days
# Delivery tracking
delivery_attempts: int = 0
last_attempt: Optional[datetime] = None
class OfflineMessageStore:
"""
Stores messages for offline users.
Design principles:
- In-memory for speed (Mnesia ETS tables)
- Replicated for durability
- Partitioned by recipient for scalability
- Messages deleted after delivery
"""
def __init__(self, db, max_queue_size: int = 10000):
self.db = db # Mnesia in actual WhatsApp
self.max_queue_size = max_queue_size
# TTL for stored messages
self.message_ttl = timedelta(days=30)
async def queue_message(
self,
sender_id: str,
recipient_id: str,
encrypted_content: bytes
) -> QueuedMessage:
"""
Store message for offline recipient.
"""
message = QueuedMessage(
message_id=generate_id(),
sender_id=sender_id,
recipient_id=recipient_id,
encrypted_content=encrypted_content,
timestamp=datetime.utcnow(),
expiry=datetime.utcnow() + self.message_ttl
)
# Check queue size limits
queue_size = await self.db.count_for_recipient(recipient_id)
if queue_size >= self.max_queue_size:
# Drop oldest message (or reject new one)
await self.db.delete_oldest(recipient_id)
await self.db.save(message)
return message
async def get_pending_messages(
self,
recipient_id: str
) -> List[QueuedMessage]:
"""
Get all pending messages for a user who just came online.
Called when user establishes connection.
"""
messages = await self.db.get_by_recipient(
recipient_id,
order_by="timestamp",
limit=1000 # Batch size
)
# Filter expired messages
now = datetime.utcnow()
valid_messages = [m for m in messages if m.expiry > now]
return valid_messages
async def mark_delivered(
self,
message_ids: List[str]
):
"""
Delete messages that have been delivered.
This is called after receiving delivery ACK from recipient.
"""
for message_id in message_ids:
await self.db.delete(message_id)
async def cleanup_expired(self):
"""
Background job: Delete expired messages.
Runs periodically to clean up messages that
were never delivered (user never came back online).
"""
now = datetime.utcnow()
expired = await self.db.get_expired(now)
for message in expired:
await self.db.delete(message.message_id)
# Optionally notify sender that message expired
# (WhatsApp doesn't do this)
class WriteBackCache:
    """
    WhatsApp uses write-back caching for offline messages.
    - Messages stored in memory first (fast)
    - Asynchronously persisted to disk
    - Handles disk I/O slowdowns gracefully
    """

    def __init__(self, memory_store, disk_store):
        self.memory = memory_store  # ETS table
        self.disk = disk_store  # Mnesia disk table
        self.pending_writes = []

    async def store(self, message: QueuedMessage):
        """
        Store in memory immediately, persist async.
        """
        # Immediate: memory storage
        await self.memory.put(message.message_id, message)
        # Async: queue for disk persistence
        self.pending_writes.append(message)
        # Trigger background write (non-blocking)
        asyncio.create_task(self._flush_writes())

    async def _flush_writes(self):
        """
        Background: persist to disk.
        Batched for efficiency.
        """
        # Only full batches are written here. A partial batch stays in
        # memory until more messages arrive, so a production version
        # would also need to flush on a timer and on shutdown.
        if len(self.pending_writes) >= 100:
            # Detach the batch before awaiting so concurrent flush tasks
            # never write the same messages twice.
            batch = self.pending_writes[:100]
            self.pending_writes = self.pending_writes[100:]
            await self.disk.batch_write(batch)
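The write-back idea can be exercised end to end with plain dictionaries standing in for the memory and disk tables. This is a minimal sketch under stated assumptions, not WhatsApp's implementation: `DictStore`, `WriteBackSketch`, `batch_size`, and `flush_all` are hypothetical names introduced here, and the explicit `flush_all` call stands in for the timer a real system would likely use to drain partial batches.

```python
import asyncio
from typing import Dict, List, Tuple


class DictStore:
    """Hypothetical stand-in for an in-memory table or a disk table."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.batch_calls = 0

    async def put(self, key: str, value: str) -> None:
        self.data[key] = value

    async def batch_write(self, items: List[Tuple[str, str]]) -> None:
        self.batch_calls += 1
        for key, value in items:
            self.data[key] = value


class WriteBackSketch:
    """Acknowledge after the memory write; persist to disk in batches."""

    def __init__(self, memory: DictStore, disk: DictStore, batch_size: int = 100) -> None:
        self.memory = memory
        self.disk = disk
        self.batch_size = batch_size
        self.pending: List[Tuple[str, str]] = []

    async def store(self, key: str, value: str) -> None:
        await self.memory.put(key, value)  # fast path
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            await self._flush_batch()

    async def _flush_batch(self) -> None:
        # Detach the batch before awaiting so later store() calls
        # append to a fresh list rather than one that is mid-write.
        batch = self.pending[: self.batch_size]
        self.pending = self.pending[self.batch_size:]
        if batch:
            await self.disk.batch_write(batch)

    async def flush_all(self) -> None:
        """Drain partial batches (assumed to run on a timer in practice)."""
        while self.pending:
            await self._flush_batch()


async def demo() -> Tuple[int, int, int]:
    memory, disk = DictStore(), DictStore()
    cache = WriteBackSketch(memory, disk, batch_size=100)
    for i in range(250):
        await cache.store(f"msg-{i}", f"payload-{i}")
    on_disk_before = len(disk.data)  # only the full batches so far
    await cache.flush_all()          # drain the remainder
    return on_disk_before, len(disk.data), disk.batch_calls
```

Running `asyncio.run(demo())` shows the trade-off: until the final flush, the newest messages exist only in memory, which is the durability window a write-back design accepts in exchange for fast acknowledgements.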
Deep Dive 4: Presence and Typing Indicators
Week 4 concepts: Caching, real-time updates.
You: "Presence is the 'chattiest' part of the system, with far more traffic than messages."
PRESENCE TRAFFIC ANALYSIS
For every message sent, there are ~10-100 presence events:
├── User opens app (online)
├── User starts typing (typing indicator)
├── User stops typing
├── User reads messages (last seen update)
├── User closes app (offline)
└── Repeated for every chat participant watching
With 500M daily active users:
├── Messages: ~100 billion/day
├── Presence events: ~1 trillion/day
└── This is 10x the message traffic!
Presence CANNOT go through the same path as messages.
It needs aggressive optimization.
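To make the daily totals above concrete, the following back-of-the-envelope calculation converts them to average per-second rates. It uses only the figures quoted in the analysis; the helper name `per_second` is introduced here for illustration, and real peak rates would sit well above these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: float) -> float:
    """Average event rate, assuming traffic is spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


messages_per_day = 100e9   # ~100 billion (figure from the analysis above)
presence_per_day = 1e12    # ~1 trillion (figure from the analysis above)

message_rate = per_second(messages_per_day)
presence_rate = per_second(presence_per_day)
ratio = presence_per_day / messages_per_day

print(f"messages: {message_rate:,.0f}/s")
print(f"presence: {presence_rate:,.0f}/s")
print(f"ratio:    {ratio:.0f}x")
```

On average that is roughly 1.2 million messages per second against roughly 11.6 million presence events per second, which is why presence gets its own lightweight path.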
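# presence/manager.py
"""
Presence Management
Optimizations WhatsApp uses:
1. Batching: Aggregate multiple presence updates
2. Rate limiting: Don't send typing every keystroke
3. TTL: Presence expires automatically
4. Selective delivery: Only send to active chats
"""
import json
from dataclasses import dataclass
from typing import Dict, Set, Optional
from datetime import datetime, timedelta
from enum import Enum


class PresenceState(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    TYPING = "typing"


@dataclass
class UserPresence:
    """Current presence state for a user."""
    user_id: str
    state: PresenceState
    last_seen: datetime
    typing_in_chat: Optional[str] = None

    def to_json(self) -> str:
        """Serialize for storage in Redis."""
        return json.dumps({
            "user_id": self.user_id,
            "state": self.state.value,
            "last_seen": self.last_seen.isoformat(),
            "typing_in_chat": self.typing_in_chat,
        })

    @classmethod
    def from_json(cls, data) -> "UserPresence":
        """Rebuild a UserPresence from its stored JSON form."""
        raw = json.loads(data)
        return cls(
            user_id=raw["user_id"],
            state=PresenceState(raw["state"]),
            last_seen=datetime.fromisoformat(raw["last_seen"]),
            typing_in_chat=raw.get("typing_in_chat"),
        )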
class PresenceManager:
    """
    Manages presence state and notifications.
    Key optimizations:
    1. In-memory only (no persistence needed)
    2. TTL-based expiry (online expires in 60s)
    3. Rate limiting (typing indicator throttled)
    4. Subscription-based (only notify interested parties)
    """

    def __init__(self, redis_cluster, notification_service):
        self.redis = redis_cluster
        self.notifier = notification_service
        # TTLs
        self.online_ttl = timedelta(seconds=60)
        self.typing_ttl = timedelta(seconds=5)

    async def set_online(self, user_id: str):
        """
        Mark user as online.
        Uses Redis with TTL: the key expires on its own, which reads as offline.
        The client must call this again before the TTL lapses to stay online.
        """
        presence = UserPresence(
            user_id=user_id,
            state=PresenceState.ONLINE,
            last_seen=datetime.utcnow()
        )
        # Store with TTL (auto-expires to offline)
        await self.redis.set(
            f"presence:{user_id}",
            presence.to_json(),
            ex=int(self.online_ttl.total_seconds())
        )
        # Notify subscribers
        await self._notify_subscribers(user_id, presence)
    async def set_typing(self, user_id: str, chat_id: str):
        """
        Mark user as typing in a specific chat.
        Rate limited: max 1 update per second per chat.
        """
        # Rate limit check. SET with NX only succeeds when the marker does
        # not exist yet, so the check and the 1-second marker are a single
        # atomic command (a separate EXISTS + SET could race).
        rate_key = f"typing_rate:{user_id}:{chat_id}"
        if not await self.redis.set(rate_key, "1", ex=1, nx=True):
            return  # Already sent recently, skip
        # Update typing state
        await self.redis.set(
            f"typing:{user_id}:{chat_id}",
            "1",
            ex=int(self.typing_ttl.total_seconds())
        )
        # Notify only the other participant(s) of this chat
        # (_notify_chat_participants is a helper not shown here)
        await self._notify_chat_participants(
            chat_id,
            user_id,
            PresenceState.TYPING
        )
    async def get_presence(self, user_id: str) -> Optional[UserPresence]:
        """
        Get current presence for a user.
        Used when opening a chat to show current status.
        """
        data = await self.redis.get(f"presence:{user_id}")
        if data:
            return UserPresence.from_json(data)
        else:
            # No presence record = offline
            # Get last seen from persistent storage
            # (_get_last_seen is a helper not shown here)
            last_seen = await self._get_last_seen(user_id)
            return UserPresence(
                user_id=user_id,
                state=PresenceState.OFFLINE,
                last_seen=last_seen
            )

    async def _notify_subscribers(
        self,
        user_id: str,
        presence: UserPresence
    ):
        """
        Notify users who care about this user's presence.
        Optimization: Only notify users who have this user's
        chat currently open (not all contacts).
        """
        # Get list of users with this chat open
        subscribers = await self.redis.smembers(
            f"presence_subscribers:{user_id}"
        )
        for subscriber_id in subscribers:
            await self.notifier.send_presence_update(
                subscriber_id,
                presence
            )

    async def subscribe_to_presence(
        self,
        subscriber_id: str,
        target_user_id: str
    ):
        """
        Subscribe to presence updates for a user.
        Called when user opens a chat.
        """
        await self.redis.sadd(
            f"presence_subscribers:{target_user_id}",
            subscriber_id
        )
        # Auto-expire after 10 minutes. Note that the TTL applies to the
        # whole subscriber set and is refreshed by every new subscription,
        # so the set disappears 10 minutes after the last subscribe.
        await self.redis.expire(
            f"presence_subscribers:{target_user_id}",
            600
        )

    async def unsubscribe_from_presence(
        self,
        subscriber_id: str,
        target_user_id: str
    ):
        """
        Unsubscribe from presence updates.
        Called when user closes a chat.
        """
        await self.redis.srem(
            f"presence_subscribers:{target_user_id}",
            subscriber_id
        )
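The two core presence mechanisms, TTL expiry and typing throttling, can also be tried without Redis. The sketch below is an in-memory approximation under stated assumptions: `TtlPresence`, `heartbeat`, `should_send_typing`, and the injectable `clock` parameter are hypothetical names chosen for illustration, and a single-process dictionary stands in for the shared store a real deployment would need.

```python
import time
from typing import Callable, Dict, Tuple


class TtlPresence:
    """In-memory presence with TTL expiry and typing throttling (sketch)."""

    def __init__(
        self,
        online_ttl: float = 60.0,
        typing_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.online_ttl = online_ttl
        self.typing_interval = typing_interval
        self.clock = clock
        self._online_until: Dict[str, float] = {}
        self._last_typing: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, user_id: str) -> None:
        # Each heartbeat pushes the expiry deadline forward.
        self._online_until[user_id] = self.clock() + self.online_ttl

    def is_online(self, user_id: str) -> bool:
        # No explicit "offline" write is needed: a missed heartbeat
        # simply lets the deadline pass.
        deadline = self._online_until.get(user_id)
        return deadline is not None and self.clock() < deadline

    def should_send_typing(self, user_id: str, chat_id: str) -> bool:
        # Allow at most one typing event per interval per (user, chat).
        now = self.clock()
        key = (user_id, chat_id)
        last = self._last_typing.get(key)
        if last is not None and now - last < self.typing_interval:
            return False
        self._last_typing[key] = now
        return True
```

Passing a fake clock (for example `clock=lambda: t[0]` with a mutable list `t`) makes expiry and throttling deterministic in tests, which is the main reason the clock is injectable here.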
Phase 5: Scaling and Edge Cases
Interviewer: "WhatsApp grew from 0 to 500 million users in 5 years. How did they scale without rewriting everything?"
You: "Erlang's design made horizontal scaling almost automatic. But they still faced challenges..."
WhatsApp's Scaling Journey
WHATSAPP SCALING TIMELINE
2009: Launch
├── Few servers
├── Basic XMPP protocol
└── Thousands of users
2011: 1 million users
├── Custom Erlang backend
├── Modified ejabberd
└── ~10 servers
2013: 200 million users
├── 2 million connections per server achieved
├── Custom BEAM patches
└── ~100 servers
2014 (acquisition): 500 million users
├── 50 billion messages/day
├── 550 servers
└── 32 backend engineers
2020: 2 billion users
├── 100+ billion messages/day
├── Thousands of servers
└── Still running on Erlang
KEY INSIGHT:
The architecture scaled 2000x (1M → 2B users) with
the same fundamental design. That's the power of
choosing the right tool for the job.
Edge Cases
EDGE CASE 1: Celebrity with 10M Followers Posts Status
Problem: Status update needs to reach 10 million people
Impact: Massive fan-out, could overwhelm system
Solution:
├── Status updates are pull-based, not push
├── User's device polls for updates when opening status tab
├── Updates cached at edge servers
├── Same update served to millions from cache
└── No per-follower push notification
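The pull-plus-cache approach in Edge Case 1 can be sketched with a read-through cache in front of an origin store. Everything named here (`StatusOrigin`, `EdgeCache`, `simulate`) is hypothetical and greatly simplified: there is one edge node, no TTL, and invalidation is a manual call.

```python
from typing import Dict, Optional


class StatusOrigin:
    """Hypothetical origin store holding each author's latest status."""

    def __init__(self) -> None:
        self._latest: Dict[str, str] = {}
        self.reads = 0  # how often the origin is actually hit

    def post(self, author_id: str, content: str) -> None:
        # One write, no matter how many followers the author has.
        self._latest[author_id] = content

    def read(self, author_id: str) -> Optional[str]:
        self.reads += 1
        return self._latest.get(author_id)


class EdgeCache:
    """Read-through cache in front of the origin (no TTL in this sketch)."""

    def __init__(self, origin: StatusOrigin) -> None:
        self.origin = origin
        self._cache: Dict[str, str] = {}

    def fetch(self, author_id: str) -> Optional[str]:
        # Called when a viewer opens the status tab (pull, not push).
        if author_id not in self._cache:
            content = self.origin.read(author_id)
            if content is None:
                return None
            self._cache[author_id] = content
        return self._cache[author_id]

    def invalidate(self, author_id: str) -> None:
        # Called when the author posts again.
        self._cache.pop(author_id, None)


def simulate(viewers: int) -> int:
    """Return how many origin reads it takes to serve `viewers` pulls."""
    origin = StatusOrigin()
    edge = EdgeCache(origin)
    origin.post("celebrity", "hello world")
    for _ in range(viewers):
        edge.fetch("celebrity")
    return origin.reads
```

With this layout the cost of a post is independent of audience size: `simulate(10_000)` touches the origin once, and every other viewer is served from the cache.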
EDGE CASE 2: Group with 1000 Members
Problem: Each message needs 1000 deliveries
Impact: 1 message = 1000x amplification
Solution:
├── Sender Key protocol (not pairwise encryption)
├── Message encrypted once, decrypted by all members
├── Group metadata (membership) changes trigger rekey
├── Stagger delivery to avoid thundering herd
└── Members who leave can't read future messages
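The "stagger delivery" step in Edge Case 2 can be illustrated by splitting the member list into batches and giving each batch a later start offset. This is only a sketch of the scheduling idea: `stagger_schedule`, `batch_size`, and `gap_ms` are hypothetical names and values, not parameters from WhatsApp.

```python
from typing import List, Tuple


def stagger_schedule(
    member_ids: List[str],
    batch_size: int = 100,
    gap_ms: int = 50,
) -> List[Tuple[int, List[str]]]:
    """Return (start_offset_ms, members) pairs, one per delivery batch."""
    schedule = []
    for start in range(0, len(member_ids), batch_size):
        batch_index = start // batch_size
        schedule.append(
            (batch_index * gap_ms, member_ids[start:start + batch_size])
        )
    return schedule


members = [f"user-{i}" for i in range(1000)]
plan = stagger_schedule(members)
print(len(plan), "batches, last starts at", plan[-1][0], "ms")
```

Spreading 1000 deliveries over ten small bursts keeps any single instant from carrying the full fan-out, at the cost of a few hundred milliseconds of extra latency for the last batch.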
EDGE CASE 3: Network Partition
Problem: User connected to Server A, recipient on Server B
         Network between A and B fails
Solution:
├── Messages queued at Server A
├── Periodic retry to deliver to Server B
├── If partition heals, messages delivered
├── If partition persists, messages expire after 30 days
└── User notified of delivery failures after timeout
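The retry-until-expiry behavior in Edge Case 3 can be sketched as a generator of wait times. Exponential backoff with a cap is an assumption made here for illustration (the text above only says "periodic retry"), and `retry_delays`, `base_s`, `cap_s`, and `ttl_s` are hypothetical names; only the 30-day expiry comes from the text.

```python
from itertools import islice
from typing import Iterator

THIRTY_DAYS_S = 30 * 24 * 60 * 60


def retry_delays(
    base_s: float = 1.0,
    cap_s: float = 300.0,
    ttl_s: float = THIRTY_DAYS_S,
) -> Iterator[float]:
    """Yield waits between delivery attempts until the message would expire."""
    elapsed = 0.0
    attempt = 0
    while True:
        # Clamp the exponent so the power never grows without bound.
        delay = min(cap_s, base_s * 2 ** min(attempt, 30))
        if elapsed + delay > ttl_s:
            return  # the next retry would land after expiry, so stop
        elapsed += delay
        attempt += 1
        yield delay


first_five = list(islice(retry_delays(), 5))
short_lived = list(retry_delays(ttl_s=10))
print(first_five)
print(short_lived)
```

A production scheduler would normally add random jitter to each delay so that many queued messages do not retry in lockstep once the partition heals.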
EDGE CASE 4: Device Goes Underwater (Literally)
Problem: User's phone destroyed, never comes back online
Impact: Their offline queue grows forever
Solution:
├── Messages expire after 30 days
├── Expired messages deleted from server
├── Sender never gets "delivered" status
├── No infinite queue growth
└── New device = new session, old messages lost
Phase 6: Why 50 Engineers?
You: "The small team is the most impressive part. Here's why it worked..."
WHY 50 ENGINEERS COULD BUILD WHATSAPP

1. TECHNOLOGY CHOICES REDUCED COMPLEXITY

Erlang's actor model:
├── No mutex/lock management (huge complexity source)
├── Built-in distribution (no Kafka/RabbitMQ to manage)
├── Hot code reloading (no complex deployment pipelines)
└── Supervision trees (automatic failure recovery)

FreeBSD:
├── Simpler than Linux (one distribution)
├── Better network stack for their use case
└── Team had deep expertise (ex-Yahoo)

2. FOCUS REDUCED SCOPE

What WhatsApp DIDN'T build:
├── Advertising system (no ads = no ad tech team)
├── Social feed (no feed ranking team)
├── Recommendation engine (no ML team)
├── Third-party integrations (no platform team)
├── Multiple products (no product sprawl)
└── Enterprise features (initially)

Every feature NOT built = engineers NOT needed

3. COMMUNICATION OVERHEAD IS QUADRATIC

Team of 50: 50 × 49 / 2 = 1,225 communication paths
Team of 500: 500 × 499 / 2 = 124,750 paths

Larger teams spend more time coordinating than coding.
Small teams move FASTER for well-scoped problems.

4. DEEP EXPERTISE > BROAD KNOWLEDGE

Every engineer knew:
├── The entire codebase
├── Erlang deeply (not just superficially)
├── FreeBSD kernel internals
└── The complete message flow

No "that's not my service" mentality.
Anyone could debug any issue.

5. CODE REVIEW BY FOUNDERS

In early days, Jan Koum reviewed every line of backend code.
This ensured:
├── Consistent style
├── No unnecessary complexity
├── Deep understanding by leadership
└── Quality over velocity

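The quadratic-overhead point (item 3 above) follows from counting pairs: a team of n people has n(n-1)/2 possible one-to-one communication paths. A short check of the two figures quoted, with `communication_paths` as a helper name introduced here:

```python
def communication_paths(team_size: int) -> int:
    """Number of distinct pairs in a team: n * (n - 1) / 2."""
    return team_size * (team_size - 1) // 2


small = communication_paths(50)
large = communication_paths(500)
growth = large / small

print(small, large, round(growth, 1))
```

A team ten times larger ends up with roughly a hundred times as many paths, which is the sense in which coordination cost grows quadratically.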
Interview Conclusion
Interviewer: "Excellent coverage. Quick questions:"
Interviewer: "If you were starting today, would you still use Erlang?"
You: "It depends. Erlang is still excellent for this use case, but:
Advantages of Erlang:
- Perfect for massive concurrency
- Battle-tested at WhatsApp scale
- 'Let it crash' philosophy reduces code complexity
Disadvantages:
- Smaller talent pool (hard to hire)
- Less library ecosystem than Go/Rust
- Learning curve for new engineers
Modern alternatives:
- Go: Good concurrency, larger talent pool
- Rust: Performance, growing ecosystem
- Elixir: Erlang VM with nicer syntax
For a startup, I'd probably choose Go for the talent pool. For WhatsApp's specific requirements, Erlang is still hard to beat."
Interviewer: "What's the hardest part of building a messaging system?"
You: "Three things:
- Exactly-once delivery semantics: Messages should arrive once, in order, or clearly fail. Network unreliability makes this hard; in practice it is approximated with at-least-once delivery plus deduplication on the client.
- End-to-end encryption at scale: Key management, device verification, group re-keying. Most teams underestimate this complexity.
- Presence at scale: The 'chattiest' part of the system. Easy to ignore, hard to scale. Most outages are presence-related."
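The first point, delivering each message once and in order over an unreliable network, is commonly approached by letting the server redeliver and having the client discard duplicates and reorder. The sketch below assumes each message carries a unique ID and a per-conversation sequence number starting at 1; `ConversationInbox` and its fields are hypothetical names, not part of any WhatsApp API.

```python
from typing import Dict, List, Set


class ConversationInbox:
    """Client-side inbox for one conversation (illustrative sketch).

    Drops redelivered duplicates and releases messages in sequence order.
    """

    def __init__(self) -> None:
        self.next_seq = 1                 # next sequence number to release
        self.buffer: Dict[int, str] = {}  # out-of-order messages held back
        self.seen_ids: Set[str] = set()   # unbounded here; real clients prune

    def receive(self, message_id: str, seq: int, body: str) -> List[str]:
        """Accept one delivery; return the messages now ready to display."""
        if message_id in self.seen_ids:
            return []  # duplicate from an at-least-once retry
        self.seen_ids.add(message_id)
        self.buffer[seq] = body
        released = []
        # Release the longest run of consecutive sequence numbers.
        while self.next_seq in self.buffer:
            released.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return released
```

For example, if message 2 arrives before message 1, nothing is shown until message 1 lands, at which point both are released in order; a later retry of message 1 is ignored.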
Summary: Concepts Applied from 10-Week Course

CONCEPTS FROM 10-WEEK COURSE IN WHATSAPP DESIGN

WEEK 1: DATA AT SCALE
├── Partitioning: Users partitioned across "islands"
├── Replication: Each island has a replica
└── In-memory storage: ETS/Mnesia for speed

WEEK 2: FAILURE-FIRST DESIGN
├── Let it crash: Erlang supervision trees
├── Process isolation: One crash doesn't affect others
├── Circuit breakers: Island isolation
└── Retry: Message redelivery for transient failures

WEEK 3: MESSAGING & ASYNC
├── Offline queues: Messages stored until delivered
├── At-least-once delivery: With deduplication at client
└── Backpressure: Queue size limits

WEEK 4: CACHING
├── Presence caching: In-memory with TTL
├── Write-back cache: For offline message persistence
└── Edge caching: Media and status updates

WEEK 5: CONSISTENCY
├── Eventual consistency: Presence can be stale
├── At-least-once: Messages, with client dedup
└── Message ordering: Per-conversation, not global

WEEK 9: SECURITY
├── End-to-end encryption: Signal Protocol
├── Forward secrecy: Double Ratchet algorithm
├── Key management: X3DH key agreement
└── Zero-knowledge: Server can't read messages

WEEK 10: OPERATIONS
├── Hot code reloading: Deploy without restart
├── Observability: Per-process stats in BEAM
└── Small team: Simplicity enables velocity

Why WhatsApp Matters

WHY WHATSAPP IS AN ENGINEERING MARVEL

EFFICIENCY
• 2 million connections per server (20x industry norm)
• 50 billion messages/day with 32 engineers
• $19 billion valuation with 55 employees

RELIABILITY
• 99.99%+ uptime
• Messages reliably delivered worldwide
• Works on 2G networks in developing countries

PRIVACY
• End-to-end encryption by default
• Even WhatsApp can't read your messages
• Signal Protocol: Gold standard for secure messaging

SIMPLICITY
• No ads, no social feed, no distractions
• One feature done exceptionally well
• Focus enabled small team to move fast

"The best systems are simple enough that a small team can
understand them completely, yet powerful enough to serve
billions of users."

WhatsApp proved that architecture beats headcount.

Self-Assessment Checklist
After studying this case study, you should be able to:
Architecture:
- Explain why Erlang is well-suited for messaging
- Design a connection handling system for millions of users
- Implement offline message queuing and delivery
Distributed Systems:
- Apply the "let it crash" philosophy
- Design supervision trees for failure recovery
- Handle network partitions gracefully
Security:
- Explain Signal Protocol at a high level
- Understand forward secrecy and its importance
- Design key exchange for end-to-end encryption
Performance:
- Optimize presence/typing for high frequency updates
- Use TTL-based expiry for ephemeral data
- Apply batching and rate limiting
Team Efficiency:
- Understand how technology choices affect team size
- Value simplicity over cleverness
- Focus on doing one thing well
Sources
Architecture and Scale:
- ByteByteGo - How WhatsApp Handles 40 Billion Messages: https://blog.bytebytego.com/p/how-whatsapp-handles-40-billion-messages
- GetStream - How WhatsApp Works Architecture: https://getstream.io/blog/whatsapp-works/
- High Scalability - WhatsApp Architecture Facebook Bought: https://highscalability.com/the-whatsapp-architecture-facebook-bought-for-19-billion/
- System Design One - 8 Reasons WhatsApp Scaled with 32 Engineers: https://newsletter.systemdesign.one/p/whatsapp-engineering
- Talent500 - How WhatsApp Powers 40 Billion Messages: https://talent500.com/blog/whatsapp-scalable-messaging-architecture/
- CometChat - WhatsApp Architecture and System Design: https://www.cometchat.com/blog/whatsapps-architecture-and-system-design
Erlang and FreeBSD:
- FreeBSD Foundation - WhatsApp Testimonial: https://freebsdfoundation.org/testimonial/whatsapp/
- FreeBSD Forums - WhatsApp on FreeBSD Discussion: https://forums.freebsd.org/threads/whatsapp-on-freebsd.87908/
- Erlang Factory - Rick Reed Speaker Profile: http://www.erlang-factory.com/conference/SFBay2012/speakers/RickReed
- Medium - WhatsApp Engineering Inside: https://medium.com/codingurukul/whatsapp-engineering-inside-2-bdd1ec354748
- Medium - WhatsApp's Billion-User Database with FreeBSD and Erlang: https://medium.com/@yashbatra11111/whatsapps-billion-user-database-how-freebsd-and-erlang-handled-the-impossible-5e699f7f078d
- SlideShare - WhatsApp's Architecture: https://www.slideshare.net/slideshow/whatsapps-architecture/57522146
End-to-End Encryption:
- Signal.org - WhatsApp Integration Complete: https://signal.org/blog/whatsapp-complete/
- Wikipedia - Signal Protocol: https://en.wikipedia.org/wiki/Signal_Protocol
- WhatsApp FAQ - About End-to-End Encryption: https://faq.whatsapp.com/820124435853543
- Gupshup - WhatsApp E2E Encryption Guide: https://www.gupshup.ai/resources/blog/whatsapp-end-to-end-encryption/
- Requestly Blog - WhatsApp Chat Security with E2EE: https://requestly.com/blog/how-whatsapp-ensures-chat-security-with-end-to-end-encryption/
- SDERay - Signal Protocol Engineering Deep Dive: https://sderay.com/how-end-to-end-encryption-works-the-engineering-behind-signal-whatsapp-security/
- Medium - WhatsApp E2E Encryption How It Works: https://medium.com/@panghalamit/whatsapp-s-end-to-end-encryption-how-does-it-work-80020977caa0
- MIT Course Paper - WhatsApp Security Analysis: https://courses.csail.mit.edu/6.857/2016/files/36.pdf
System Design Resources:
- GeeksforGeeks - How WhatsApp Handles 50 Billion Messages: https://www.geeksforgeeks.org/system-design/how-whatsapp-handles-50-billion-messages-a-day/
- Medium - WhatsApp System Design Complete Architecture: https://medium.com/@yadavsatale/whatsapp-system-design-a-complete-architecture-deep-dive-8949f8d4eb2b
- TRTC - Definitive Guide to WhatsApp System Design: https://trtc.io/blog/details/whatsapp-system
- Medium - System Design of WhatsApp for Android: https://medium.com/@YodgorbekKomilo/the-system-design-of-whatsapp-for-android-behind-the-scenes-of-a-global-messaging-giant-c80175b18016
- Medium - Unpacking WhatsApp's System Design: https://medium.com/@lovejot.singh/unpacking-whatsapps-system-design-the-power-behind-your-messages-316280b38f78
Further Reading
Official Documentation:
- WhatsApp Security Whitepaper: https://www.whatsapp.com/security/WhatsApp-Security-Whitepaper.pdf
- Signal Protocol Specifications: https://signal.org/docs/
- WhatsApp Engineering Blog: https://engineering.fb.com/ (Search for WhatsApp posts)
Conference Talks (Highly Recommended):
- Rick Reed - "Scaling to Millions of Simultaneous Connections" (Erlang Factory 2012)
- YouTube: Search "Rick Reed WhatsApp Erlang"
- Rick Reed - "That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp" (Erlang Factory 2014)
- Details on achieving 2M+ connections per server
- Anton Lavrik - "A Reflection on Building the WhatsApp Server" (Code BEAM 2018)
- Post-Facebook acquisition insights
- Eugene Fooksman - "WhatsApp System Design" (Various conferences)
Engineering Blogs:
- High Scalability Blog: https://highscalability.com/ (Multiple WhatsApp articles)
- ByteByteGo Newsletter: https://blog.bytebytego.com/ (System design breakdowns)
- System Design One: https://newsletter.systemdesign.one/ (Engineering case studies)
- The Pragmatic Engineer: https://newsletter.pragmaticengineer.com/ (Tech industry insights)
Erlang/BEAM Resources:
- Learn You Some Erlang: https://learnyousomeerlang.com/ (Free online book)
- Erlang.org Documentation: https://www.erlang.org/doc/
- BEAM Book: https://blog.stenmans.org/theBeamBook/ (BEAM VM internals)
- Elixir Forum: https://elixirforum.com/ (Modern Erlang VM discussions)
Cryptography and Security:
- Signal Protocol Documentation: https://signal.org/docs/
- Double Ratchet Algorithm: https://signal.org/docs/specifications/doubleratchet/
- X3DH Key Agreement Protocol: https://signal.org/docs/specifications/x3dh/
- Crypto101: https://www.crypto101.io/ (Free cryptography book)
Books:
- "Designing Data-Intensive Applications" by Martin Kleppmann
- Chapters on messaging and distributed systems
- "Programming Erlang" by Joe Armstrong (Erlang creator)
- Understanding the language that powers WhatsApp
- "Designing Elixir Systems with OTP" by James Edward Gray II and Bruce A. Tate
- Modern take on BEAM VM patterns
- "System Design Interview Vol 1 & 2" by Alex Xu
- Messaging system design patterns
Related Systems to Study:
- Telegram: Different architecture (MTProto protocol)
- Signal: Open-source reference implementation
- Discord: Modern take on real-time messaging
- Slack: Enterprise messaging architecture
Research Papers:
- "The Signal Protocol" - Formal security analysis papers
- "Asynchronous Ratcheting Trees" - Group messaging improvements
- "SoK: Secure Messaging" - Survey of secure messaging systems
Podcasts and Videos:
- Software Engineering Daily: Episodes on messaging systems
- InfoQ: Conference talks on distributed systems
- Strange Loop Conference: Talks on Erlang/BEAM
End of Bonus Problem 2: WhatsApp Messaging
"50 engineers. 2 billion users. 100 billion messages. The right architecture makes the impossible possible."