Himanshu Kukreja

Bonus Problem 2: WhatsApp Messaging

How 50 Engineers Built the World's Most Efficient Messaging Platform


📱 The Impossible Made Simple

When Facebook acquired WhatsApp for $19 billion in 2014, something didn't add up.

The messaging app was handling 50 billion messages daily, with only 32 engineers on the backend.

How is that even possible?

THE WHATSAPP PARADOX (2014 Acquisition)

┌──────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│   MESSAGES PER DAY              ENGINEERS                                │
│   ──────────────────            ─────────                                │
│   50 Billion                    32 (backend)                             │
│                                 50 (total)                               │
│                                                                          │
│   SERVERS                       COST PER USER                            │
│   ───────                       ─────────────                            │
│   ~550                          Fraction of competitors                  │
│                                                                          │
│   CONNECTIONS PER SERVER        TECHNOLOGY STACK                         │
│   ──────────────────────        ────────────────                         │
│   2-3 Million                   Erlang + FreeBSD                         │
│   (Industry: ~100K typical)     (Not Java, not Linux)                    │
│                                                                          │
│   For comparison:                                                        │
│   • Facebook Messenger: 1000+ engineers                                  │
│   • Most messaging apps: 100-500 engineers for similar scale             │
│                                                                          │
│   WhatsApp achieved 20x efficiency. How?                                 │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

TODAY'S SCALE (2025)

┌──────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│   MONTHLY ACTIVE USERS          DAILY MESSAGES                           │
│   ────────────────────          ──────────────                           │
│   2.5+ Billion                  100+ Billion                             │
│                                                                          │
│   PEAK MESSAGES/SECOND          COUNTRIES                                │
│   ────────────────────          ─────────                                │
│   3+ Million                    180+                                     │
│                                                                          │
│   STATUS UPDATES/DAY            VOICE/VIDEO CALLS                        │
│   ──────────────────            ─────────────────                        │
│   500+ Million                  2+ Billion minutes/day                   │
│                                                                          │
│   ENCRYPTION                    UPTIME                                   │
│   ──────────                    ──────                                   │
│   End-to-End (Signal Protocol)  99.99%+                                  │
│   WhatsApp can't read messages                                           │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

This is the system we'll design today, and along the way we'll see why WhatsApp's technology choices matter.


The Interview Begins

You're interviewing at a startup that wants to build the next great messaging app. The CTO draws on the whiteboard:

Interviewer: "WhatsApp proved that a small team can build a messaging platform for billions. I want you to design a messaging system that can scale to that level. Walk me through how you'd approach it."

╔══════════════════════════════════════════════════════════════════════════╗
║                                                                          ║
║              Design a Global Messaging Platform                          ║
║                                                                          ║
║   Build a messaging system that can handle billions of users with:       ║
║                                                                          ║
║   Requirements:                                                          ║
║   • Support 2 billion users                                              ║
║   • Handle 100+ billion messages per day                                 ║
║   • Real-time message delivery (< 500ms for online users)                ║
║   • Offline message storage and delivery                                 ║
║   • End-to-end encryption (server cannot read messages)                  ║
║   • Group messaging (up to 1000 members)                                 ║
║   • Media sharing (images, video, documents)                             ║
║   • Presence (online/offline/typing indicators)                          ║
║   • Message status (sent, delivered, read)                               ║
║   • 99.99% availability                                                  ║
║                                                                          ║
║   Constraint: Build it with a small team (< 100 engineers)               ║
║                                                                          ║
╚══════════════════════════════════════════════════════════════════════════╝

Interviewer: "The team size constraint is intentional. I want to see if you understand that architecture choices determine how many engineers you need. WhatsApp proved you can do more with less."


Phase 1: Requirements Clarification

You: "Let me make sure I understand the constraints and can make the right trade-offs."

Your Questions

You: "First, what's the message delivery guarantee? Is it okay if a message is delayed, or must it be truly real-time?"

Interviewer: "For online users, we want sub-second delivery. For offline users, messages should be queued and delivered when they come online. Messages should never be lost."

You: "What about message ordering? If I send two messages quickly, must they arrive in order?"

Interviewer: "Within a single conversation, yes: messages must be ordered. But I don't care if Message A to Bob arrives before Message B to Alice, even if B was sent first."

You: "For end-to-end encryption, does the server ever need to read message content?"

Interviewer: "Never. The server should only route encrypted blobs. This is non-negotiable for user trust."

You: "What's the read/write ratio? How often do users check for messages vs send them?"

Interviewer: "Heavy read. Users open the app frequently to check for messages. For every message sent, there are probably 10 message reads/checks. Also, presence updates are extremely chatty: typing indicators, online status."

You: "What about media? Video messages can be large."

Interviewer: "Media follows a different path. Upload to storage, send a reference in the message. The encrypted media is stored separately from chat servers."
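
A minimal sketch of that separate media path, assuming the client has already encrypted the file and the store hands back an opaque reference. The `BlobStore` class, the `media_message` helper, and the use of a SHA-256 digest as the reference are illustrative assumptions, not WhatsApp's actual API:

```python
import hashlib


class BlobStore:
    """In-memory stand-in for an S3-like media store (hypothetical API)."""

    def __init__(self):
        self._blobs = {}

    def upload(self, encrypted_media: bytes) -> str:
        # Assumption: the reference is the SHA-256 of the already-encrypted bytes.
        ref = hashlib.sha256(encrypted_media).hexdigest()
        self._blobs.setdefault(ref, encrypted_media)
        return ref

    def download(self, ref: str) -> bytes:
        return self._blobs[ref]

    def __len__(self):
        return len(self._blobs)


def media_message(store: BlobStore, encrypted_media: bytes, mime_type: str) -> dict:
    """Upload the media once, then build the small message that travels via chat servers."""
    return {
        "type": "media",
        "mime_type": mime_type,
        "media_ref": store.upload(encrypted_media),
        "size": len(encrypted_media),
    }


store = BlobStore()
encrypted_video = b"\x8f" * 4096          # placeholder for client-encrypted bytes
original = media_message(store, encrypted_video, "video/mp4")
forwarded = dict(original)                # forwarding reuses the reference, no re-upload
print(original["media_ref"][:16], len(store))
```

The chat servers only ever carry the small dictionary; the large payload moves between the clients and the blob store.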

Requirements Summary

Functional Requirements:

1. MESSAGING
   • One-to-one messaging
   • Group messaging (up to 1000 members)
   • Message types: text, image, video, audio, document, location
   • Message status: sent → delivered → read
   • Reply, forward, delete (for everyone)

2. PRESENCE
   • Online/offline status
   • Last seen timestamp
   • Typing indicators
   • Privacy controls (hide last seen)

3. MEDIA HANDLING
   • Upload and download media
   • Automatic compression
   • Progressive loading (blurry → clear)
   • Media forwarding (reuse uploaded media)

4. SECURITY
   • End-to-end encryption for all messages
   • Device verification (QR code scan)
   • Two-factor authentication

5. SYNC
   • Multi-device support
   • Message sync across devices
   • Local backup (encrypted)

Non-Functional Requirements:

SCALE
• 2 billion users
• 500 million daily active users
• 100 billion messages/day
• 3 million messages/second (peak)

LATENCY
• Online delivery: < 500ms
• Offline delivery: < 5s after coming online
• Typing indicators: < 200ms

AVAILABILITY
• 99.99% uptime
• Graceful degradation (messages queue during delays)

STORAGE
• Messages stored until delivered
• Delivered messages deleted from server
• Media retained for 30 days

EFFICIENCY
• Minimal bandwidth (works on 2G)
• Minimal battery drain
• Small engineering team
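
To make the 99.99% availability target concrete, it can be converted into a downtime budget. This is a small sketch; the function name `downtime_budget_minutes` and the 365-day year and 30-day month are assumptions chosen for illustration:

```python
from decimal import Decimal

MINUTES_PER_YEAR = 365 * 24 * 60      # 525,600 (assumes a 365-day year)
MINUTES_PER_MONTH = 30 * 24 * 60      # 43,200 (assumes a 30-day month)


def downtime_budget_minutes(availability_percent: str, period_minutes: int) -> Decimal:
    """Minutes of allowed downtime in a period for a given availability target."""
    unavailable = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable * period_minutes


yearly = downtime_budget_minutes("99.99", MINUTES_PER_YEAR)
monthly = downtime_budget_minutes("99.99", MINUTES_PER_MONTH)
print(f"99.99% uptime allows {yearly} min/year, {monthly} min/month")
```

Under these assumptions the budget is under an hour per year, which is why deploys that require restarts are hard to afford at this target.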

Phase 2: Back of the Envelope Estimation

You: "Let me work through the numbers to understand infrastructure needs."

Traffic Calculations

MESSAGE VOLUME

Daily messages:           100,000,000,000 (100 billion)
Seconds per day:          86,400
Average MPS:              ~1.16 million messages/second

Peak multiplier:          3x (evening hours globally)
Peak MPS:                 ~3.5 million messages/second

Each message triggers:
├── Sender acknowledgment: 1 message
├── Recipient delivery:    1 message
├── Read receipt:          1 message
└── Total events/message:  ~3-4x

Effective events/second:  ~10-15 million at peak
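
The arithmetic above can be reproduced in a few lines. This is a sketch using the same inputs (100 billion messages/day, 3x peak, 3-4 events per message); the function and key names are illustrative:

```python
SECONDS_PER_DAY = 86_400


def traffic_estimate(daily_messages, peak_multiplier=3, events_per_message=(3, 4)):
    """Return average/peak messages per second and the peak event range."""
    avg_mps = daily_messages / SECONDS_PER_DAY
    peak_mps = avg_mps * peak_multiplier
    low_k, high_k = events_per_message
    return {
        "avg_mps": avg_mps,
        "peak_mps": peak_mps,
        "peak_events_low": peak_mps * low_k,
        "peak_events_high": peak_mps * high_k,
    }


estimate = traffic_estimate(100_000_000_000)
for name, value in estimate.items():
    print(f"{name}: {value / 1e6:.2f} million/second")
```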

Connection Calculations

CONCURRENT CONNECTIONS

Daily active users:       500 million
Peak concurrent:          ~10% = 50 million connections

If each server handles 2 million connections:
├── Servers needed:       25 servers (just for connections!)
└── With redundancy:      50-75 servers

WhatsApp achieved 2-3 million connections per server.
Most companies: 50,000-100,000 per server.
This 20-40x efficiency is the key insight.
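
A quick sketch of how connections per server drive fleet size, using the figures above (500 million DAU, 10% concurrent). The helper name and the use of integer percentages are illustrative choices:

```python
def servers_needed(daily_active_users, concurrent_percent, conns_per_server):
    """Connection servers required, rounded up, before adding redundancy."""
    concurrent = daily_active_users * concurrent_percent // 100
    return -(-concurrent // conns_per_server)  # ceiling division on integers


DAU = 500_000_000
high_density = servers_needed(DAU, 10, 2_000_000)   # 2M connections per server
typical = servers_needed(DAU, 10, 100_000)          # 100K connections per server

print(f"2M conns/server:   {high_density} servers "
      f"({2 * high_density}-{3 * high_density} with 2-3x redundancy)")
print(f"100K conns/server: {typical} servers")
```

With the same user base, the lower-density fleet is 20 times larger, and every extra server is something a small team has to deploy, monitor, and debug.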

Storage Calculations

MESSAGE STORAGE

Average message size:     ~500 bytes (encrypted)
Daily messages:           100 billion
Daily storage:            50 TB/day (if we stored everything)

BUT: WhatsApp deletes messages after delivery!
├── Undelivered at any time: ~5% of daily messages
├── Actual storage:          ~2.5 TB active
└── Plus offline queues:     ~10 TB

Media storage is separate:
├── Daily uploads:        ~10 billion media files
├── Average size:         100 KB (compressed)
├── Daily media:          ~1 PB/day
├── 30-day retention:     ~30 PB
└── With deduplication:   ~15 PB
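
The storage figures can be checked the same way. This sketch uses decimal units (1 TB = 10^12 bytes) and assumes a 50% deduplication saving, which is the ratio implied by the 30 PB and 15 PB figures above:

```python
TB = 10**12
PB = 10**15

DAILY_MESSAGES = 100_000_000_000
MESSAGE_BYTES = 500
UNDELIVERED_PERCENT = 5

DAILY_MEDIA_FILES = 10_000_000_000
MEDIA_BYTES = 100_000            # 100 KB, decimal
RETENTION_DAYS = 30
DEDUP_SAVING_PERCENT = 50        # assumption implied by 30 PB -> 15 PB

message_bytes_per_day = DAILY_MESSAGES * MESSAGE_BYTES
active_message_bytes = message_bytes_per_day * UNDELIVERED_PERCENT // 100

media_bytes_per_day = DAILY_MEDIA_FILES * MEDIA_BYTES
media_retained = media_bytes_per_day * RETENTION_DAYS
media_after_dedup = media_retained * (100 - DEDUP_SAVING_PERCENT) // 100

print(f"messages/day if all stored: {message_bytes_per_day / TB:.1f} TB")
print(f"undelivered at any time:    {active_message_bytes / TB:.1f} TB")
print(f"media/day:                  {media_bytes_per_day / PB:.1f} PB")
print(f"media, 30-day retention:    {media_retained / PB:.1f} PB")
print(f"media after deduplication:  {media_after_dedup / PB:.1f} PB")
```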

Phase 3: High-Level Architecture

You: "WhatsApp's architecture is remarkably simple. That's the secret: simplicity enables a small team."

The WhatsApp Philosophy

WHATSAPP'S ENGINEERING PRINCIPLES

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  1. FOCUS                                                               │
│     One thing, done exceptionally well: messaging.                      │
│     No social feed, no stories (initially), no ads.                     │
│     Feature creep is the enemy of reliability.                          │
│                                                                         │
│  2. RELIABILITY OVER FEATURES                                           │
│     A message that doesn't deliver is worse than no feature.            │
│     Every new feature adds complexity and potential failures.           │
│     Say no to most feature requests.                                    │
│                                                                         │
│  3. RIGHT TOOL FOR THE JOB                                              │
│     Erlang for concurrency (designed for telecom switches).             │
│     FreeBSD for networking performance.                                 │
│     Mnesia for in-memory distributed data.                              │
│     Unconventional, but perfect fit.                                    │
│                                                                         │
│  4. SMALL TEAM = BETTER COMMUNICATION                                   │
│     32 engineers means everyone knows the whole system.                 │
│     Every line of code reviewed by founders (initially).                │
│     No coordination overhead between 50 microservice teams.             │
│                                                                         │
│  5. LET IT CRASH                                                        │
│     Erlang philosophy: processes crash, supervisors restart.            │
│     Don't try to handle every error; let it fail and recover.           │
│     Simple, predictable failure recovery.                               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

System Architecture

WHATSAPP ARCHITECTURE

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│                           CLIENT LAYER                                 │
│                                                                        │
│    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                   │
│    │   Android   │  │     iOS     │  │  WhatsApp   │                   │
│    │    App      │  │    App      │  │    Web      │                   │
│    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                   │
│           │                │                │                          │
│           └────────────────┼────────────────┘                          │
│                            │                                           │
│                            │  (Persistent TCP + TLS)                   │
│                            │  (Custom binary protocol)                 │
│                            │                                           │
└────────────────────────────┼───────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│                          EDGE / GATEWAY                                 │
│                                                                         │
│    ┌─────────────────────────────────────────────────────────────────┐  │
│    │                    LOAD BALANCERS                               │  │
│    │         (GeoDNS routes to nearest data center)                  │  │
│    └──────────────────────────┬──────────────────────────────────────┘  │
│                               │                                         │
│                               ▼                                         │
│    ┌─────────────────────────────────────────────────────────────────┐  │
│    │                    CHAT SERVERS                                 │  │
│    │                   (Erlang + FreeBSD)                            │  │
│    │                                                                 │  │
│    │   ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐               │  │
│    │   │ Chat 1  │ │ Chat 2  │ │ Chat 3  │ │ Chat N  │               │  │
│    │   │         │ │         │ │         │ │         │               │  │
│    │   │  2-3M   │ │  2-3M   │ │  2-3M   │ │  2-3M   │               │  │
│    │   │ conns   │ │ conns   │ │ conns   │ │ conns   │               │  │
│    │   └─────────┘ └─────────┘ └─────────┘ └─────────┘               │  │
│    │                                                                 │  │
│    │   Each connection = 1 Erlang process (~2KB memory)              │  │
│    │   Process handles: auth, presence, message routing              │  │
│    │                                                                 │  │
│    └──────────────────────────┬──────────────────────────────────────┘  │
│                               │                                         │
└───────────────────────────────┼─────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│                         BACKEND SERVICES                               │
│                                                                        │
│    ┌───────────────┐    ┌───────────────┐    ┌───────────────┐         │
│    │   MESSAGE     │    │   OFFLINE     │    │    MEDIA      │         │
│    │   ROUTING     │    │    STORE      │    │   STORAGE     │         │
│    │               │    │               │    │               │         │
│    │  Direct P2P   │    │  Queue for    │    │  S3-like      │         │
│    │  between      │    │  offline      │    │  blob store   │         │
│    │  chat servers │    │  users        │    │  for media    │         │
│    └───────────────┘    └───────────────┘    └───────────────┘         │
│                                                                        │
│    ┌───────────────┐    ┌───────────────┐    ┌───────────────┐         │
│    │     USER      │    │    GROUP      │    │    KEYS       │         │
│    │   REGISTRY    │    │   METADATA    │    │   (for E2E)   │         │
│    │               │    │               │    │               │         │
│    │  User→Server  │    │  Group        │    │  Public key   │         │
│    │  mapping      │    │  membership   │    │  distribution │         │
│    └───────────────┘    └───────────────┘    └───────────────┘         │
│                                                                        │
│    All backed by: Mnesia (Erlang distributed DB) + ETS (in-memory)     │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
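
How the user registry, message routing, and offline store in the diagram fit together can be sketched in a few lines. This is a single-process Python model with hypothetical names (`Router`, `connect`, `route`); in the real system these would be distributed Erlang services backed by Mnesia/ETS:

```python
from collections import defaultdict, deque


class Router:
    """Toy model: user registry + offline store + forwarding between chat servers."""

    def __init__(self):
        self.registry = {}                     # user_id -> chat server name
        self.offline = defaultdict(deque)      # user_id -> queued encrypted blobs
        self.forwarded = defaultdict(list)     # server -> [(user_id, blob)]

    def connect(self, user_id: str, server: str) -> int:
        """Register the user's server and flush anything queued while offline."""
        self.registry[user_id] = server
        pending = self.offline.pop(user_id, deque())
        for blob in pending:                   # FIFO keeps per-recipient order
            self.forwarded[server].append((user_id, blob))
        return len(pending)

    def disconnect(self, user_id: str) -> None:
        self.registry.pop(user_id, None)

    def route(self, recipient: str, blob: bytes) -> str:
        """Forward an opaque blob if the recipient is online, else queue it."""
        server = self.registry.get(recipient)
        if server is None:
            self.offline[recipient].append(blob)
            return "queued"
        self.forwarded[server].append((recipient, blob))
        return "forwarded"


router = Router()
router.connect("bob", "chat-7")
print(router.route("bob", b"<ciphertext 1>"))   # bob is online
router.disconnect("bob")
print(router.route("bob", b"<ciphertext 2>"))   # bob is offline
print(router.connect("bob", "chat-3"))          # reconnect flushes the queue
```

The router never inspects the blob, which is what allows end-to-end encryption to coexist with server-side queuing.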

Message Flow

MESSAGE FLOW: ALICE SENDS "HELLO" TO BOB

Alice's Phone              WhatsApp Servers              Bob's Phone
     │                           │                            │
     │  ① ENCRYPT                │                            │
     │  Message encrypted with   │                            │
     │  a session key for Bob    │                            │
     │  (Signal Protocol)        │                            │
     │                           │                            │
     │  ② SEND                   │                            │
     │  ─────────────────────────▶                            │
     │  Encrypted blob +         │                            │
     │  recipient: bob@wa        │                            │
     │                           │                            │
     │  ③ ACK (SENT ✓)           │                            │
     │  ◀─────────────────────────                            │
     │  Server received message  │                            │
     │  (Single tick)            │                            │
     │                           │                            │
     │                           │  ④ LOOKUP                  │
     │                           │  Where is Bob connected?   │
     │                           │  User registry → Server 7  │
     │                           │                            │
     │                           │                            │
     │                     IF BOB IS ONLINE                   │
     │                           │                            │
     │                           │  ⑤ ROUTE                   │
     │                           │  ─────────────────────────▶│
     │                           │  Forward to Bob's process  │
     │                           │                            │
     │                           │  ⑥ DELIVERED ACK           │
     │                           │  ◀─────────────────────────│
     │                           │  Bob's device got it       │
     │                           │                            │
     │  ⑦ DELIVERED (✓✓)         │                            │
     │  ◀─────────────────────────                            │
     │  (Double tick)            │                            │
     │                           │                            │
     │                           │                            │
     │                     BOB READS THE MESSAGE              │
     │                           │                            │
     │                           │  ⑧ READ RECEIPT            │
     │                           │  ◀─────────────────────────│
     │                           │  Bob opened the chat       │
     │                           │                            │
     │  ⑨ READ (✓✓ blue)         │                            │
     │  ◀─────────────────────────                            │
     │  (Blue double tick)       │                            │
     │                           │                            │
     ▼                           ▼                            ▼

KEY INSIGHT: Server NEVER decrypts the message.
It only routes the encrypted blob and tracks delivery status.
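
The tick progression in the flow above behaves like a one-way state machine on the sender's device. A minimal sketch, assuming receipts can arrive late or out of order and must never move a message backwards; the class and method names are illustrative:

```python
from enum import IntEnum


class Status(IntEnum):
    PENDING = 0     # not yet acknowledged by the server
    SENT = 1        # server ACK (single tick)
    DELIVERED = 2   # recipient device ACK (double tick)
    READ = 3        # read receipt (blue double tick)


TICKS = {
    Status.PENDING: "",
    Status.SENT: "✓",
    Status.DELIVERED: "✓✓",
    Status.READ: "✓✓ (blue)",
}


class OutgoingMessage:
    def __init__(self, msg_id: str):
        self.msg_id = msg_id
        self.status = Status.PENDING

    def apply_receipt(self, status: Status) -> bool:
        """Advance the status; ignore duplicates and out-of-order receipts."""
        if status > self.status:
            self.status = status
            return True
        return False

    def ticks(self) -> str:
        return TICKS[self.status]


msg = OutgoingMessage("m-1")
for receipt in (Status.SENT, Status.READ, Status.DELIVERED):  # DELIVERED arrives late
    msg.apply_receipt(receipt)
    print(receipt.name, "->", msg.ticks())
```

Because the receipts only carry a message ID and a status, the server can relay them without access to the message content.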

Phase 4: Deep Dives

Deep Dive 1: Why Erlang?

Week 2 concepts: Failure handling, resilience. Week 10 concepts: Operational simplicity.

You: "WhatsApp's secret weapon is Erlang, a language designed in the 1980s for telephone switches. It's a perfect fit for messaging."

The Challenge:

THE CONCURRENCY PROBLEM

Thread-per-connection approach (e.g., classic blocking Java or Python servers):
├── Each connection = 1 OS thread
├── OS thread = ~1MB memory overhead (mostly reserved stack)
├── Context switching is expensive
├── 1 million connections = 1 million threads = ~1TB RAM!
└── Practical limit: ~50,000-100,000 connections/server

WhatsApp needs:
├── 2-3 million connections per server
├── Each connection needs state (session, queue)
├── Must handle failures gracefully
└── Hot code reloading (no restarts for deploys)

Why Erlang Solves This:

ERLANG'S SUPERPOWERS

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  1. LIGHTWEIGHT PROCESSES                                               │
│  ─────────────────────────                                              │
│                                                                         │
│  Erlang processes are NOT OS threads.                                   │
│  They're managed by the BEAM virtual machine.                           │
│                                                                         │
│  OS Thread:          Erlang Process:                                    │
│  • ~1MB memory       • ~2KB memory                                      │
│  • Expensive switch  • Cheap switch                                     │
│  • Limited count     • Millions possible                                │
│                                                                         │
│  WhatsApp runs 1 Erlang process per user connection.                    │
│  2 million connections = 2 million processes = ~4GB RAM                 │
│  (vs 2TB with OS threads)                                               │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  2. "LET IT CRASH" PHILOSOPHY                                           │
│  ────────────────────────────                                           │
│                                                                         │
│  Traditional: Try to handle every error (complex, error-prone)          │
│  Erlang: Let processes crash, supervisors restart them                  │
│                                                                         │
│            ┌─────────────────┐                                          │
│            │   SUPERVISOR    │ (monitors children)                      │
│            └────────┬────────┘                                          │
│                     │                                                   │
│      ┌──────────────┼──────────────┐                                    │
│      ▼              ▼              ▼                                    │
│  ┌───────┐      ┌───────┐      ┌───────┐                                │
│  │User A │      │User B │      │User C │                                │
│  │Process│      │Process│      │Process│                                │
│  └───────┘      └───────┘      └───────┘                                │
│                     │                                                   │
│                     │ CRASH!                                            │
│                     ▼                                                   │
│                 ┌───────┐                                               │
│                 │User B │ ← Supervisor restarts                         │
│                 │Process│   in milliseconds                             │
│                 └───────┘                                               │
│                                                                         │
│  User A and C never notice. User B reconnects automatically.            │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  3. HOT CODE RELOADING                                                  │
│  ─────────────────────                                                  │
│                                                                         │
│  Deploy new code WITHOUT restarting the server.                         │
│  Connections stay alive during deployments.                             │
│  No maintenance windows needed.                                         │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  4. DISTRIBUTED BY DESIGN                                               │
│  ─────────────────────────                                              │
│                                                                         │
│  Erlang processes can communicate across machines seamlessly.           │
│  Send a message to a process on another server = same syntax.           │
│  Built-in distribution, no external message queue needed.               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

WhatsApp's Erlang Customizations:

# Pseudocode: How WhatsApp structures connection handling
# (Actual WhatsApp code is Erlang, this shows the concept)

"""
WhatsApp Connection Handler

Each user connection runs as a single Erlang process.
This process:
1. Owns the TCP socket
2. Maintains user session state
3. Routes messages to/from the user
4. Dies cleanly when user disconnects
"""

class UserConnectionProcess:
    """
    Conceptual model of WhatsApp's per-user Erlang process.
    
    In Erlang, this is implemented with gen_server behavior.
    Memory: ~2KB per process
    Lifecycle: Lives as long as user is connected
    """
    
    def __init__(self, socket, user_id):
        self.socket = socket
        self.user_id = user_id
        self.session_key = None
        self.message_queue = []  # For messages while processing
        self.last_seen = now()
        
        # Register this process in the global user registry
        UserRegistry.register(user_id, self.pid)
    
    def handle_message(self, msg):
        """
        Main message handler, called for every incoming message.
        
        Erlang processes have a mailbox: messages queue up
        and are processed one at a time.
        """
        match msg:
            case AuthRequest(credentials):
                self.handle_auth(credentials)
            
            case OutgoingMessage(recipient, encrypted_content):
                self.route_to_recipient(recipient, encrypted_content)
            
            case IncomingMessage(sender, encrypted_content):
                self.deliver_to_socket(sender, encrypted_content)
            
            case PresenceUpdate(status):
                self.broadcast_presence(status)
            
            case Disconnect():
                self.cleanup_and_die()
    
    def route_to_recipient(self, recipient_id, content):
        """
        Route message to another user.
        
        This is the core of WhatsApp's routing:
        1. Look up recipient's process (which server?)
        2. Send message directly to that process
        """
        # Find recipient's process (could be on any server)
        recipient_process = UserRegistry.lookup(recipient_id)
        
        if recipient_process:
            # Direct process-to-process communication
            # Works across servers in Erlang!
            send(recipient_process, IncomingMessage(
                sender=self.user_id,
                content=content
            ))
            # Simplification: a real "delivered" status would wait
            # for the recipient device's delivery ACK
            self.send_ack(MessageStatus.DELIVERED)
        else:
            # User offline: store for later
            OfflineStore.queue(recipient_id, content)
            self.send_ack(MessageStatus.SENT)
    
    def cleanup_and_die(self):
        """
        Clean shutdown when user disconnects.
        
        In Erlang, each process has its own heap, so when a process
        dies its memory is reclaimed in one step. A leak inside one
        connection cannot outlive that connection.
        """
        UserRegistry.unregister(self.user_id)
        close(self.socket)
        # Process simply terminates; Erlang handles cleanup
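
The pseudocode above leaves `send`, `UserRegistry`, and `OfflineStore` undefined. As a rough, runnable analogy (not WhatsApp's implementation), the sketch below models each connected user as an asyncio task with its own queue acting as the mailbox. The names `user_process`, `route`, and the dict-based offline store are assumptions made for illustration only.

```python
import asyncio


class UserRegistry:
    """Hypothetical in-memory registry: user_id -> mailbox (asyncio.Queue)."""

    def __init__(self):
        self._mailboxes = {}

    def register(self, user_id, mailbox):
        self._mailboxes[user_id] = mailbox

    def unregister(self, user_id):
        self._mailboxes.pop(user_id, None)

    def lookup(self, user_id):
        return self._mailboxes.get(user_id)


async def user_process(user_id, registry, delivered):
    """One task per connected user; handles mailbox items one at a time."""
    mailbox = asyncio.Queue()
    registry.register(user_id, mailbox)
    try:
        while True:
            msg = await mailbox.get()
            if msg is None:  # disconnect sentinel
                break
            sender, content = msg
            delivered.append((user_id, sender, content))
    finally:
        # Mirrors cleanup_and_die: always leave the registry on exit
        registry.unregister(user_id)


def route(registry, offline_store, sender, recipient, content):
    """Send to the recipient's mailbox if online, else queue for later."""
    mailbox = registry.lookup(recipient)
    if mailbox is not None:
        mailbox.put_nowait((sender, content))
        return "routed"
    offline_store.setdefault(recipient, []).append((sender, content))
    return "queued"


async def demo():
    registry, offline_store, delivered = UserRegistry(), {}, []
    bob = asyncio.create_task(user_process("bob", registry, delivered))
    await asyncio.sleep(0)  # let bob's task start and register itself
    r1 = route(registry, offline_store, "alice", "bob", b"ciphertext-1")
    r2 = route(registry, offline_store, "alice", "carol", b"ciphertext-2")
    registry.lookup("bob").put_nowait(None)  # bob disconnects
    await bob
    return r1, r2, delivered, offline_store


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The analogy is loose: asyncio tasks share one thread and one heap, so they give the mailbox model but not Erlang's isolation or cross-machine messaging.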

The "Island" Architecture:

WHATSAPP'S ISLAND ARCHITECTURE

To limit the blast radius of failures, WhatsApp partitions
servers into independent "islands."

┌──────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│   ISLAND 1                    ISLAND 2                   ISLAND 3        │
│                                                                          │
│  ┌─────────────────┐        ┌─────────────────┐        ┌──────────────┐  │
│  │  Chat Servers   │        │  Chat Servers   │        │ Chat Servers │  │
│  │  1, 2, 3, 4     │        │  5, 6, 7, 8     │        │ 9, 10, 11    │  │
│  │                 │        │                 │        │              │  │
│  │  Users A-H      │        │  Users I-P      │        │ Users Q-Z    │  │
│  │                 │        │                 │        │              │  │
│  │  Replica: 1a    │        │  Replica: 2a    │        │ Replica: 3a  │  │
│  └─────────────────┘        └─────────────────┘        └──────────────┘  │
│         │                          │                         │           │
│         │                          │                         │           │
│         └──────────────────────────┴─────────────────────────┘           │
│                                    │                                     │
│                                    ▼                                     │
│                         Cross-Island Routing                             │
│                         (for A→Z messages)                               │
│                                                                          │
│  BENEFITS:                                                               │
│  • Island 1 can fail without affecting Island 2                          │
│  • Upgrades can be rolled out island by island                           │
│  • Each island has its own replica for failover                          │
│  • Limits "blast radius" of any failure                                  │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
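
The routing decision in the diagram can be sketched in a few lines. This takes the diagram's alphabetical split (A-H, I-P, Q-Z) literally, which is almost certainly a teaching simplification; `Island`, `island_for`, and `plan_route` are hypothetical names, and how WhatsApp actually assigns users to islands is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Island:
    name: str
    first_letters: str  # user-name initials served by this island
    servers: tuple      # chat server numbers from the diagram


# Partition copied from the diagram; purely illustrative.
ISLANDS = (
    Island("island-1", "ABCDEFGH", (1, 2, 3, 4)),
    Island("island-2", "IJKLMNOP", (5, 6, 7, 8)),
    Island("island-3", "QRSTUVWXYZ", (9, 10, 11)),
)


def island_for(user: str) -> Island:
    """Find the island that owns a user, based on the first letter."""
    initial = user[0].upper()
    for island in ISLANDS:
        if initial in island.first_letters:
            return island
    raise ValueError(f"no island serves user {user!r}")


def plan_route(sender: str, recipient: str) -> str:
    """Decide whether a message stays local or crosses islands."""
    src, dst = island_for(sender), island_for(recipient)
    if src == dst:
        return f"local:{src.name}"
    return f"cross-island:{src.name}->{dst.name}"


if __name__ == "__main__":
    print(plan_route("alice", "bob"))
    print(plan_route("alice", "zoe"))
```

Only the second call touches the cross-island path, which is why a failure confined to one island leaves traffic inside the others unaffected.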

Deep Dive 2: End-to-End Encryption (Signal Protocol)

Week 9 concepts: Security, encryption.

You: "WhatsApp uses the Signal Protocol for end-to-end encryption. This is critical: even WhatsApp can't read your messages."

END-TO-END ENCRYPTION OVERVIEW

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│   WITHOUT E2EE:                                                         │
│   ─────────────                                                         │
│                                                                         │
│   Alice ──"Hi"──▶ Server ──"Hi"──▶ Bob                                  │
│                    │                                                    │
│                    │ Server can read "Hi"                               │
│                    │ Hackers who breach server can read "Hi"            │
│                    │ Government subpoenas can access "Hi"               │
│                                                                         │
│   WITH E2EE:                                                            │
│   ──────────                                                            │
│                                                                         │
│   Alice ──[encrypted blob]──▶ Server ──[encrypted blob]──▶ Bob          │
│                                  │                                      │
│                                  │ Server sees random bytes             │
│                                  │ Only Alice and Bob can decrypt       │
│                                  │ Even WhatsApp can't read it          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Signal Protocol Components:

SIGNAL PROTOCOL: THE BUILDING BLOCKS

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  1. KEY PAIRS                                                           │
│  ────────────                                                           │
│                                                                         │
│  Each user has multiple key pairs:                                      │
│                                                                         │
│  Identity Key (long-term):                                              │
│  • Generated once when you install WhatsApp                             │
│  • Never changes (unless you reinstall)                                 │
│  • Public part uploaded to WhatsApp servers                             │
│                                                                         │
│  Signed Pre-Key (medium-term):                                          │
│  • Rotates periodically                                                 │
│  • Signed by Identity Key                                               │
│  • Uploaded to servers                                                  │
│                                                                         │
│  One-Time Pre-Keys (ephemeral):                                         │
│  • Batch of ~100 keys uploaded                                          │
│  • Each used once then discarded                                        │
│  • Provides forward secrecy                                             │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  2. X3DH KEY AGREEMENT                                                  │
│  ─────────────────────                                                  │
│                                                                         │
│  When Alice messages Bob for the first time:                            │
│                                                                         │
│  Alice                         Server                         Bob       │
│    │                             │                             │        │
│    │  "Give me Bob's keys"       │                             │        │
│    │────────────────────────────▶│                             │        │
│    │                             │                             │        │
│    │  Bob's: Identity Key        │                             │        │
│    │         Signed Pre-Key      │                             │        │
│    │         One-Time Pre-Key    │                             │        │
│    │◀────────────────────────────│                             │        │
│    │                             │                             │        │
│    │                                                           │        │
│    │  Alice computes shared secret using:                      │        │
│    │  • Her Identity Key + Bob's Signed Pre-Key                │        │
│    │  • Her Ephemeral Key + Bob's Identity Key                 │        │
│    │  • Her Ephemeral Key + Bob's Signed Pre-Key               │        │
│    │  • Her Ephemeral Key + Bob's One-Time Pre-Key             │        │
│    │                                                           │        │
│    │  → Shared Secret (only Alice and Bob can compute)         │        │
│    │                                                           │        │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  3. DOUBLE RATCHET                                                      │
│  ─────────────────                                                      │
│                                                                         │
│  After initial key agreement, keys evolve with EVERY message:           │
│                                                                         │
│  Message 1: Encrypt with Key_1                                          │
│  Message 2: Encrypt with Key_2 (derived from Key_1)                     │
│  Message 3: Encrypt with Key_3 (derived from Key_2)                     │
│  ...                                                                    │
│                                                                         │
│  Benefits:                                                              │
│  • Forward Secrecy: Compromise of Key_3 doesn't reveal Msg 1-2          │
│  • Post-Compromise Security: each reply mixes in a new DH secret,       │
│    so a leaked key stops exposing future messages                       │
│  • Break-in Recovery: System "heals" itself over time                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
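
To see why only Alice and Bob can compute the X3DH shared secret, it helps to run both sides. The sketch below performs the four Diffie-Hellman computations X3DH defines, using toy finite-field DH from the standard library instead of Curve25519, and a plain SHA-256 hash instead of the protocol's HKDF. The prime `P`, generator `G`, and all function names are assumptions chosen for illustration; nothing here is secure.

```python
import hashlib
import secrets

# Toy parameters: a small Mersenne prime and generator. NOT secure;
# the real protocol uses Curve25519.
P = 2**127 - 1
G = 3


def keypair():
    """Return (private, public) for toy finite-field Diffie-Hellman."""
    priv = secrets.randbelow(P - 2) + 1
    return priv, pow(G, priv, P)


def dh(my_private: int, their_public: int) -> bytes:
    """Shared DH value; both parties compute the same bytes."""
    return pow(their_public, my_private, P).to_bytes(16, "big")


def kdf(*parts: bytes) -> bytes:
    """Stand-in for HKDF: hash the concatenated DH outputs."""
    return hashlib.sha256(b"".join(parts)).digest()


# Bob publishes the public halves of these keys to the server.
ik_b, spk_b, opk_b = keypair(), keypair(), keypair()
# Alice has a long-term identity key and a fresh ephemeral key.
ik_a, ek_a = keypair(), keypair()

# Alice's side: her private keys with Bob's public keys.
alice_secret = kdf(
    dh(ik_a[0], spk_b[1]),  # DH1: identity  x signed pre-key
    dh(ek_a[0], ik_b[1]),   # DH2: ephemeral x identity
    dh(ek_a[0], spk_b[1]),  # DH3: ephemeral x signed pre-key
    dh(ek_a[0], opk_b[1]),  # DH4: ephemeral x one-time pre-key
)

# Bob's side: his private keys with Alice's public keys.
bob_secret = kdf(
    dh(spk_b[0], ik_a[1]),
    dh(ik_b[0], ek_a[1]),
    dh(spk_b[0], ek_a[1]),
    dh(opk_b[0], ek_a[1]),
)

print(alice_secret == bob_secret)
```

The server only ever relays public keys, so it can compute none of the four DH values.
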
# encryption/signal_protocol.py

"""
Simplified illustration of Signal Protocol concepts.

The actual implementation uses:
- Curve25519 for key exchange
- AES-256-CBC for encryption
- HMAC-SHA256 for authentication
"""

from dataclasses import dataclass
from typing import Optional, Tuple
import os
import hashlib
import hmac


@dataclass
class KeyPair:
    """A public/private key pair."""
    public_key: bytes
    private_key: bytes


@dataclass
class SessionState:
    """State for an encrypted session with another user."""
    their_identity_key: bytes
    their_signed_prekey: bytes
    
    # Ratchet state
    root_key: bytes
    chain_key: bytes
    message_number: int = 0


class SignalSession:
    """
    Simplified Signal Protocol session.
    
    This illustrates the key concepts; the actual implementation
    is more complex with proper cryptographic primitives.
    """
    
    def __init__(
        self,
        my_identity: KeyPair,
        their_identity_public: bytes,
        their_signed_prekey: bytes,
        their_one_time_prekey: Optional[bytes] = None
    ):
        self.my_identity = my_identity
        
        # Perform X3DH key agreement
        self.session = self._x3dh_key_agreement(
            their_identity_public,
            their_signed_prekey,
            their_one_time_prekey
        )
    
    def _x3dh_key_agreement(
        self,
        their_identity: bytes,
        their_signed_prekey: bytes,
        their_one_time_prekey: Optional[bytes]
    ) -> SessionState:
        """
        X3DH: Extended Triple Diffie-Hellman.
        
        Creates a shared secret from multiple DH exchanges.
        """
        # Generate ephemeral key for this session
        ephemeral = self._generate_keypair()
        
        # Perform multiple DH exchanges
        dh1 = self._dh(self.my_identity.private_key, their_signed_prekey)
        dh2 = self._dh(ephemeral.private_key, their_identity)
        dh3 = self._dh(ephemeral.private_key, their_signed_prekey)
        
        if their_one_time_prekey:
            dh4 = self._dh(ephemeral.private_key, their_one_time_prekey)
            shared_secret = self._kdf(dh1 + dh2 + dh3 + dh4)
        else:
            shared_secret = self._kdf(dh1 + dh2 + dh3)
        
        # Derive initial root and chain keys
        root_key, chain_key = self._derive_keys(shared_secret)
        
        return SessionState(
            their_identity_key=their_identity,
            their_signed_prekey=their_signed_prekey,
            root_key=root_key,
            chain_key=chain_key
        )
    
    def encrypt(self, plaintext: bytes) -> Tuple[bytes, int]:
        """
        Encrypt a message using current chain key.
        
        Each message gets a unique key (Double Ratchet).
        """
        # Derive message key from chain key
        message_key, new_chain_key = self._ratchet_chain(
            self.session.chain_key
        )
        
        # Encrypt with message key
        ciphertext = self._aes_encrypt(message_key, plaintext)
        
        # Update chain key (ratchet forward)
        message_number = self.session.message_number
        self.session.chain_key = new_chain_key
        self.session.message_number += 1
        
        return ciphertext, message_number
    
    def decrypt(self, ciphertext: bytes, message_number: int) -> bytes:
        """
        Decrypt a message.
        
        May need to skip ahead if messages arrive out of order.
        """
        # Derive the message key for this message number
        message_key = self._get_message_key(message_number)
        
        # Decrypt
        plaintext = self._aes_decrypt(message_key, ciphertext)
        
        return plaintext
    
    def _ratchet_chain(self, chain_key: bytes) -> Tuple[bytes, bytes]:
        """
        Symmetric ratchet: derive next keys from current.
        
        This is the "forward secrecy" mechanism: old keys
        can't be derived from new keys.
        """
        # Message key for current message
        message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
        
        # New chain key for next message
        new_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
        
        return message_key, new_chain_key


class GroupEncryption:
    """
    Group messaging uses a different approach.
    
    Instead of pairwise encryption, sender distributes
    a "sender key" to all group members.
    """
    
    def __init__(self, group_id: str, my_user_id: str):
        self.group_id = group_id
        self.my_user_id = my_user_id
        
        # Generate my sender key for this group
        self.my_sender_key = self._generate_sender_key()
        
        # Store other members' sender keys
        self.member_sender_keys = {}
    
    def distribute_sender_key(self, member_sessions: dict):
        """
        Send my sender key to all group members.
        
        Uses pairwise Signal sessions for secure distribution.
        """
        for member_id, session in member_sessions.items():
            # Encrypt sender key using pairwise session
            # (encrypt returns the ciphertext and its message number)
            encrypted_key, message_number = session.encrypt(self.my_sender_key)
            # Send to member via normal message path
            yield member_id, encrypted_key, message_number
    
    def encrypt_for_group(self, plaintext: bytes) -> bytes:
        """
        Encrypt message using my sender key.
        
        All group members who have my sender key can decrypt.
        Much more efficient than pairwise encryption.
        """
        # Use sender key with chain ratchet
        message_key = self._derive_message_key()
        ciphertext = self._aes_encrypt(message_key, plaintext)
        
        # Ratchet forward
        self._ratchet_sender_key()
        
        return ciphertext
    
    def handle_member_left(self, member_id: str):
        """
        When someone leaves, all sender keys must be regenerated.
        
        This ensures the departed member can't read future messages.
        """
        # Generate new sender key
        self.my_sender_key = self._generate_sender_key()
        
        # Redistribute to remaining members
        # (handled by caller)
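
The `decrypt` method above calls `_get_message_key`, which is not shown. One plausible way to handle out-of-order arrival, sketched here with the same HMAC-SHA256 chain step as `_ratchet_chain`, is to ratchet forward to the requested message number and cache the keys skipped along the way. `ReceivingChain`, `max_skip`, and `sender_keys` are hypothetical names, and this covers only the symmetric chain, not the DH ratchet.

```python
import hashlib
import hmac


def ratchet(chain_key: bytes):
    """One symmetric ratchet step: (message_key, next_chain_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain


class ReceivingChain:
    """Derives message keys by number, tolerating out-of-order arrival."""

    def __init__(self, chain_key: bytes, max_skip: int = 1000):
        self.chain_key = chain_key
        self.next_number = 0
        self.skipped = {}         # message number -> cached message key
        self.max_skip = max_skip  # cap on cached keys (assumed value)

    def message_key(self, number: int) -> bytes:
        if number in self.skipped:
            # Late arrival: use the cached key once, then forget it
            return self.skipped.pop(number)
        if number < self.next_number:
            raise KeyError("message key already used")
        if number - self.next_number > self.max_skip:
            raise ValueError("too many skipped messages")
        # Ratchet past any gap, caching the keys we skip over
        while self.next_number < number:
            key, self.chain_key = ratchet(self.chain_key)
            self.skipped[self.next_number] = key
            self.next_number += 1
        key, self.chain_key = ratchet(self.chain_key)
        self.next_number += 1
        return key


def sender_keys(chain_key: bytes, count: int):
    """Message keys the sender would use for messages 0..count-1."""
    keys = []
    for _ in range(count):
        key, chain_key = ratchet(chain_key)
        keys.append(key)
    return keys


if __name__ == "__main__":
    start = b"\x00" * 32
    sent = sender_keys(start, 3)
    rx = ReceivingChain(start)
    # Message 2 arrives first, then 0, then 1
    print([rx.message_key(n) == sent[n] for n in (2, 0, 1)])
```

Each key is deleted after one use, so replaying an old message number fails instead of decrypting again.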

What WhatsApp CAN See (Metadata):

ENCRYPTION SCOPE

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  ENCRYPTED (WhatsApp cannot read):                                      │
│  ─────────────────────────────────                                      │
│  • Message content                                                      │
│  • Media files (photos, videos, documents)                              │
│  • Voice notes                                                          │
│  • Voice and video call content                                         │
│  • Status updates                                                       │
│                                                                         │
│  NOT ENCRYPTED (WhatsApp can see):                                      │
│  ─────────────────────────────────                                      │
│  • Who messaged whom (sender/recipient)                                 │
│  • When messages were sent                                              │
│  • Message size                                                         │
│  • Device information                                                   │
│  • Phone numbers                                                        │
│  • Profile photos                                                       │
│  • Group membership                                                     │
│  • Online/offline status                                                │
│                                                                         │
│  This metadata is necessary for routing but reveals communication       │
│  patterns even if content is secret.                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
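
The split between content and metadata is easy to see in the shape of a message envelope. The sketch below is a hypothetical structure, not WhatsApp's wire format: `Envelope` and `server_view` are invented names, and the field list is a small subset of the metadata in the box above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    sender: str      # routing metadata, visible to the server
    recipient: str   # routing metadata, visible to the server
    sent_at: float   # Unix timestamp, visible to the server
    payload: bytes   # ciphertext; opaque without the session keys


def server_view(envelope: Envelope) -> dict:
    """Everything a routing server learns without any decryption key."""
    return {
        "sender": envelope.sender,
        "recipient": envelope.recipient,
        "sent_at": envelope.sent_at,
        "size": len(envelope.payload),
    }


if __name__ == "__main__":
    env = Envelope("alice", "bob", 1_700_000_000.0, b"\x8f" * 48)
    print(server_view(env))
```

Even with the payload unreadable, the view still answers who talked to whom, when, and roughly how much.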

Deep Dive 3: Offline Message Handling

Week 3 concepts: Message queuing, delivery guarantees.

You: "Most users aren't online all the time. Offline message storage is critical."

OFFLINE MESSAGE FLOW

┌──────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│  Alice sends message to offline Bob                                      │
│                                                                          │
│  Alice                     WhatsApp                    Bob               │
│    │                          │                          │               │
│    │  ① Send (encrypted)     │                          │ (offline)     │
│    │─────────────────────────▶│                          │               │
│    │                          │                          │               │
│    │  ② ACK (single ✓)       │                          │               │
│    │◀─────────────────────────│                          │               │
│    │  "Message received       │                          │               │
│    │   by server"             │                          │               │
│    │                          │                          │               │
│    │                    ③ Store in                      │               │
│    │                    Offline Queue                    │               │
│    │                    ┌─────────┐                      │               │
│    │                    │ Bob's   │                      │               │
│    │                    │ Queue   │                      │               │
│    │                    │ ─────── │                      │               │
│    │                    │ Msg 1   │                      │               │
│    │                    │ Msg 2   │                      │               │
│    │                    │ ...     │                      │               │
│    │                    └─────────┘                      │               │
│    │                          │                          │               │
│    │                          │                          │               │
│    │          ... TIME PASSES (Bob offline) ...          │               │
│    │                          │                          │               │
│    │                          │                          │               │
│    │                          │  ④ Bob comes online     │               │
│    │                          │◀─────────────────────────│               │
│    │                          │  TCP connection          │               │
│    │                          │                          │               │
│    │                          │  ⑤ Drain queue          │               │
│    │                          │─────────────────────────▶│               │
│    │                          │  Send all pending msgs   │               │
│    │                          │                          │               │
│    │  ⑦ Delivered (✓✓)       │  ⑥ Delivery ACK         │               │
│    │◀─────────────────────────│◀─────────────────────────│               │
│    │                          │                          │               │
│    │                    ⑧ Delete from                   │               │
│    │                    queue (delivered)                │               │
│    │                          │                          │               │
│    ▼                          ▼                          ▼               │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
# offline/message_store.py

"""
Offline Message Storage

WhatsApp's approach:
- Messages stored only until delivered
- After delivery confirmation, deleted from server
- This minimizes server storage and privacy exposure
"""

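import asyncio
import uuid
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, timedelta


def generate_id() -> str:
    """Return a random unique message ID (illustrative helper)."""
    return uuid.uuid4().hex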


@dataclass
class QueuedMessage:
    """A message waiting for delivery."""
    message_id: str
    sender_id: str
    recipient_id: str
    encrypted_content: bytes
    timestamp: datetime
    expiry: datetime  # Messages expire after 30 days
    
    # Delivery tracking
    delivery_attempts: int = 0
    last_attempt: Optional[datetime] = None


class OfflineMessageStore:
    """
    Stores messages for offline users.
    
    Design principles:
    - In-memory for speed (Mnesia ETS tables)
    - Replicated for durability
    - Partitioned by recipient for scalability
    - Messages deleted after delivery
    """
    
    def __init__(self, db, max_queue_size: int = 10000):
        self.db = db  # Mnesia in actual WhatsApp
        self.max_queue_size = max_queue_size
        
        # TTL for stored messages
        self.message_ttl = timedelta(days=30)
    
    async def queue_message(
        self,
        sender_id: str,
        recipient_id: str,
        encrypted_content: bytes
    ) -> QueuedMessage:
        """
        Store message for offline recipient.
        """
        message = QueuedMessage(
            message_id=generate_id(),
            sender_id=sender_id,
            recipient_id=recipient_id,
            encrypted_content=encrypted_content,
            timestamp=datetime.utcnow(),
            expiry=datetime.utcnow() + self.message_ttl
        )
        
        # Check queue size limits
        queue_size = await self.db.count_for_recipient(recipient_id)
        if queue_size >= self.max_queue_size:
            # Drop oldest message (or reject new one)
            await self.db.delete_oldest(recipient_id)
        
        await self.db.save(message)
        return message
    
    async def get_pending_messages(
        self,
        recipient_id: str
    ) -> List[QueuedMessage]:
        """
        Get pending messages for a user who just came online.
        
        Called when user establishes connection. Returns at most one
        batch of 1000 messages, oldest first, so longer queues need
        repeated calls.
        """
        messages = await self.db.get_by_recipient(
            recipient_id,
            order_by="timestamp",
            limit=1000  # Batch size
        )
        
        # Filter expired messages
        now = datetime.utcnow()
        valid_messages = [m for m in messages if m.expiry > now]
        
        return valid_messages
    
    async def mark_delivered(
        self,
        message_ids: List[str]
    ):
        """
        Delete messages that have been delivered.
        
        This is called after receiving delivery ACK from recipient.
        """
        for message_id in message_ids:
            await self.db.delete(message_id)
    
    async def cleanup_expired(self):
        """
        Background job: Delete expired messages.
        
        Runs periodically to clean up messages that
        were never delivered (user never came back online).
        """
        now = datetime.utcnow()
        expired = await self.db.get_expired(now)
        
        for message in expired:
            await self.db.delete(message.message_id)
            
            # Optionally notify sender that message expired
            # (WhatsApp doesn't do this)


class WriteBackCache:
    """
    WhatsApp uses write-back caching for offline messages.
    
    - Messages stored in memory first (fast)
    - Asynchronously persisted to disk
    - Handles disk I/O slowdowns gracefully
    """
    
    def __init__(self, memory_store, disk_store):
        self.memory = memory_store   # ETS table
        self.disk = disk_store       # Mnesia disk table
        self.pending_writes = []
    
    async def store(self, message: QueuedMessage):
        """
        Store in memory immediately, persist async.
        """
        # Immediate: memory storage
        await self.memory.put(message.message_id, message)
        
        # Async: queue for disk persistence
        self.pending_writes.append(message)
        
        # Trigger background write (non-blocking)
        asyncio.create_task(self._flush_writes())
    
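    async def _flush_writes(self, force: bool = False):
        """
        Background: persist to disk.
        
        Batched for efficiency. A partial batch is written only when
        force=True, so a periodic timer or shutdown hook should call
        this with force=True; otherwise the tail stays in memory only.
        """
        if force or len(self.pending_writes) >= 100:
            batch = self.pending_writes[:100]
            self.pending_writes = self.pending_writes[100:]
            
            if batch:
                await self.disk.batch_write(batch)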
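
Write-back caching trades durability for latency: whatever is still pending in memory is lost if the process dies before it is flushed. The sketch below makes that tail visible. It is a minimal illustration, not WhatsApp's implementation: `FakeDisk` and `TinyWriteBack` are hypothetical stand-ins for the disk table and the cache, and the batch size of 3 is chosen only to keep the example short.

```python
import asyncio


class FakeDisk:
    """Hypothetical stand-in for the disk store; records each batch written."""

    def __init__(self):
        self.batches = []

    async def batch_write(self, batch):
        self.batches.append(list(batch))


class TinyWriteBack:
    """Write-back buffer: serve from memory at once, persist in batches."""

    def __init__(self, disk, batch_size=3):
        self.disk = disk
        self.batch_size = batch_size
        self.memory = {}
        self.pending = []

    async def store(self, message_id, payload):
        self.memory[message_id] = payload      # readable immediately
        self.pending.append(message_id)
        if len(self.pending) >= self.batch_size:
            await self.flush()

    async def flush(self):
        """Persist everything pending, one batch at a time."""
        while self.pending:
            batch = self.pending[:self.batch_size]
            self.pending = self.pending[self.batch_size:]
            await self.disk.batch_write(batch)


async def demo():
    disk = FakeDisk()
    cache = TinyWriteBack(disk, batch_size=3)
    for i in range(7):
        await cache.store(f"m{i}", f"payload-{i}")
    unflushed = list(cache.pending)   # tail smaller than one batch
    await cache.flush()               # e.g. from a timer or shutdown hook
    return unflushed, disk.batches


unflushed, batches = asyncio.run(demo())
print("in memory only before the final flush:", unflushed)
print("batches written:", batches)
```

Seven messages with a batch size of 3 leave one message waiting in memory, which is why a real write-back design usually pairs the size threshold with a time-based flush.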

Deep Dive 4: Presence and Typing Indicators

Week 4 concepts: Caching, real-time updates.

You: "Presence is the 'chattiest' part of the system, with far more traffic than messages."

PRESENCE TRAFFIC ANALYSIS

For every message sent, there are ~10-100 presence events:
├── User opens app (online)
├── User starts typing (typing indicator)
├── User stops typing
├── User reads messages (last seen update)
├── User closes app (offline)
└── Repeated for every chat participant watching

With 500M daily active users:
├── Messages: ~100 billion/day
├── Presence events: ~1 trillion/day (low end: 10 events per message)
└── That is at least 10x the message traffic!

Presence CANNOT go through the same path as messages.
It needs aggressive optimization.
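
To put the estimate in per-second terms, the arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch using only the figures quoted above (about 100 billion messages per day and 10 to 100 presence events per message); the function name is illustrative, and the result is a daily average that says nothing about peak load.

```python
def presence_events_per_day(messages_per_day: int, events_per_message: int) -> int:
    """Presence events implied by a given events-per-message ratio."""
    return messages_per_day * events_per_message


MESSAGES_PER_DAY = 100_000_000_000  # ~100 billion, from the estimate above
SECONDS_PER_DAY = 86_400

low = presence_events_per_day(MESSAGES_PER_DAY, 10)    # low end of the range
high = presence_events_per_day(MESSAGES_PER_DAY, 100)  # high end of the range

print(f"low:  {low:,} events/day, ~{low // SECONDS_PER_DAY:,} events/sec on average")
print(f"high: {high:,} events/day, ~{high // SECONDS_PER_DAY:,} events/sec on average")
```

Even the low end works out to more than ten million presence events per second on average, which is the motivation for the TTLs, throttling, and selective delivery in the code that follows.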
# presence/manager.py

"""
Presence Management

Optimizations WhatsApp uses:
1. Batching: Aggregate multiple presence updates
2. Rate limiting: Don't send typing every keystroke
3. TTL: Presence expires automatically
4. Selective delivery: Only send to active chats
"""

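import json
from dataclasses import dataclass
from typing import Dict, Set, Optional
from datetime import datetime, timedelta
from enum import Enum


class PresenceState(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    TYPING = "typing"


@dataclass
class UserPresence:
    """Current presence state for a user."""
    user_id: str
    state: PresenceState
    last_seen: datetime
    typing_in_chat: Optional[str] = None

    def to_json(self) -> str:
        """Serialize for storage in Redis."""
        return json.dumps({
            "user_id": self.user_id,
            "state": self.state.value,
            "last_seen": self.last_seen.isoformat(),
            "typing_in_chat": self.typing_in_chat,
        })

    @classmethod
    def from_json(cls, data) -> "UserPresence":
        """Rebuild from the JSON produced by to_json()."""
        raw = json.loads(data)
        raw["state"] = PresenceState(raw["state"])
        raw["last_seen"] = datetime.fromisoformat(raw["last_seen"])
        return cls(**raw)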


class PresenceManager:
    """
    Manages presence state and notifications.
    
    Key optimizations:
    1. Live state in memory only (only last seen is persisted)
    2. TTL-based expiry (online expires after 60s without a refresh)
    3. Rate limiting (typing indicator throttled)
    4. Subscription-based (only notify interested parties)
    """
    
    def __init__(self, redis_cluster, notification_service):
        self.redis = redis_cluster
        self.notifier = notification_service
        
        # TTLs
        self.online_ttl = timedelta(seconds=60)
        self.typing_ttl = timedelta(seconds=5)
    
    async def set_online(self, user_id: str):
        """
        Mark user as online.
        
        Uses Redis with TTL: the key expires on its own, which reads as offline.
        Call again as a heartbeat before the TTL runs out to stay online.
        """
        presence = UserPresence(
            user_id=user_id,
            state=PresenceState.ONLINE,
            last_seen=datetime.utcnow()
        )
        
        # Store with TTL (auto-expires to offline)
        await self.redis.set(
            f"presence:{user_id}",
            presence.to_json(),
            ex=int(self.online_ttl.total_seconds())
        )
        
        # Notify subscribers
        await self._notify_subscribers(user_id, presence)
    
    async def set_typing(self, user_id: str, chat_id: str):
        """
        Mark user as typing in a specific chat.
        
        Rate limited: max 1 update per second per chat.
        """
        # Rate limit check: SET with NX + EX creates the marker atomically
        # and returns None if it already exists (sent within the last second)
        rate_key = f"typing_rate:{user_id}:{chat_id}"
        if not await self.redis.set(rate_key, "1", ex=1, nx=True):
            return  # Already sent recently, skip
        
        # Update typing state
        await self.redis.set(
            f"typing:{user_id}:{chat_id}",
            "1",
            ex=int(self.typing_ttl.total_seconds())
        )
        
        # Notify only the other participant(s) of this chat
        await self._notify_chat_participants(
            chat_id, 
            user_id, 
            PresenceState.TYPING
        )
    
    async def get_presence(self, user_id: str) -> Optional[UserPresence]:
        """
        Get current presence for a user.
        
        Used when opening a chat to show current status.
        """
        data = await self.redis.get(f"presence:{user_id}")
        
        if data:
            return UserPresence.from_json(data)
        else:
            # No presence record = offline
            # Get last seen from persistent storage
            last_seen = await self._get_last_seen(user_id)
            return UserPresence(
                user_id=user_id,
                state=PresenceState.OFFLINE,
                last_seen=last_seen
            )
    
    async def _notify_subscribers(
        self,
        user_id: str,
        presence: UserPresence
    ):
        """
        Notify users who care about this user's presence.
        
        Optimization: Only notify users who have this user's
        chat currently open (not all contacts).
        """
        # Get list of users with this chat open
        subscribers = await self.redis.smembers(
            f"presence_subscribers:{user_id}"
        )
        
        for subscriber_id in subscribers:
            await self.notifier.send_presence_update(
                subscriber_id,
                presence
            )
    
    async def subscribe_to_presence(
        self,
        subscriber_id: str,
        target_user_id: str
    ):
        """
        Subscribe to presence updates for a user.
        
        Called when user opens a chat.
        """
        await self.redis.sadd(
            f"presence_subscribers:{target_user_id}",
            subscriber_id
        )
        
        # Expire the subscriber set 10 minutes after the latest subscribe.
        # The TTL is per key, not per member, so every new subscriber resets it.
        await self.redis.expire(
            f"presence_subscribers:{target_user_id}",
            600
        )
    
    async def unsubscribe_from_presence(
        self,
        subscriber_id: str,
        target_user_id: str
    ):
        """
        Unsubscribe from presence updates.
        
        Called when user closes a chat.
        """
        await self.redis.srem(
            f"presence_subscribers:{target_user_id}",
            subscriber_id
        )
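
The two core ideas in the listing above, TTL-based presence and throttled typing events, can be tried out without Redis. The following is a self-contained sketch under stated assumptions: `InMemoryPresence` is a hypothetical class, state lives in plain dictionaries, and the clock is injectable so the behaviour can be checked deterministically.

```python
import time
from typing import Callable, Dict, Tuple


class InMemoryPresence:
    """TTL-based presence with a per-chat typing throttle (illustrative)."""

    def __init__(self, online_ttl: float = 60.0, typing_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self.online_ttl = online_ttl
        self.typing_interval = typing_interval
        self.clock = clock
        self._online_until: Dict[str, float] = {}
        self._last_typing: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, user_id: str) -> None:
        """Mark the user online; must be repeated before the TTL runs out."""
        self._online_until[user_id] = self.clock() + self.online_ttl

    def is_online(self, user_id: str) -> bool:
        """A missing or expired record means offline; no explicit logout needed."""
        return self._online_until.get(user_id, float("-inf")) > self.clock()

    def should_send_typing(self, user_id: str, chat_id: str) -> bool:
        """Allow at most one typing event per interval per (user, chat)."""
        key = (user_id, chat_id)
        now = self.clock()
        last = self._last_typing.get(key)
        if last is not None and now - last < self.typing_interval:
            return False
        self._last_typing[key] = now
        return True


# Fake clock so the example is deterministic.
now = [0.0]
presence = InMemoryPresence(clock=lambda: now[0])

presence.heartbeat("alice")
online_at_start = presence.is_online("alice")
first_typing = presence.should_send_typing("alice", "chat-1")
second_typing = presence.should_send_typing("alice", "chat-1")  # same instant

now[0] = 1.5
typing_after_interval = presence.should_send_typing("alice", "chat-1")

now[0] = 61.0
online_after_ttl = presence.is_online("alice")
```

The second typing event at the same instant is suppressed, the one 1.5 seconds later goes through, and the user reads as offline once 60 seconds pass without another heartbeat.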

Phase 5: Scaling and Edge Cases

Interviewer: "WhatsApp grew from 0 to 500 million users in 5 years. How did they scale without rewriting everything?"

You: "Erlang's design made horizontal scaling almost automatic. But they still faced challenges..."

WhatsApp's Scaling Journey

WHATSAPP SCALING TIMELINE

2009: Launch
├── Few servers
├── Basic XMPP protocol
└── Thousands of users

2011: 1 million users
├── Custom Erlang backend
├── Modified ejabberd
└── ~10 servers

2013: 200 million users
├── 2 million connections per server achieved
├── Custom BEAM patches
└── ~100 servers

2014 (acquisition): 500 million users
├── 50 billion messages/day
├── ~550 servers
└── 32 backend engineers

2020: 2 billion users
├── 100+ billion messages/day
├── Thousands of servers
└── Still running on Erlang

KEY INSIGHT:
The architecture scaled 2000x (1M → 2B users) with
the same fundamental design. That's the power of
choosing the right tool for the job.

Edge Cases

EDGE CASE 1: Celebrity with 10M Followers Posts Status

Problem: Status update needs to reach 10 million people
Impact: Massive fan-out, could overwhelm system

Solution:
├── Status updates are pull-based, not push
├── User's device polls for updates when opening status tab
├── Updates cached at edge servers
├── Same update served to millions from cache
└── No per-follower push notification
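
The pull-plus-cache idea can be sketched in a few lines. This is an illustration only: `EdgeStatusCache` and `load_status` are hypothetical names, the cache is a plain dictionary, and a real edge cache would also need expiry and size limits.

```python
from typing import Callable, Dict


class EdgeStatusCache:
    """Serves one author's status to many readers from a single cached copy."""

    def __init__(self, fetch_from_origin: Callable[[str], str]):
        self._fetch = fetch_from_origin
        self._cache: Dict[str, str] = {}
        self.origin_hits = 0

    def get_status(self, author_id: str) -> str:
        """Called when a reader opens the status tab (pull, not push)."""
        if author_id not in self._cache:
            self.origin_hits += 1
            self._cache[author_id] = self._fetch(author_id)
        return self._cache[author_id]

    def invalidate(self, author_id: str) -> None:
        """Called when the author posts a new status."""
        self._cache.pop(author_id, None)


def load_status(author_id: str) -> str:
    """Hypothetical origin lookup; a real system would read from storage."""
    return f"status-of-{author_id}"


cache = EdgeStatusCache(load_status)
for _ in range(10_000):            # 10,000 readers open the status tab
    cache.get_status("celebrity")
hits_before_new_post = cache.origin_hits

cache.invalidate("celebrity")      # the celebrity posts again
cache.get_status("celebrity")
hits_after_new_post = cache.origin_hits
```

Ten thousand reads cost one origin lookup, and a new post costs one more. The work scales with the number of posts, not the number of followers.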

EDGE CASE 2: Group with 1000 Members

Problem: Each message needs 1000 deliveries
Impact: 1 message = 1000x amplification

Solution:
├── Sender Key protocol (not pairwise encryption)
├── Message encrypted once, decrypted by all members
├── Server fans out the same ciphertext to every member
├── Group metadata (membership) changes trigger rekey
├── Stagger delivery to avoid thundering herd
└── Members who leave can't read future messages
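
Staggered delivery can be as simple as splitting recipients into batches and giving each batch a send offset. The sketch below shows one way to do that; `staggered_schedule`, the batch size of 250, and the 50 ms gap are all assumptions made for illustration, not values from WhatsApp.

```python
from typing import List, Tuple


def staggered_schedule(member_ids: List[str], batch_size: int,
                       delay_ms: int) -> List[Tuple[int, List[str]]]:
    """Split recipients into batches and give each batch a send offset in ms."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    schedule = []
    for start in range(0, len(member_ids), batch_size):
        offset_ms = (start // batch_size) * delay_ms
        schedule.append((offset_ms, member_ids[start:start + batch_size]))
    return schedule


members = [f"user-{i}" for i in range(1000)]
schedule = staggered_schedule(members, batch_size=250, delay_ms=50)

for offset_ms, batch in schedule:
    print(f"t+{offset_ms}ms: deliver to {len(batch)} members")
```

A 1000-member group becomes four bursts of 250 spread over 150 ms instead of one burst of 1000, which smooths the load on the connection servers at the cost of a small delay for later batches.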

EDGE CASE 3: Network Partition

Problem: User connected to Server A, recipient on Server B
         Network between A and B fails

Solution:
├── Messages queued at Server A
├── Periodic retry to deliver to Server B
├── If partition heals, messages delivered
├── If partition persists, messages expire after 30 days
└── Sender sees no "delivered" status; no explicit failure notice is sent
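
The retry-until-expiry policy can be expressed as a single pure function. This is a sketch under stated assumptions: the 30-day expiry comes from the text above, while the exponential backoff, the 1-second base delay, and the 5-minute ceiling are illustrative choices rather than documented WhatsApp behaviour.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MESSAGE_TTL = timedelta(days=30)     # expiry window from the text above
BASE_DELAY = timedelta(seconds=1)    # assumed starting backoff
MAX_DELAY = timedelta(minutes=5)     # assumed backoff ceiling


def next_retry_at(queued_at: datetime, now: datetime,
                  attempt: int) -> Optional[datetime]:
    """Return when to retry next, or None once the message has expired.

    attempt is the number of failed delivery attempts so far (>= 1).
    """
    expires_at = queued_at + MESSAGE_TTL
    if now >= expires_at:
        return None  # drop the message instead of retrying forever
    exponent = min(max(attempt - 1, 0), 16)  # bound the exponent to avoid overflow
    delay = min(BASE_DELAY * (2 ** exponent), MAX_DELAY)
    return min(now + delay, expires_at)


queued = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(next_retry_at(queued, queued, attempt=1))
print(next_retry_at(queued, queued, attempt=4))
print(next_retry_at(queued, queued + timedelta(days=31), attempt=1))
```

Capping the delay keeps recovery fast once the partition heals, and returning `None` after the expiry window is what keeps the queue from growing without bound.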

EDGE CASE 4: Device Goes Underwater (Literally)

Problem: User's phone destroyed, never comes back online
Impact: Without expiry, their offline queue would grow forever

Solution:
├── Messages expire after 30 days
├── Expired messages deleted from server
├── Sender never gets "delivered" status
├── No infinite queue growth
└── New device = new session, old messages lost

Phase 6: Why 50 Engineers?

You: "The small team is the most impressive part. Here's why it worked..."

WHY 50 ENGINEERS COULD BUILD WHATSAPP

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  1. TECHNOLOGY CHOICES REDUCED COMPLEXITY                               │
│  ────────────────────────────────────────                               │
│                                                                         │
│  Erlang's actor model:                                                  │
│  ├── No mutex/lock management (huge complexity source)                  │
│  ├── Built-in distribution (no Kafka/RabbitMQ to manage)                │
│  ├── Hot code reloading (no complex deployment pipelines)               │
│  └── Supervision trees (automatic failure recovery)                     │
│                                                                         │
│  FreeBSD:                                                               │
│  ├── Simpler than Linux (one distribution)                              │
│  ├── Better network stack for their use case                            │
│  └── Team had deep expertise (ex-Yahoo)                                 │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  2. FOCUS REDUCED SCOPE                                                 │
│  ──────────────────────                                                 │
│                                                                         │
│  What WhatsApp DIDN'T build:                                            │
│  ├── Advertising system (no ads = no ad tech team)                      │
│  ├── Social feed (no feed ranking team)                                 │
│  ├── Recommendation engine (no ML team)                                 │
│  ├── Third-party integrations (no platform team)                        │
│  ├── Multiple products (no product sprawl)                              │
│  └── Enterprise features (initially)                                    │
│                                                                         │
│  Every feature NOT built = engineers NOT needed                         │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  3. COMMUNICATION OVERHEAD IS QUADRATIC                                 │
│  ──────────────────────────────────────                                 │
│                                                                         │
│  Team of 50: 50 × 49 / 2 = 1,225 communication paths                    │
│  Team of 500: 500 × 499 / 2 = 124,750 paths                             │
│                                                                         │
│  Larger teams spend more time coordinating than coding.                 │
│  Small teams move FASTER for well-scoped problems.                      │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  4. DEEP EXPERTISE > BROAD KNOWLEDGE                                    │
│  ───────────────────────────────────                                    │
│                                                                         │
│  Every engineer knew:                                                   │
│  ├── The entire codebase                                                │
│  ├── Erlang deeply (not just superficially)                             │
│  ├── FreeBSD kernel internals                                           │
│  └── The complete message flow                                          │
│                                                                         │
│  No "that's not my service" mentality.                                  │
│  Anyone could debug any issue.                                          │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  5. CODE REVIEW BY FOUNDERS                                             │
│  ──────────────────────────                                             │
│                                                                         │
│  In early days, Jan Koum reviewed every line of backend code.           │
│  This ensured:                                                          │
│  ├── Consistent style                                                   │
│  ├── No unnecessary complexity                                          │
│  ├── Deep understanding by leadership                                   │
│  └── Quality over velocity                                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
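
The communication-path figures in point 3 come from counting distinct pairs of people, n × (n − 1) / 2. A short sketch to reproduce them (the function name is illustrative):

```python
def communication_paths(team_size: int) -> int:
    """Number of distinct pairs in a team of the given size: n * (n - 1) / 2."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


small = communication_paths(50)
large = communication_paths(500)

print(f"team of 50:  {small:,} paths")
print(f"team of 500: {large:,} paths")
print(f"10x the people, about {large / small:.0f}x the paths")
```

Growing the team tenfold multiplies the pairwise paths roughly a hundredfold, which is the sense in which the overhead is quadratic.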

Interview Conclusion

Interviewer: "Excellent coverage. Quick questions:"

Interviewer: "If you were starting today, would you still use Erlang?"

You: "It depends. Erlang is still excellent for this use case, but:

Advantages of Erlang:

  • Perfect for massive concurrency
  • Battle-tested at WhatsApp scale
  • 'Let it crash' philosophy reduces code complexity

Disadvantages:

  • Smaller talent pool (hard to hire)
  • Smaller library ecosystem than Go/Rust
  • Learning curve for new engineers

Modern alternatives:

  • Go: Good concurrency, larger talent pool
  • Rust: Performance, growing ecosystem
  • Elixir: Erlang VM with nicer syntax

For a startup, I'd probably choose Go for the talent pool. For WhatsApp's specific requirements, Erlang is still hard to beat."

Interviewer: "What's the hardest part of building a messaging system?"

You: "Three things:

  1. Effectively-once delivery: Messages should arrive once, in order, or clearly fail. Over an unreliable network this means at-least-once delivery plus client-side deduplication, which is hard to get right.

  2. End-to-end encryption at scale: Key management, device verification, group re-keying. Most teams underestimate this complexity.

  3. Presence at scale: The 'chattiest' part of the system. Easy to ignore, hard to scale, and a common source of outages."
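
Delivery that looks like "once" to the user is commonly built from at-least-once delivery plus deduplication on the client, the combination this case study's summary also lists. A minimal sketch of the client side, assuming each message carries a unique ID; `ClientDeduplicator` and its bounded memory of 10,000 IDs are illustrative choices:

```python
from collections import OrderedDict
from typing import List


class ClientDeduplicator:
    """Remembers recently seen message IDs so redelivered copies are dropped."""

    def __init__(self, max_remembered: int = 10_000):
        self.max_remembered = max_remembered
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def accept(self, message_id: str) -> bool:
        """Return True the first time an ID is seen, False for duplicates."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self.max_remembered:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True


dedup = ClientDeduplicator()
incoming = ["m1", "m2", "m1", "m3", "m2"]   # the server retried m1 and m2
displayed: List[str] = [m for m in incoming if dedup.accept(m)]
print(displayed)
```

The server is free to retry whenever an acknowledgement is lost, because a repeated ID is simply ignored on arrival. The bound on remembered IDs is a trade-off: a duplicate that arrives after its ID has been evicted would be shown again.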


Summary: Concepts Applied from 10-Week Course

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│             CONCEPTS FROM 10-WEEK COURSE IN WHATSAPP DESIGN             │
│                                                                         │
│  WEEK 1: DATA AT SCALE                                                  │
│  ├── Partitioning: Users partitioned across "islands"                   │
│  ├── Replication: Each island has a replica                             │
│  └── In-memory storage: ETS/Mnesia for speed                            │
│                                                                         │
│  WEEK 2: FAILURE-FIRST DESIGN                                           │
│  ├── Let it crash: Erlang supervision trees                             │
│  ├── Process isolation: One crash doesn't affect others                 │
│  ├── Circuit breakers: Island isolation                                 │
│  └── Retry: Message redelivery for transient failures                   │
│                                                                         │
│  WEEK 3: MESSAGING & ASYNC                                              │
│  ├── Offline queues: Messages stored until delivered                    │
│  ├── At-least-once delivery: With deduplication at client               │
│  └── Backpressure: Queue size limits                                    │
│                                                                         │
│  WEEK 4: CACHING                                                        │
│  ├── Presence caching: In-memory with TTL                               │
│  ├── Write-back cache: For offline message persistence                  │
│  └── Edge caching: Media and status updates                             │
│                                                                         │
│  WEEK 5: CONSISTENCY                                                    │
│  ├── Eventual consistency: Presence can be stale                        │
│  ├── At-least-once: Messages, with client dedup                         │
│  └── Message ordering: Per-conversation, not global                     │
│                                                                         │
│  WEEK 9: SECURITY                                                       │
│  ├── End-to-end encryption: Signal Protocol                             │
│  ├── Forward secrecy: Double Ratchet algorithm                          │
│  ├── Key management: X3DH key agreement                                 │
│  └── Zero-knowledge: Server can't read messages                         │
│                                                                         │
│  WEEK 10: OPERATIONS                                                    │
│  ├── Hot code reloading: Deploy without restart                         │
│  ├── Observability: Per-process stats in BEAM                           │
│  └── Small team: Simplicity enables velocity                            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Why WhatsApp Matters

╔═════════════════════════════════════════════════════════════════════════╗
║                                                                         ║
║                  WHY WHATSAPP IS AN ENGINEERING MARVEL                  ║
║                                                                         ║
║  EFFICIENCY                                                             ║
║  ──────────                                                             ║
║  • 2 million connections per server (20x industry norm)                 ║
║  • 50 billion messages/day with 32 engineers                            ║
║  • $19 billion valuation with 55 employees                              ║
║                                                                         ║
║  RELIABILITY                                                            ║
║  ───────────                                                            ║
║  • 99.99%+ uptime                                                       ║
║  • Messages reliably delivered worldwide                                ║
║  • Works on 2G networks in developing countries                         ║
║                                                                         ║
║  PRIVACY                                                                ║
║  ───────                                                                ║
║  • End-to-end encryption by default                                     ║
║  • Even WhatsApp can't read your messages                               ║
║  • Signal Protocol: Gold standard for secure messaging                  ║
║                                                                         ║
║  SIMPLICITY                                                             ║
║  ──────────                                                             ║
║  • No ads, no social feed, no distractions                              ║
║  • One feature done exceptionally well                                  ║
║  • Focus enabled small team to move fast                                ║
║                                                                         ║
║  ═════════════════════════════════════════════════════════════════════  ║
║                                                                         ║
║  "The best systems are simple enough that a small team can              ║
║   understand them completely, yet powerful enough to serve              ║
║   billions of users."                                                   ║
║                                                                         ║
║  WhatsApp proved that architecture beats headcount.                     ║
║                                                                         ║
╚═════════════════════════════════════════════════════════════════════════╝

Self-Assessment Checklist

After studying this case study, you should be able to:

Architecture:

  • Explain why Erlang is well-suited for messaging
  • Design a connection handling system for millions of users
  • Implement offline message queuing and delivery

Distributed Systems:

  • Apply the "let it crash" philosophy
  • Design supervision trees for failure recovery
  • Handle network partitions gracefully

Security:

  • Explain Signal Protocol at a high level
  • Understand forward secrecy and its importance
  • Design key exchange for end-to-end encryption

Performance:

  • Optimize presence/typing for high frequency updates
  • Use TTL-based expiry for ephemeral data
  • Apply batching and rate limiting

Team Efficiency:

  • Understand how technology choices affect team size
  • Value simplicity over cleverness
  • Focus on doing one thing well

Further Reading

Conference Talks (Highly Recommended):

  • Rick Reed - "Scaling to Millions of Simultaneous Connections" (Erlang Factory 2012)
    • YouTube: Search "Rick Reed WhatsApp Erlang"
  • Rick Reed - "That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp" (Erlang Factory 2014)
    • Details on achieving 2M+ connections per server
  • Anton Lavrik - "A Reflection on Building the WhatsApp Server" (Code BEAM 2018)
    • Post-Facebook acquisition insights
  • Eugene Fooksman - "WhatsApp System Design" (Various conferences)

Books:

  • "Designing Data-Intensive Applications" by Martin Kleppmann
    • Chapters on messaging and distributed systems
  • "Programming Erlang" by Joe Armstrong (Erlang co-creator)
    • Understanding the language that powers WhatsApp
  • "Designing Elixir Systems with OTP" by James Edward Gray II and Bruce A. Tate
    • Modern take on BEAM VM patterns
  • "System Design Interview Vol 1 & 2" by Alex Xu
    • Messaging system design patterns

Related Systems to Study:

  • Telegram: Different architecture (MTProto protocol)
  • Signal: Open-source reference implementation
  • Discord: Modern take on real-time messaging
  • Slack: Enterprise messaging architecture

Research Papers:

  • "The Signal Protocol" - Formal security analysis papers
  • "Asynchronous Ratcheting Trees" - Group messaging improvements
  • "SoK: Secure Messaging" - Survey of secure messaging systems

Podcasts and Videos:

  • Software Engineering Daily: Episodes on messaging systems
  • InfoQ: Conference talks on distributed systems
  • Strange Loop Conference: Talks on Erlang/BEAM

End of Bonus Problem 2: WhatsApp Messaging

"50 engineers. 2 billion users. 100 billion messages. The right architecture makes the impossible possible."