Himanshu Kukreja
Capstone

Week 5 Capstone: Multiplayer Game Matchmaking System

🎯 A Real-World Problem Covering Everything You've Learned


The Interview Begins

You walk into the interview room at a gaming company. The interviewer, a Staff Engineer, smiles and gestures to the whiteboard.

Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions β€” this is meant to be collaborative."

They write on the whiteboard:

╔══════════════════════════════════════════════════════════════════════════╗
║                                                                          ║
║              Design a Multiplayer Game Matchmaking System                ║
║                                                                          ║
║   We're building a competitive multiplayer game. Players queue up        ║
║   to find matches with others of similar skill. Once matched, they       ║
║   connect to a game server and play.                                     ║
║                                                                          ║
║   Key requirements:                                                      ║
║   - Match players of similar skill (ELO rating)                          ║
║   - Support multiple game modes (1v1, 5v5, battle royale)                ║
║   - Minimize wait time while maintaining fair matches                    ║
║   - Handle players going offline during queue/matching                   ║
║   - Update player ratings after each match                               ║
║   - Support party queue (friends playing together)                       ║
║                                                                          ║
╚══════════════════════════════════════════════════════════════════════════╝

Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."


Phase 1: Requirements Clarification (5 minutes)

Before diving in, you take a breath and start asking questions. This is crucial - never assume.

Your Questions

You: "Before I start designing, I'd like to clarify a few requirements. First, what's the scale we're targeting? How many concurrent players?"

Interviewer: "At peak, we expect 2 million concurrent players in queue or in matches. We're targeting a global audience."

You: "And for match sizes β€” you mentioned 1v1, 5v5, and battle royale. What's the battle royale size?"

Interviewer: "Battle royale is 100 players. The 5v5 mode is our most popular, accounting for about 70% of matches."

You: "How quickly do players expect to find a match?"

Interviewer: "For 5v5, median wait time should be under 30 seconds. We'd accept up to 2 minutes for 95th percentile. Players get frustrated if they wait longer."

You: "What about skill matching? How strict should we be?"

Interviewer: "We use ELO ratings from 0-3000. Ideally, players in a match should be within 100 ELO of each other. But if wait time exceeds 60 seconds, we can expand the range. Quality versus speed trade-off."

You: "For party queue β€” can a high-skill player queue with a low-skill friend?"

Interviewer: "Yes, but we need to handle it carefully. The match should be fair for everyone. We typically use the highest-rated player in the party for matchmaking."

You: "What happens to ELO ratings after a match? Is it updated immediately?"

Interviewer: "Yes, ratings update immediately after the match ends. All 10 players in a 5v5 should see their new rating within seconds. This is important for competitive integrity β€” we can't have players see stale ratings."

You: "Finally, what about regional servers? Do players only match within their region?"

Interviewer: "We have game servers in 5 regions: NA, EU, APAC, SA, OCE. Players prefer low latency, so we match within region when possible. But during off-peak, we might cross-region match rather than have long waits."

You: "Perfect. Let me summarize the requirements."

Functional Requirements

1. QUEUE MANAGEMENT
   - Players can join/leave matchmaking queue
   - Support solo queue and party queue (2-5 players)
   - Multiple game modes: 1v1, 5v5, battle royale (100 players)
   - Players can queue for multiple modes simultaneously

2. MATCHMAKING
   - Match players within ELO range (±100 preferred, expand over time)
   - Balance teams by average ELO
   - Respect party groupings (keep friends on same team)
   - Consider player region for low latency

3. GAME SESSION
   - Allocate game server for matched players
   - Handle player confirmation (ready check)
   - Reconnect players who disconnect mid-match
   - Determine match outcome (win/loss/draw)

4. RATING UPDATES
   - Calculate new ELO for all players after match
   - Update ratings immediately (strong consistency)
   - Handle edge cases (disconnects, abandons)
   - Support rating decay for inactive players

Non-Functional Requirements

1. SCALE
   - 2M concurrent players
   - ~25K matches created per minute (peak)
   - 10M rating updates per hour

2. LATENCY
   - Queue join: <100ms p99
   - Match found notification: <50ms p99
   - Rating update visible: <500ms p99
   - Median queue time: <30 seconds (5v5)

3. CONSISTENCY
   - Ratings: Strong consistency (no stale reads)
   - Queue state: At-least-once processing
   - Match assignment: Exactly-once (no double-booking)

4. AVAILABILITY
   - 99.9% uptime for matchmaking
   - Graceful degradation during partial outages
   - No lost players in queue during failures

Phase 2: Back of the Envelope Estimation (5 minutes)

You: "Let me work through the numbers to understand the scale."

Traffic Estimation

MATCHMAKING TRAFFIC

Concurrent players:              2,000,000
Players in queue (estimated):    ~400,000 (20% at any time)
Players in matches:              ~1,600,000

Match composition (5v5):
  Players per match:             10
  Active matches:                160,000

Match duration (average):        20 minutes
  Matches ending per minute:     160,000 / 20 = 8,000 matches/min
  New matches starting:          ~8,000 matches/min

Queue operations:
  Join queue:                    ~400K players / 30 sec avg wait
                                 = ~13,000 joins/sec
  Leave queue (matched):         ~13,000/sec
  Leave queue (cancel):          ~1,000/sec

Peak multiplier:                 3x
  Peak joins:                    ~40,000/sec

Storage Estimation

PLAYER DATA

Per player:
  player_id:                     16 bytes (UUID)
  username:                      32 bytes
  elo_rating:                    4 bytes (int)
  region:                        4 bytes
  last_match:                    8 bytes (timestamp)
  Total:                         ~100 bytes

Total players (registered):      50M
Player data:                     50M × 100B = 5GB

MATCH HISTORY

Per match:
  match_id:                      16 bytes
  game_mode:                     4 bytes
  player_ids (10):               160 bytes
  elo_changes (10):              40 bytes
  outcome:                       20 bytes
  timestamps:                    16 bytes
  Total:                         ~300 bytes

Matches per day:                 8,000/min × 60 × 24 = 11.5M
Daily match data:                11.5M × 300B = 3.5GB
Yearly (retained):               1.3TB

QUEUE STATE (in-memory)

Per queued player:
  player_id:                     16 bytes
  elo:                           4 bytes
  queue_time:                    8 bytes
  game_modes:                    8 bytes
  party_id:                      16 bytes
  region:                        4 bytes
  Total:                         ~60 bytes

Queue memory (400K players):     400K × 60B = 24MB

Key Metrics Summary

┌────────────────────────────────────────────────────────────────────────┐
│                    ESTIMATION SUMMARY                                  │
│                                                                        │
│  TRAFFIC                                                               │
│  ├── Peak queue joins:         40,000 /second                          │
│  ├── Matches created:          ~130 /second                            │
│  └── Rating updates:           ~1,300 /second (10 per match)           │
│                                                                        │
│  STORAGE                                                               │
│  ├── Player profiles:          5 GB                                    │
│  ├── Match history (yearly):   1.3 TB                                  │
│  └── Queue state (memory):     24 MB                                   │
│                                                                        │
│  INFRASTRUCTURE (rough)                                                │
│  ├── Queue service:            10 instances (sharded by region)        │
│  ├── Matchmaker:               5 instances (one leader per region)     │
│  ├── Rating service:           10 instances                            │
│  ├── Database:                 3-node cluster (primary + 2 replicas)   │
│  └── Redis (queue state):      3-node cluster                          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
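
A quick sanity check: the script below is just the arithmetic from this section in code form. All inputs are the assumptions already stated above (5v5-dominant, 20-minute matches, 20% of players in queue, 3x peak multiplier).

# Back-of-the-envelope helper - a minimal sketch reproducing the estimates above.

CONCURRENT_PLAYERS = 2_000_000
QUEUE_FRACTION = 0.20          # ~20% of concurrent players are in queue
PLAYERS_PER_MATCH = 10         # 5v5 dominates
MATCH_DURATION_MIN = 20
AVG_WAIT_SEC = 30
PEAK_MULTIPLIER = 3

in_queue = CONCURRENT_PLAYERS * QUEUE_FRACTION              # ~400K
in_matches = CONCURRENT_PLAYERS - in_queue                  # ~1.6M
active_matches = in_matches / PLAYERS_PER_MATCH             # ~160K
matches_per_min = active_matches / MATCH_DURATION_MIN       # ~8K/min
queue_joins_per_sec = in_queue / AVG_WAIT_SEC               # ~13K/sec
peak_joins_per_sec = queue_joins_per_sec * PEAK_MULTIPLIER  # ~40K/sec
rating_updates_per_sec = matches_per_min / 60 * PLAYERS_PER_MATCH  # ~1.3K/sec

print(f"peak queue joins/sec:  {peak_joins_per_sec:,.0f}")
print(f"matches created/sec:   {matches_per_min / 60:,.0f}")
print(f"rating updates/sec:    {rating_updates_per_sec:,.0f}")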

Phase 3: High-Level Design (10 minutes)

You: "Now let me sketch out the high-level architecture."

System Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                    MATCHMAKING SYSTEM ARCHITECTURE                     │
│                                                                        │
│                           ┌─────────────┐                              │
│                           │   Players   │                              │
│                           │  (Clients)  │                              │
│                           └──────┬──────┘                              │
│                                  │                                     │
│                                  ▼                                     │
│                           ┌─────────────┐                              │
│                           │     CDN     │                              │
│                           │  WebSocket  │                              │
│                           │   Gateway   │                              │
│                           └──────┬──────┘                              │
│                                  │                                     │
│              ┌───────────────────┼───────────────────┐                 │
│              │                   │                   │                 │
│              ▼                   ▼                   ▼                 │
│       ┌─────────────┐     ┌─────────────┐     ┌─────────────┐          │
│       │    Queue    │     │  Matchmaker │     │   Rating    │          │
│       │   Service   │     │   Service   │     │   Service   │          │
│       └──────┬──────┘     └──────┬──────┘     └──────┬──────┘          │
│              │                   │                   │                 │
│              │         ┌─────────┴─────────┐         │                 │
│              │         │                   │         │                 │
│              ▼         ▼                   ▼         ▼                 │
│       ┌─────────────────────┐       ┌─────────────────────┐            │
│       │   Redis Cluster     │       │    PostgreSQL       │            │
│       │   (Queue State)     │       │    (Ratings/        │            │
│       │                     │       │     Matches)        │            │
│       └─────────────────────┘       └─────────────────────┘            │
│                                                                        │
│              ┌───────────────────┴───────────────────┐                 │
│              │                                       │                 │
│              ▼                                       ▼                 │
│       ┌─────────────┐                         ┌─────────────┐          │
│       │    Kafka    │                         │    etcd     │          │
│       │  (Events)   │                         │  (Leader    │          │
│       │             │                         │  Election)  │          │
│       └─────────────┘                         └─────────────┘          │
│                                                                        │
│       ┌─────────────┐     ┌─────────────┐     ┌─────────────┐          │
│       │ Game Server │     │ Game Server │     │ Game Server │          │
│       │  Pool (NA)  │     │  Pool (EU)  │     │  Pool (APAC)│          │
│       └─────────────┘     └─────────────┘     └─────────────┘          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Component Breakdown

You: "Let me walk through each component..."

1. WebSocket Gateway

Purpose: Maintain persistent connections with players

Key responsibilities:

  • WebSocket connection management
  • Authentication and session handling
  • Route messages to appropriate services
  • Push notifications (match found, rating updates)

Technology: Custom service with WebSocket support, horizontally scaled

2. Queue Service

Purpose: Manage player queue state

Key responsibilities:

  • Add/remove players from queue
  • Track queue time for each player
  • Handle party groupings
  • Publish events when queue state changes

Technology: Custom service + Redis for state
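
To make "Redis for state" concrete, here is a minimal sketch of queue join/leave: a sorted set scored by ELO plus a small per-player metadata record. The key layout (queue:{region}:{mode}) and the redis.asyncio client are illustrative assumptions, not a prescribed schema.

# Sketch: queue join/leave against a Redis sorted set (score = ELO).
# Key names are illustrative assumptions, not a fixed schema.
import json
import time
from typing import Optional

import redis.asyncio as redis


class QueueStateStore:
    def __init__(self, client: redis.Redis):
        self.redis = client

    async def join(self, region: str, mode: str, player_id: str,
                   elo: int, party_id: Optional[str] = None) -> None:
        queue_key = f"queue:{region}:{mode}"
        meta_key = f"queue_meta:{region}:{mode}:{player_id}"
        # Sorted set ordered by ELO lets the matchmaker scan skill bands cheaply.
        await self.redis.zadd(queue_key, {player_id: elo})
        # Join time drives wait-based range expansion; ELO is snapshotted at join.
        await self.redis.set(meta_key, json.dumps({
            "joined_at": time.time(),
            "party_id": party_id,
            "elo_at_join": elo,
        }))

    async def leave(self, region: str, mode: str, player_id: str) -> None:
        await self.redis.zrem(f"queue:{region}:{mode}", player_id)
        await self.redis.delete(f"queue_meta:{region}:{mode}:{player_id}")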

3. Matchmaker Service

Purpose: Create balanced matches from queued players

Key responsibilities:

  • Run matching algorithm periodically
  • Balance teams by ELO
  • Allocate game servers
  • Coordinate ready checks

Technology: Custom service with leader election (one per region)
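
To make the matching step itself concrete, here is a rough sketch of one cycle: greedily sweep the ELO-sorted queue and cut it into 10-player groups whose spread fits everyone's allowed window, widening the window with wait time (the ±100 ELO / 60-second rule from the requirements). Team balancing, parties, and regions are left out, and the expansion rate of +5 ELO per extra second is an assumed placeholder.

# Sketch: one matchmaking cycle as a greedy sweep over the ELO-sorted queue.
import time
from dataclasses import dataclass


@dataclass
class QueueEntry:
    player_id: str
    elo: int
    joined_at: float


def allowed_spread(entry: QueueEntry, now: float) -> int:
    """±100 ELO preferred; widen once a player has waited over 60s."""
    wait = now - entry.joined_at
    return 100 if wait < 60 else 100 + int((wait - 60) * 5)  # assumed expansion rate


def form_matches(queue: list[QueueEntry], team_size: int = 5) -> list[list[QueueEntry]]:
    """Greedily group 2*team_size players whose ELO spread fits everyone's window."""
    now = time.time()
    queue = sorted(queue, key=lambda e: e.elo)
    matches, current = [], []
    for entry in queue:
        if current and any(
            abs(entry.elo - other.elo) > min(allowed_spread(entry, now),
                                             allowed_spread(other, now))
            for other in current
        ):
            current = []  # spread too wide; unmatched players wait for the next cycle
        current.append(entry)
        if len(current) == 2 * team_size:
            matches.append(current)
            current = []
    return matches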

4. Rating Service

Purpose: Manage player ratings with strong consistency

Key responsibilities:

  • Calculate ELO changes after matches
  • Update ratings transactionally
  • Serve rating queries
  • Handle rating decay

Technology: Custom service + PostgreSQL

Data Flow

You: "Let me trace through a typical matchmaking flow..."

MATCHMAKING FLOW (5v5)

Step 1: QUEUE JOIN
        Player clicks "Find Match"
        Client ──▢ Gateway ──▢ Queue Service
        Queue Service adds player to Redis sorted set (by ELO)
        
Step 2: MATCHMAKING (runs every 500ms)
        Matchmaker (leader) reads queue from Redis
        Finds 10 compatible players
        Creates provisional match
        
Step 3: READY CHECK
        Matchmaker ──▢ Gateway ──▢ 10 Players
        "Match found! Accept?"
        Players confirm (30 second timeout)
        
Step 4: MATCH CONFIRMATION
        All 10 accept:
          Matchmaker allocates game server
          Matchmaker writes match to PostgreSQL
          Matchmaker publishes MatchCreated event
        Some decline:
          Decliners removed from queue
          Accepters returned to front of queue
          
Step 5: GAME SESSION
        Players connect to game server
        Game plays out (20 minutes average)
        Game server reports outcome
        
Step 6: RATING UPDATE
        Game Server ──▢ Kafka ──▢ Rating Service
        Rating Service calculates new ELO
        Rating Service updates PostgreSQL (transaction)
        Rating Service publishes RatingUpdated events
        Gateway pushes new ratings to players

Phase 4: Deep Dives (20 minutes)

Interviewer: "Great high-level design. Let's dive deeper into a few areas. How do you ensure only one matchmaker is running per region?"


Deep Dive 1: Leader Election for Matchmaker (Week 5, Day 5)

You: "This is critical β€” if two matchmakers run simultaneously, players could be double-booked into different matches. I'd use leader election with fencing tokens."

The Problem

DOUBLE-BOOKING SCENARIO

Without proper leader election:

  Matchmaker A (thinks it's leader):
    Reads queue: [Player1, Player2, ... Player10]
    Creates Match-1 with these players
    
  Matchmaker B (also thinks it's leader):
    Reads queue: [Player1, Player2, ... Player10]
    Creates Match-2 with same players!
    
  Result: Players assigned to two matches
          → Confusion, broken game state

The Solution

You: "I'd use etcd for leader election with fencing tokens. The matchmaker must include its fencing token when modifying queue state."

LEADER ELECTION FLOW

┌────────────────────────────────────────────────────────────────────────┐
│                    MATCHMAKER LEADER ELECTION                          │
│                                                                        │
│  Region: NA                                                            │
│                                                                        │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐               │
│  │ Matchmaker  │     │ Matchmaker  │     │ Matchmaker  │               │
│  │     A       │     │     B       │     │     C       │               │
│  │  (LEADER)   │     │  (standby)  │     │  (standby)  │               │
│  │  token=42   │     │             │     │             │               │
│  └──────┬──────┘     └──────┬──────┘     └──────┬──────┘               │
│         │                   │                   │                      │
│         └───────────────────┼───────────────────┘                      │
│                             │                                          │
│                             ▼                                          │
│                      ┌─────────────┐                                   │
│                      │    etcd     │                                   │
│                      │ /matchmaker │                                   │
│                      │ /na/leader  │                                   │
│                      └─────────────┘                                   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Implementation

# Matchmaker Leader Election with Fencing Tokens

import etcd3
from dataclasses import dataclass
from typing import Optional
import asyncio
import time


@dataclass
class LeaderLease:
    """Leadership lease with fencing token."""
    token: int
    expires_at: float
    
    def is_valid(self) -> bool:
        return time.time() < self.expires_at


class MatchmakerLeaderElection:
    """
    Leader election for matchmaker service.
    
    Ensures only ONE matchmaker runs per region.
    Uses fencing tokens to prevent zombie leaders.
    
    Week 5, Day 5: Leader Election concepts applied
    """
    
    def __init__(
        self,
        etcd_host: str,
        region: str,
        node_id: str,
        lease_ttl: int = 10
    ):
        self.client = etcd3.client(host=etcd_host)
        self.region = region
        self.node_id = node_id
        self.lease_ttl = lease_ttl
        self.election_key = f"/matchmaker/{region}/leader"
        
        self.lease: Optional[LeaderLease] = None
        self.etcd_lease = None
        self.is_leader = False
    
    async def campaign(self) -> Optional[int]:
        """
        Campaign for leadership.
        
        Returns fencing token if elected, None otherwise.
        """
        # Create lease
        self.etcd_lease = self.client.lease(self.lease_ttl)
        
        # Try to become leader (atomic compare-and-set)
        success, responses = self.client.transaction(
            compare=[
                self.client.transactions.version(self.election_key) == 0
            ],
            success=[
                self.client.transactions.put(
                    self.election_key,
                    self.node_id.encode(),
                    lease=self.etcd_lease
                )
            ],
            failure=[]
        )
        
        if success:
            # We're the leader - get fencing token from key version
            value, metadata = self.client.get(self.election_key)
            fencing_token = metadata.mod_revision  # Monotonically increasing
            
            self.lease = LeaderLease(
                token=fencing_token,
                expires_at=time.time() + self.lease_ttl
            )
            self.is_leader = True
            
            # Start keepalive
            asyncio.create_task(self._keepalive_loop())
            
            return fencing_token
        
        return None
    
    async def _keepalive_loop(self):
        """Keep lease alive while we're leader."""
        while self.is_leader:
            try:
                self.etcd_lease.refresh()
                self.lease.expires_at = time.time() + self.lease_ttl
                await asyncio.sleep(self.lease_ttl / 3)
            except Exception as e:
                self.is_leader = False
                break
    
    def get_fencing_token(self) -> Optional[int]:
        """Get current fencing token."""
        if self.lease and self.lease.is_valid():
            return self.lease.token
        return None
    
    async def resign(self):
        """Resign from leadership."""
        self.is_leader = False
        if self.etcd_lease:
            self.etcd_lease.revoke()


class FencedQueueOperations:
    """
    Queue operations protected by fencing tokens.
    
    Redis operations include fencing token validation.
    Prevents zombie leaders from corrupting state.
    """
    
    def __init__(self, redis_client):
        self.redis = redis_client
        # Track highest token seen per region
        self.highest_token_key = "matchmaker:highest_token:{region}"
    
    async def remove_players_from_queue(
        self,
        region: str,
        player_ids: list[str],
        fencing_token: int
    ) -> bool:
        """
        Remove players from queue atomically with fencing.
        
        Only succeeds if fencing_token >= highest seen.
        """
        # Lua script for atomic fencing check + removal
        script = """
        local highest_key = KEYS[1]
        local queue_key = KEYS[2]
        local token = tonumber(ARGV[1])
        local player_count = tonumber(ARGV[2])
        
        -- Check fencing token
        local highest = tonumber(redis.call('GET', highest_key) or '0')
        if token < highest then
            return 0  -- Stale token, reject
        end
        
        -- Update highest token
        redis.call('SET', highest_key, token)
        
        -- Remove players from queue
        for i = 3, 2 + player_count do
            redis.call('ZREM', queue_key, ARGV[i])
        end
        
        return 1  -- Success
        """
        
        # redis-py asyncio signature: eval(script, numkeys, *keys_and_args)
        result = await self.redis.eval(
            script,
            2,  # number of KEYS
            self.highest_token_key.format(region=region),
            f"queue:{region}",
            fencing_token,
            len(player_ids),
            *player_ids
        )
        
        return result == 1

Edge Cases

Interviewer: "What happens if the leader crashes mid-match-creation?"

You: "The match creation is a saga. If the leader crashes, the saga either completes via compensation or a new leader takes over and cleans up."

LEADER CRASH RECOVERY

Scenario:
  Leader starts creating match
  Leader crashes after removing from queue but before writing match
  
Recovery:
  1. etcd lease expires (10 seconds)
  2. New leader elected with higher fencing token
  3. New leader checks for "orphaned" queue removals
  4. Players without match assignment are re-queued
  
Implementation:
  - Match creation is a saga with compensation
  - Queue removal records "pending_match_id"
  - If match not confirmed within timeout, players re-queued
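
Below is a sketch of that recovery sweep, run by a newly elected leader. It assumes each queue removal wrote a pending:{region}:{match_id} hash containing the removed player IDs and a timestamp, that the Redis client was created with decode_responses=True, and that match_repository.exists is a hypothetical lookup; the names and the 60-second timeout are illustrative.

# Sketch: new leader scans for queue removals that never became matches
# and re-queues the affected players. Key layout is an illustrative assumption.
import time


async def recover_orphaned_removals(redis, match_repository, queue_service,
                                    region: str, timeout_sec: int = 60):
    now = time.time()
    async for key in redis.scan_iter(match=f"pending:{region}:*"):
        pending = await redis.hgetall(key)
        match_id = key.split(":")[-1]
        if now - float(pending["removed_at"]) < timeout_sec:
            continue  # still within the normal ready-check window
        if await match_repository.exists(match_id):
            await redis.delete(key)  # match record was written; nothing to repair
            continue
        # Crash happened between queue removal and match write: re-queue players.
        await queue_service.add_players_priority(
            region=region,
            player_ids=pending["player_ids"].split(","),
            priority=True,
        )
        await redis.delete(key)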

Deep Dive 2: Rating Updates with Strong Consistency (Week 5, Day 1)

You: "Rating updates need strong consistency. Players must see their correct rating immediately after a match."

The Problem

STALE RATING PROBLEM

Without strong consistency:

  T=0:   Match ends, Player A won
         Server 1 updates rating: 1500 → 1520
         
  T=1:   Player A checks profile (reads from replica)
         Replica hasn't received update yet
         Player A sees: 1500 (STALE!)
         
  T=2:   Player A queues for next match
         Queue uses stale rating: 1500
         Matchmaker creates unfair match!
         
  This violates competitive integrity.

The Solution

You: "I'd use read-your-writes consistency for the player who just finished a match, and linearizable reads for matchmaking."

CONSISTENCY STRATEGIES

┌─────────────────────────────────────────────────────────────────────────┐
│                    RATING CONSISTENCY MODEL                             │
│                                                                         │
│  OPERATION              │ CONSISTENCY      │ IMPLEMENTATION             │
│  ───────────────────────┼──────────────────┼─────────────────────────   │
│  Rating update          │ Linearizable     │ Write to primary           │
│  Player views own rating│ Read-your-writes │ Read from primary if       │
│                         │                  │ recent update              │
│  Matchmaker reads rating│ Linearizable     │ Always read from primary   │
│  Leaderboard            │ Eventual         │ Read from replica (cached) │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Implementation

# Rating Service with Strong Consistency

from dataclasses import dataclass
from typing import Optional, Dict, Tuple
from datetime import datetime, timedelta
import asyncpg
import redis.asyncio as redis  # async client, since the calls below are awaited


@dataclass
class RatingUpdate:
    """Result of a rating update."""
    player_id: str
    old_rating: int
    new_rating: int
    match_id: str
    updated_at: datetime


class PlayerNotFoundError(Exception):
    """Raised when a player_id has no row in the players table."""


class RatingService:
    """
    Rating service with consistency guarantees.
    
    Week 5, Day 1: Consistency Models applied
    - Linearizable writes (always to primary)
    - Read-your-writes for recent updates
    - Causal consistency via version tracking
    """
    
    def __init__(
        self,
        primary_db: asyncpg.Pool,
        replica_db: asyncpg.Pool,
        redis_client: redis.Redis
    ):
        self.primary = primary_db
        self.replica = replica_db
        self.redis = redis_client
        
        # Track recent updates for read-your-writes
        self.recent_update_ttl = timedelta(seconds=30)
    
    async def update_ratings_for_match(
        self,
        match_id: str,
        winner_ids: list[str],
        loser_ids: list[str],
        is_draw: bool = False
    ) -> list[RatingUpdate]:
        """
        Update ratings for all players in a match.
        
        Uses transaction for atomicity.
        Linearizable consistency (writes to primary).
        """
        updates = []
        
        async with self.primary.acquire() as conn:
            async with conn.transaction():
                # Get current ratings
                all_player_ids = winner_ids + loser_ids
                ratings = await conn.fetch(
                    """
                    SELECT player_id, elo_rating 
                    FROM players 
                    WHERE player_id = ANY($1)
                    FOR UPDATE  -- Lock rows for consistent read-modify-write
                    """,
                    all_player_ids
                )
                
                rating_map = {r['player_id']: r['elo_rating'] for r in ratings}
                
                # Calculate new ratings (ELO formula)
                new_ratings = self._calculate_elo_changes(
                    winners={pid: rating_map[pid] for pid in winner_ids},
                    losers={pid: rating_map[pid] for pid in loser_ids},
                    is_draw=is_draw
                )
                
                # Update all ratings atomically
                for player_id, new_rating in new_ratings.items():
                    old_rating = rating_map[player_id]
                    
                    await conn.execute(
                        """
                        UPDATE players 
                        SET elo_rating = $1, 
                            last_match_at = NOW(),
                            rating_version = rating_version + 1
                        WHERE player_id = $2
                        """,
                        new_rating, player_id
                    )
                    
                    # Record update for read-your-writes
                    await self._record_recent_update(player_id)
                    
                    updates.append(RatingUpdate(
                        player_id=player_id,
                        old_rating=old_rating,
                        new_rating=new_rating,
                        match_id=match_id,
                        updated_at=datetime.utcnow()
                    ))
        
        return updates
    
    async def get_rating(
        self,
        player_id: str,
        require_fresh: bool = False
    ) -> Tuple[int, int]:
        """
        Get player's rating.
        
        Returns (rating, version).
        
        If require_fresh=True or player has recent update,
        reads from primary (linearizable).
        Otherwise reads from replica (eventually consistent).
        """
        # Check for recent update (read-your-writes)
        has_recent_update = await self._has_recent_update(player_id)
        
        if require_fresh or has_recent_update:
            # Read from primary for consistency
            result = await self.primary.fetchrow(
                "SELECT elo_rating, rating_version FROM players WHERE player_id = $1",
                player_id
            )
        else:
            # Read from replica (faster, eventual consistency)
            result = await self.replica.fetchrow(
                "SELECT elo_rating, rating_version FROM players WHERE player_id = $1",
                player_id
            )
        
        if result:
            return result['elo_rating'], result['rating_version']
        
        raise PlayerNotFoundError(player_id)
    
    async def get_ratings_for_matchmaking(
        self,
        player_ids: list[str]
    ) -> Dict[str, int]:
        """
        Get ratings for matchmaking.
        
        ALWAYS reads from primary for linearizable consistency.
        Matchmaker needs accurate ratings for fair matches.
        """
        results = await self.primary.fetch(
            "SELECT player_id, elo_rating FROM players WHERE player_id = ANY($1)",
            player_ids
        )
        
        return {r['player_id']: r['elo_rating'] for r in results}
    
    async def _record_recent_update(self, player_id: str):
        """Record that player has a recent rating update."""
        key = f"rating:recent_update:{player_id}"
        await self.redis.setex(
            key,
            int(self.recent_update_ttl.total_seconds()),
            "1"
        )
    
    async def _has_recent_update(self, player_id: str) -> bool:
        """Check if player has a recent rating update."""
        key = f"rating:recent_update:{player_id}"
        return await self.redis.exists(key)
    
    def _calculate_elo_changes(
        self,
        winners: Dict[str, int],
        losers: Dict[str, int],
        is_draw: bool,
        k_factor: int = 32
    ) -> Dict[str, int]:
        """Calculate new ELO ratings using standard formula."""
        # Average ratings
        winner_avg = sum(winners.values()) / len(winners)
        loser_avg = sum(losers.values()) / len(losers)
        
        # Expected scores
        exp_winner = 1 / (1 + 10 ** ((loser_avg - winner_avg) / 400))
        exp_loser = 1 - exp_winner
        
        # Actual scores
        if is_draw:
            actual_winner = 0.5
            actual_loser = 0.5
        else:
            actual_winner = 1.0
            actual_loser = 0.0
        
        # Calculate changes
        new_ratings = {}
        
        for pid, rating in winners.items():
            change = k_factor * (actual_winner - exp_winner)
            new_ratings[pid] = max(0, min(3000, round(rating + change)))
        
        for pid, rating in losers.items():
            change = k_factor * (actual_loser - exp_loser)
            new_ratings[pid] = max(0, min(3000, round(rating + change)))
        
        return new_ratings
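
How the pieces fit together: a hypothetical consumer of the match-result event from Kafka calls the service, then a profile read right afterwards hits the primary via read-your-writes. The gateway.push_rating_update call is an assumed notification hook, not an API defined here.

# Hypothetical usage: handler for a finished 5v5 match event.
# Object wiring (pools, Redis client, gateway) is assumed to exist elsewhere.

async def handle_match_finished(event: dict, rating_service: RatingService, gateway):
    updates = await rating_service.update_ratings_for_match(
        match_id=event["match_id"],
        winner_ids=event["winning_team"],
        loser_ids=event["losing_team"],
        is_draw=event.get("is_draw", False),
    )
    # Push new ratings to connected players.
    for update in updates:
        await gateway.push_rating_update(update.player_id, update.new_rating)

    # A profile read right after the match hits the primary (read-your-writes),
    # because _record_recent_update() flagged the player in Redis.
    rating, version = await rating_service.get_rating(updates[0].player_id)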

Deep Dive 3: Match Creation Saga (Week 5, Day 2)

You: "Match creation is a multi-step process that can fail at any point. I'd use the saga pattern with compensation."

The Problem

MATCH CREATION STEPS (can fail at any point)

Step 1: Remove players from queue
Step 2: Send ready check to all players
Step 3: Wait for confirmations (timeout possible)
Step 4: Allocate game server
Step 5: Write match to database
Step 6: Notify players of server address

If Step 4 fails (no servers available):
  - Players already removed from queue
  - They've confirmed they're ready
  - But no match happens!
  
Need compensation to put players back in queue.

The Solution

MATCH CREATION SAGA

┌────────────────────────────────────────────────────────────────────────┐
│                    MATCH CREATION SAGA                                 │
│                                                                        │
│  Step 1: Remove from queue                                             │
│          Compensation: Re-add to queue (front)                         │
│                                                                        │
│  Step 2: Send ready checks                                             │
│          Compensation: Cancel ready check UI                           │
│                                                                        │
│  Step 3: Collect confirmations                                         │
│          Compensation: Return declined players to queue                │
│          DECISION POINT: All confirmed? Continue : Partial compensate  │
│                                                                        │
│  Step 4: Allocate game server ← PIVOT POINT                            │
│          If fail: Full compensation (re-queue all)                     │
│          If success: Forward recovery only                             │
│                                                                        │
│  Step 5: Write match record                                            │
│          Forward recovery: Retry until success                         │
│                                                                        │
│  Step 6: Notify players                                                │
│          Forward recovery: Retry until success                         │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Implementation

# Match Creation Saga

from dataclasses import dataclass, field
from typing import Optional, List, Dict
from enum import Enum
from datetime import datetime, timedelta
import asyncio


class MatchStatus(Enum):
    PENDING_CONFIRMATIONS = "pending_confirmations"
    ALLOCATING_SERVER = "allocating_server"
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"
    FAILED = "failed"


@dataclass
class MatchCandidate:
    """A potential match being formed."""
    match_id: str
    region: str
    game_mode: str
    team1_player_ids: List[str]
    team2_player_ids: List[str]
    status: MatchStatus
    created_at: datetime
    fencing_token: int
    
    # Saga state
    removed_from_queue: bool = False
    ready_checks_sent: bool = False
    confirmations: Dict[str, bool] = field(default_factory=dict)
    server_allocated: Optional[str] = None
    match_written: bool = False


class MatchCreationSaga:
    """
    Saga for creating a match.
    
    Week 5, Day 2: Saga Pattern with compensation
    
    Handles partial failures and ensures players
    are never left in limbo.
    """
    
    def __init__(
        self,
        queue_service,
        gateway_service,
        server_allocator,
        match_repository,
        fencing_token: int
    ):
        self.queue = queue_service
        self.gateway = gateway_service
        self.servers = server_allocator
        self.matches = match_repository
        self.fencing_token = fencing_token
    
    async def execute(self, candidate: MatchCandidate) -> MatchCandidate:
        """
        Execute the match creation saga.
        
        Returns the final state of the match candidate.
        """
        all_players = candidate.team1_player_ids + candidate.team2_player_ids
        
        try:
            # Step 1: Remove players from queue
            await self._step_remove_from_queue(candidate, all_players)
            
            # Step 2: Send ready checks
            await self._step_send_ready_checks(candidate, all_players)
            
            # Step 3: Wait for confirmations
            confirmed = await self._step_collect_confirmations(
                candidate, 
                all_players,
                timeout=timedelta(seconds=30)
            )
            
            if not confirmed:
                # Partial compensation - some players declined
                await self._compensate_unconfirmed(candidate)
                candidate.status = MatchStatus.CANCELLED
                return candidate
            
            # Step 4: Allocate game server (PIVOT POINT)
            try:
                server = await self._step_allocate_server(candidate)
            except NoServersAvailableError:
                # Full compensation - re-queue everyone
                await self._compensate_full(candidate, all_players)
                candidate.status = MatchStatus.FAILED
                return candidate
            
            # PIVOT PASSED - Forward recovery only from here
            candidate.server_allocated = server.address
            
            # Step 5: Write match to database (retry until success)
            await self._step_write_match(candidate)
            
            # Step 6: Notify players (retry until success)
            await self._step_notify_players(candidate, all_players)
            
            candidate.status = MatchStatus.CONFIRMED
            return candidate
            
        except Exception as e:
            # Unexpected error - full compensation
            await self._compensate_full(candidate, all_players)
            candidate.status = MatchStatus.FAILED
            raise
    
    async def _step_remove_from_queue(
        self,
        candidate: MatchCandidate,
        player_ids: List[str]
    ):
        """Step 1: Remove players from queue."""
        success = await self.queue.remove_players(
            region=candidate.region,
            player_ids=player_ids,
            fencing_token=self.fencing_token,
            pending_match_id=candidate.match_id
        )
        
        if not success:
            raise FencingError("Stale fencing token")
        
        candidate.removed_from_queue = True
    
    async def _step_send_ready_checks(
        self,
        candidate: MatchCandidate,
        player_ids: List[str]
    ):
        """Step 2: Send ready checks to all players."""
        await self.gateway.send_ready_checks(
            player_ids=player_ids,
            match_id=candidate.match_id,
            game_mode=candidate.game_mode
        )
        candidate.ready_checks_sent = True
    
    async def _step_collect_confirmations(
        self,
        candidate: MatchCandidate,
        player_ids: List[str],
        timeout: timedelta
    ) -> bool:
        """Step 3: Collect player confirmations."""
        deadline = datetime.utcnow() + timeout
        
        while datetime.utcnow() < deadline:
            # Get current confirmations
            confirmations = await self.gateway.get_confirmations(
                candidate.match_id
            )
            candidate.confirmations = confirmations
            
            # Check if all confirmed
            confirmed_count = sum(1 for c in confirmations.values() if c)
            declined_count = sum(1 for c in confirmations.values() if c is False)
            
            if confirmed_count == len(player_ids):
                return True
            
            if declined_count > 0:
                # Someone explicitly declined
                return False
            
            await asyncio.sleep(0.5)
        
        # Timeout - treat as decline
        return False
    
    async def _step_allocate_server(
        self,
        candidate: MatchCandidate
    ):
        """Step 4: Allocate game server (PIVOT)."""
        server = await self.servers.allocate(
            region=candidate.region,
            game_mode=candidate.game_mode,
            match_id=candidate.match_id
        )
        return server
    
    async def _step_write_match(self, candidate: MatchCandidate):
        """Step 5: Write match to database (forward recovery)."""
        max_retries = 10
        
        for attempt in range(max_retries):
            try:
                await self.matches.create(
                    match_id=candidate.match_id,
                    region=candidate.region,
                    game_mode=candidate.game_mode,
                    team1=candidate.team1_player_ids,
                    team2=candidate.team2_player_ids,
                    server_address=candidate.server_allocated
                )
                candidate.match_written = True
                return
            except Exception as e:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
    
    async def _step_notify_players(
        self,
        candidate: MatchCandidate,
        player_ids: List[str]
    ):
        """Step 6: Notify players of match (forward recovery)."""
        # Retry until all notified
        for _ in range(10):
            try:
                await self.gateway.notify_match_ready(
                    player_ids=player_ids,
                    match_id=candidate.match_id,
                    server_address=candidate.server_allocated
                )
                return
            except Exception:
                await asyncio.sleep(1)
    
    async def _compensate_unconfirmed(self, candidate: MatchCandidate):
        """Compensate for unconfirmed players."""
        # Re-queue players who confirmed
        confirmed_players = [
            pid for pid, confirmed in candidate.confirmations.items()
            if confirmed
        ]
        
        if confirmed_players:
            await self.queue.add_players_priority(
                region=candidate.region,
                player_ids=confirmed_players,
                priority=True  # Put at front of queue
            )
        
        # Notify confirmed players match was cancelled
        await self.gateway.notify_match_cancelled(
            player_ids=confirmed_players,
            reason="Other players did not confirm"
        )
    
    async def _compensate_full(
        self,
        candidate: MatchCandidate,
        player_ids: List[str]
    ):
        """Full compensation - re-queue all players."""
        if candidate.removed_from_queue:
            await self.queue.add_players_priority(
                region=candidate.region,
                player_ids=player_ids,
                priority=True
            )
        
        if candidate.ready_checks_sent:
            await self.gateway.notify_match_cancelled(
                player_ids=player_ids,
                reason="Match creation failed"
            )
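
A sketch of how the matchmaker leader might drive this saga for each candidate it forms during a cycle. The candidate list and the service wiring are assumed to come from the surrounding matchmaker code.

# Sketch: the leader executes the saga for each candidate match it formed.
import uuid
from datetime import datetime


async def run_matchmaking_cycle(election, queue, gateway, servers, matches, candidates):
    token = election.get_fencing_token()
    if token is None:
        return  # lost leadership; the new leader runs the next cycle

    saga = MatchCreationSaga(queue, gateway, servers, matches, fencing_token=token)
    for team1_ids, team2_ids in candidates:
        candidate = MatchCandidate(
            match_id=str(uuid.uuid4()),
            region="na",
            game_mode="5v5",
            team1_player_ids=team1_ids,
            team2_player_ids=team2_ids,
            status=MatchStatus.PENDING_CONFIRMATIONS,
            created_at=datetime.utcnow(),
            fencing_token=token,
        )
        result = await saga.execute(candidate)
        if result.status is not MatchStatus.CONFIRMED:
            continue  # compensation already re-queued the affected players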

Deep Dive 4: Party Queue with Conflict Resolution (Week 5, Day 4)

You: "Party queue adds complexity β€” players might update their party membership concurrently. I'd use CRDTs for party state."

The Problem

PARTY CONFLICT SCENARIO

Alice creates party, invites Bob and Carol:
  Party: {Alice (leader), Bob, Carol}
  
Concurrent operations:
  - Alice kicks Carol (on her device)
  - Carol leaves party (on her device)
  - Bob invites Dave (on his device)
  
All three operations happen before any sync.

Questions:
  - What's the final party state?
  - What if Alice's kick and Carol's leave have different effects?
  - Is Dave in the party or not?

The Solution

You: "I'd model party membership as an OR-Set CRDT. Each add/remove is tagged, and add wins over concurrent remove."

PARTY STATE AS CRDT

Party modeled as OR-Set:
  - Add member = add to set with unique tag
  - Remove member = tombstone existing tags
  - Concurrent add+remove = add wins (member stays)

Example resolution:
  Initial: {Alice, Bob, Carol}
  
  Operation 1: Alice kicks Carol
    → Tombstone Carol's tag

  Operation 2: Carol leaves
    → Tombstone Carol's tag (same effect)

  Operation 3: Bob invites Dave
    → Add Dave with new tag
    
  Final state after merge: {Alice, Bob, Dave}
  Carol is out (both operations removed her)
  Dave is in (no conflicting remove)

Implementation

# Party Queue with CRDT

from dataclasses import dataclass
from typing import Dict, Optional
import json
import uuid
import time


@dataclass
class PartyMember:
    """Member in a party with metadata."""
    player_id: str
    joined_at: float
    is_leader: bool


class PartyORSet:
    """
    Party membership using OR-Set CRDT.
    
    Week 5, Day 4: Conflict Resolution with CRDTs
    
    Handles concurrent joins/leaves/kicks without conflicts.
    Add-wins semantics ensures players aren't accidentally removed.
    """
    
    def __init__(self, party_id: str, creator_id: str):
        self.party_id = party_id
        self.leader_id = creator_id
        
        # OR-Set: player_id -> {tag -> (tombstoned, joined_at)}
        self.members: Dict[str, Dict[str, tuple]] = {}
        
        # Add creator as first member
        self._add_member(creator_id, is_leader=True)
    
    def _generate_tag(self) -> str:
        """Generate unique tag for this operation."""
        return f"{uuid.uuid4()}:{time.time()}"
    
    def _add_member(self, player_id: str, is_leader: bool = False):
        """Add a member with new unique tag."""
        tag = self._generate_tag()
        
        if player_id not in self.members:
            self.members[player_id] = {}
        
        self.members[player_id][tag] = (False, time.time(), is_leader)
    
    def add_member(self, player_id: str) -> bool:
        """Add a player to the party."""
        if self.get_member_count() >= 5:
            return False  # Party full
        
        self._add_member(player_id)
        return True
    
    def remove_member(self, player_id: str):
        """Remove a player (tombstone all their tags)."""
        if player_id in self.members:
            for tag in self.members[player_id]:
                tombstoned, joined_at, is_leader = self.members[player_id][tag]
                self.members[player_id][tag] = (True, joined_at, is_leader)
    
    def is_member(self, player_id: str) -> bool:
        """Check if player is currently a member."""
        if player_id not in self.members:
            return False
        
        # Member if any non-tombstoned tag exists
        return any(
            not tombstoned
            for tombstoned, _, _ in self.members[player_id].values()
        )
    
    def get_members(self) -> list[str]:
        """Get all current members."""
        return [
            pid for pid in self.members
            if self.is_member(pid)
        ]
    
    def get_member_count(self) -> int:
        """Get count of current members."""
        return len(self.get_members())
    
    def merge(self, other: 'PartyORSet'):
        """
        Merge with another party state.
        
        OR-Set merge:
        - Union of all tags
        - Tombstoned if tombstoned in either
        - Add wins: new tag from concurrent add survives
        """
        all_players = set(self.members.keys()) | set(other.members.keys())
        
        for player_id in all_players:
            self_tags = self.members.get(player_id, {})
            other_tags = other.members.get(player_id, {})
            
            merged_tags = {}
            all_tag_ids = set(self_tags.keys()) | set(other_tags.keys())
            
            for tag in all_tag_ids:
                if tag in self_tags and tag in other_tags:
                    # Both have this tag - tombstoned if either tombstoned
                    self_tomb, self_time, self_leader = self_tags[tag]
                    other_tomb, other_time, other_leader = other_tags[tag]
                    merged_tags[tag] = (
                        self_tomb or other_tomb,
                        max(self_time, other_time),
                        self_leader or other_leader
                    )
                elif tag in self_tags:
                    merged_tags[tag] = self_tags[tag]
                else:
                    merged_tags[tag] = other_tags[tag]
            
            if merged_tags:
                self.members[player_id] = merged_tags
    
    def to_dict(self) -> dict:
        """Serialize for storage/transmission."""
        return {
            "party_id": self.party_id,
            "leader_id": self.leader_id,
            "members": {
                pid: {
                    tag: {"tombstoned": t, "joined_at": j, "is_leader": l}
                    for tag, (t, j, l) in tags.items()
                }
                for pid, tags in self.members.items()
            }
        }
    
    @classmethod
    def from_dict(cls, data: dict) -> 'PartyORSet':
        """Deserialize from storage."""
        party = cls.__new__(cls)
        party.party_id = data["party_id"]
        party.leader_id = data["leader_id"]
        party.members = {
            pid: {
                tag: (info["tombstoned"], info["joined_at"], info["is_leader"])
                for tag, info in tags.items()
            }
            for pid, tags in data["members"].items()
        }
        return party


class PartyService:
    """
    Party management service with CRDT-based conflict resolution.
    """
    
    def __init__(self, redis_client):
        self.redis = redis_client
    
    async def create_party(self, creator_id: str) -> str:
        """Create a new party."""
        party_id = str(uuid.uuid4())
        party = PartyORSet(party_id, creator_id)
        
        await self._save_party(party)
        await self._set_player_party(creator_id, party_id)
        
        return party_id
    
    async def join_party(self, party_id: str, player_id: str) -> bool:
        """Join an existing party."""
        party = await self._load_party(party_id)
        if not party:
            return False
        
        if party.add_member(player_id):
            await self._save_party(party)
            await self._set_player_party(player_id, party_id)
            return True
        
        return False
    
    async def leave_party(self, party_id: str, player_id: str):
        """Leave a party."""
        party = await self._load_party(party_id)
        if not party:
            return
        
        party.remove_member(player_id)
        await self._save_party(party)
        await self._clear_player_party(player_id)
        
        # Check if party is now empty
        if party.get_member_count() == 0:
            await self._delete_party(party_id)
    
    async def merge_party_updates(
        self,
        party_id: str,
        remote_state: dict
    ):
        """
        Merge remote party state (from another server).
        
        Used for cross-region party sync.
        """
        local_party = await self._load_party(party_id)
        remote_party = PartyORSet.from_dict(remote_state)
        
        if local_party:
            local_party.merge(remote_party)
            await self._save_party(local_party)
        else:
            await self._save_party(remote_party)
    
    async def get_party_for_matchmaking(
        self,
        party_id: str
    ) -> Optional[list[str]]:
        """Get party members for matchmaking."""
        party = await self._load_party(party_id)
        if not party:
            return None
        
        return party.get_members()
    
    async def _save_party(self, party: PartyORSet):
        """Save party to Redis."""
        await self.redis.set(
            f"party:{party.party_id}",
            json.dumps(party.to_dict())
        )
    
    async def _load_party(self, party_id: str) -> Optional[PartyORSet]:
        """Load party from Redis."""
        data = await self.redis.get(f"party:{party_id}")
        if data:
            return PartyORSet.from_dict(json.loads(data))
        return None
    
    async def _delete_party(self, party_id: str):
        """Delete party from Redis."""
        await self.redis.delete(f"party:{party_id}")
    
    async def _set_player_party(self, player_id: str, party_id: str):
        """Set player's current party."""
        await self.redis.set(f"player_party:{player_id}", party_id)
    
    async def _clear_player_party(self, player_id: str):
        """Clear player's current party."""
        await self.redis.delete(f"player_party:{player_id}")
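
To make the add-wins behavior concrete, here is a minimal sketch (not part of the service above) showing two replicas of the same party diverging and then converging via merge. It assumes the PartyORSet behavior described earlier: add_member creates a fresh unique tag, remove_member tombstones existing tags, and merge keeps per-tag tombstones.

# Minimal illustration of add-wins convergence across two replicas.
na = PartyORSet("party-1", "alice")      # NA replica
na.add_member("bob")

eu = PartyORSet.from_dict(na.to_dict())  # EU replica, synced copy of NA state

na.remove_member("bob")                  # NA: the leader kicks Bob
eu.remove_member("bob")                  # EU, concurrently: Bob leaves...
eu.add_member("bob")                     # ...and immediately rejoins (fresh tag)

na.merge(eu)
eu.merge(na)

# Both replicas converge, and the concurrent re-add wins:
# Bob's rejoin on EU is not lost to the kick that only NA observed.
assert na.get_members() == eu.get_members()
assert "bob" in na.get_members()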

Phase 5: Scaling and Edge Cases (5 minutes)

Interviewer: "How would this system scale to 10x the current load?"

Scaling Strategy

You: "There are several scaling vectors to consider..."

Horizontal Scaling

CURRENT → 10X SCALE

Queue Service:
  Current:   10 instances (1 per region shard)
  Scaled:    30 instances (3 per region for HA)
  
Matchmaker:
  Current:   5 leaders (1 per region)
  Scaled:    5 leaders + 10 standbys (hot failover)
  Note:      Cannot horizontally scale leaders (one per region)
             Instead, optimize matching algorithm
             
Rating Service:
  Current:   10 instances
  Scaled:    50 instances (stateless, scales linearly)
  
Database:
  Current:   3-node cluster
  Scaled:    Sharded by region (5 clusters × 3 nodes)
  
Redis:
  Current:   3-node cluster
  Scaled:    5 clusters (1 per region)

Bottleneck Analysis

┌────────────────────────────────────────────────────────────────────────┐
│                    BOTTLENECK ANALYSIS                                 │
│                                                                        │
│  Component          │ Current Limit    │ Scaling Solution              │
│  ───────────────────┼──────────────────┼─────────────────────────────  │
│  Queue joins        │ 40K/sec          │ Shard by region + game mode   │
│  Matchmaker cycle   │ 500ms/region     │ Optimize algorithm, parallel  │
│  Rating updates     │ 2K/sec           │ Batch updates, async writes   │
│  WebSocket conns    │ 500K/server      │ Add gateway instances         │
│  Database writes    │ 10K/sec          │ Shard by player_id hash       │
│                                                                        │
│  FIRST BOTTLENECK: Matchmaker cycle time                               │
│  At 10x scale, 500ms not enough to process 4M queued players           │
│  Solution: Parallelize within region by ELO bands                      │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
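
The "parallelize by ELO bands" fix could look roughly like the sketch below. match_band stands in for the existing single-band matching routine and band_width is an illustrative knob; neither is defined elsewhere in this design.

import asyncio

async def run_matching_cycle(queued_players, band_width: int = 200):
    """Sketch: partition a region's queue into ELO bands and match the
    bands concurrently so cycle time stops growing with queue depth."""
    bands: dict[int, list] = {}
    for entry in queued_players:
        # entry.rating is the rating snapshot stored on the queue ticket
        bands.setdefault(entry.rating // band_width, []).append(entry)

    # match_band(...) is assumed to be the existing per-band matcher.
    per_band = await asyncio.gather(
        *(match_band(players) for players in bands.values())
    )
    return [match for band_matches in per_band for match in band_matches]

Players near a band boundary would need to be visible to both adjacent bands (or the band width relaxed over time), otherwise a 1599-rated player could never be matched with a 1601-rated one.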

Edge Cases

Interviewer: "What are some edge cases we should handle?"

You: "Let me walk through several..."

Edge Case 1: Player Disconnects During Ready Check

Scenario: Player confirms ready, then disconnects before match starts

Handling:
  - 10-second grace period for reconnection
  - If not reconnected: treat as decline
  - Re-queue other confirmed players (priority)
  - Player who disconnected gets short queue ban (1 minute)
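
A rough sketch of that grace-period flow; the match and queue_service methods used here are hypothetical names for illustration only.

import asyncio

async def handle_ready_check_disconnect(match, player_id, queue_service):
    try:
        # 10-second grace period: wait for the player to reconnect.
        await asyncio.wait_for(match.wait_for_reconnect(player_id), timeout=10)
        return  # reconnected in time; the ready check continues normally
    except asyncio.TimeoutError:
        pass

    # Not reconnected: treat as a decline and unwind the pending match.
    await match.cancel()
    for pid in match.confirmed_players():
        if pid != player_id:
            # Confirmed players go back to the queue with priority.
            await queue_service.requeue(pid, priority=True)
    # Short queue ban for the player who dropped the ready check.
    await queue_service.apply_queue_ban(player_id, seconds=60)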

Edge Case 2: Rating Update During Queue

Scenario: Player queued at 1500 ELO, plays another mode, 
          rating updates to 1550 while still in first queue

Handling:
  - Queue stores snapshot of rating at join time
  - Matchmaker uses queue-time rating for matching
  - After match, new rating is used for next queue
  - Prevents gaming the system
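
One way to enforce the snapshot rule is to copy the rating into the queue ticket itself. This is a sketch; QueueEntry, rating_service.get_rating, and queue_service.enqueue are illustrative names, not definitions from the design above.

import time
from dataclasses import dataclass

@dataclass
class QueueEntry:
    """Queue ticket carrying the rating snapshotted at join time."""
    player_id: str
    game_mode: str
    rating_at_join: int
    joined_at: float

async def join_queue(player_id: str, game_mode: str, rating_service, queue_service) -> QueueEntry:
    # Snapshot the rating once; the matchmaker only ever reads
    # rating_at_join for this ticket, even if the live rating changes.
    rating = await rating_service.get_rating(player_id)
    entry = QueueEntry(player_id, game_mode, rating, time.time())
    await queue_service.enqueue(entry)
    return entry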

Edge Case 3: Cross-Region Party

Scenario: Alice (NA) parties with Bob (EU)

Handling:
  - Party stored in both regions (CRDT sync)
  - When queuing, use lowest latency region for party
  - Show warning about potential latency for some members
  - Respect party leader's preference
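
One reasonable reading of "lowest latency region for the party" is the region that minimizes the worst member's latency. A small hypothetical helper:

def pick_party_region(member_latencies: dict[str, dict[str, int]]) -> str:
    """Pick the region with the lowest worst-case latency across all members."""
    # member_latencies: {player_id: {region: latency_ms}}
    regions = next(iter(member_latencies.values())).keys()
    return min(regions, key=lambda region: max(lat[region] for lat in member_latencies.values()))

# Alice (NA) parties with Bob (EU):
latencies = {
    "alice": {"na": 20, "eu": 110},
    "bob":   {"na": 120, "eu": 25},
}
# NA's worst member latency is 120ms, EU's is 110ms, so the party queues in EU.
print(pick_party_region(latencies))  # -> "eu"

The party leader's preference could then act as a tie-breaker when two regions are close.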

Failure Scenarios

Failure                │ Detection          │ Impact                   │ Recovery
───────────────────────┼────────────────────┼──────────────────────────┼─────────────────────────────────────────────
Matchmaker crashes     │ etcd lease expires │ Queue builds up (30 sec) │ New leader elected, continues
Redis cluster fails    │ Health checks      │ Players can't queue      │ Fallback to local memory, replay from Kafka
Database primary fails │ Connection errors  │ Rating updates delayed   │ Promote replica, replay pending updates
Game server crashes    │ Heartbeat timeout  │ Match lost               │ Players returned to queue, no ELO change
Kafka unavailable      │ Connection errors  │ Events not published     │ Local queue, retry when available

Phase 6: Monitoring and Operations

Interviewer: "How would you monitor this system in production?"

Key Metrics

You: "I would track metrics at multiple levels..."

Business Metrics

- Median queue time by region: Target <30s, Alert if >60s
- Match quality (ELO difference): Target <100, Alert if >200
- Player satisfaction (match completion rate): Target >95%
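
These business metrics could be exported with the standard prometheus_client library, for example; the metric names and buckets below are illustrative, not taken from the design.

from prometheus_client import Gauge, Histogram

QUEUE_WAIT_SECONDS = Histogram(
    "matchmaking_queue_wait_seconds",
    "Time from queue join to match found",
    labelnames=["region", "game_mode"],
    buckets=(5, 10, 20, 30, 45, 60, 90, 120, 180),
)
MATCH_ELO_SPREAD = Histogram(
    "matchmaking_match_elo_spread",
    "Largest ELO difference within a created match",
    labelnames=["region", "game_mode"],
    buckets=(25, 50, 75, 100, 150, 200, 300),
)
QUEUE_DEPTH = Gauge(
    "matchmaking_queue_depth",
    "Players currently waiting in queue",
    labelnames=["region", "game_mode"],
)

def record_match_created(region: str, mode: str, wait_seconds: float, elo_spread: int):
    """Called by the matchmaker whenever it forms a match."""
    QUEUE_WAIT_SECONDS.labels(region, mode).observe(wait_seconds)
    MATCH_ELO_SPREAD.labels(region, mode).observe(elo_spread)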

System Metrics

┌────────────────────────────────────────────────────────────────────────┐
│                    MATCHMAKING DASHBOARD                               │
│                                                                        │
│  QUEUE HEALTH                                                          │
│  ├── Queue depth by region:        [████████░░] 400K players           │
│  ├── Median wait time:             [████░░░░░░] 28 seconds             │
│  └── P95 wait time:                [██████░░░░] 85 seconds             │
│                                                                        │
│  MATCHMAKER HEALTH                                                     │
│  ├── Matches created/min:          [████████░░] 8,200                  │
│  ├── Cycle time (p99):             [███░░░░░░░] 450ms                  │
│  └── Failed matches/min:           [█░░░░░░░░░] 12                     │
│                                                                        │
│  RATING SERVICE                                                        │
│  ├── Updates/sec:                  [███████░░░] 1,350                  │
│  ├── Update latency (p99):         [██░░░░░░░░] 45ms                   │
│  └── Primary DB lag:               [█░░░░░░░░░] 2ms                    │
│                                                                        │
│  LEADER ELECTION                                                       │
│  ├── Current leaders by region:    NA✓ EU✓ APAC✓ SA✓ OCE✓              │
│  ├── Last failover:                2 hours ago                         │
│  └── Failover count (24h):         1                                   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Alerting Strategy

CRITICAL (PagerDuty):
  - No matchmaker leader for region > 30 seconds
  - Queue depth growing > 10% per minute for 5 minutes
  - Rating update latency p99 > 5 seconds
  - Database replication lag > 10 seconds
  
WARNING (Slack):
  - Median queue time > 45 seconds
  - Match ELO spread > 150 average
  - Game server allocation failures > 1%
  - Kafka consumer lag > 1000 messages
  
INFO (Dashboard only):
  - Leader election occurred
  - New region peak queue depth
  - Daily match count records

Runbook Snippet

RUNBOOK: Matchmaker Leader Not Elected

SYMPTOMS:
  - Alert: "No matchmaker leader for NA > 30s"
  - Queue depth increasing
  - No new matches being created

DIAGNOSIS:
  1. Check etcd cluster health:
     $ etcdctl endpoint health
     
  2. Check matchmaker pods:
     $ kubectl get pods -l app=matchmaker
     
  3. Check etcd leader election key:
     $ etcdctl get /matchmaker/na/leader

RESOLUTION:
  1. If etcd unhealthy:
     - Check etcd pod logs
     - Restart etcd pods if necessary
     
  2. If matchmaker pods crashed:
     - Check for OOM or panic in logs
     - Scale up replicas
     
  3. If election stuck:
     - Delete stale lease: etcdctl lease revoke <lease-id>
     - Matchmakers will re-elect
     
  4. Escalate to: Platform Team

Interview Conclusion

Interviewer: "Excellent work. You've demonstrated strong understanding of distributed systems, particularly around leader election and consistency models. The saga pattern for match creation was well thought out. Any questions for me?"

You: "Thank you! I'm curious how you handle the matchmaking algorithm itself β€” do you use any ML to predict match quality? And how do you balance queue time versus match fairness in practice?"

Interviewer: "Great questions. We actually use a learned model to predict match outcomes and optimize for close games. The queue time trade-off is a constant tuning exercise β€” we have knobs for each region based on player population."


Summary: Week 5 Concepts Applied

Week 5: Consistency and Coordination

Day    │ Concept                 │ Application in This Design
───────┼─────────────────────────┼───────────────────────────────────────────────────────────
Day 1  │ Consistency Models      │ Linearizable reads for matchmaking ratings, read-your-writes for player's own rating after match
Day 2  │ Saga Pattern            │ Match creation as saga with compensation for failed/declined matches
Day 3  │ Workflow Orchestration  │ Could use Temporal for complex match lifecycle (mentioned as alternative)
Day 4  │ Conflict Resolution     │ Party membership as OR-Set CRDT for concurrent join/leave/kick
Day 5  │ Leader Election         │ Single matchmaker leader per region with fencing tokens

Code Patterns Demonstrated

1. LEADER ELECTION WITH FENCING
   - MatchmakerLeaderElection class
   - FencedQueueOperations for safe state modification
   - etcd-based election with lease TTL

2. CONSISTENCY GUARANTEES
   - RatingService with read-your-writes
   - Linearizable reads for matchmaking
   - Eventual consistency for leaderboards

3. SAGA PATTERN
   - MatchCreationSaga with compensation
   - Pivot point at server allocation
   - Forward recovery after pivot

4. CRDT FOR CONFLICT RESOLUTION
   - PartyORSet for membership
   - Add-wins semantics
   - Merge operation for cross-region sync

5. FENCING TOKEN VALIDATION
   - Lua script in Redis for atomic check
   - Prevents zombie leader writes
   - Monotonically increasing tokens

Self-Assessment Checklist

After studying this capstone, you should be able to:

  • Explain why leader election is necessary for matchmaking
  • Implement fencing tokens to prevent zombie leader corruption
  • Design consistency models based on operation requirements
  • Use sagas for multi-step operations with compensation
  • Apply CRDTs for concurrent state modification
  • Choose between choreography and orchestration for workflows
  • Identify pivot points in saga design
  • Handle partial failures gracefully
  • Design monitoring for distributed coordination
  • Debug leader election and consistency issues

Key Interview Takeaways

┌────────────────────────────────────────────────────────────────────────┐
│                    MATCHMAKING SYSTEM CHECKLIST                        │
│                                                                        │
│  ✓ Leader election for matchmaker (one per region)                     │
│  ✓ Fencing tokens prevent zombie leader writes                         │
│  ✓ Saga pattern for match creation with compensation                   │
│  ✓ Strong consistency for ratings (linearizable)                       │
│  ✓ CRDTs for party state (conflict-free merging)                       │
│  ✓ Read-your-writes for player viewing own rating                      │
│  ✓ Graceful degradation during failures                                │
│  ✓ Horizontal scaling where possible                                   │
│  ✓ Clear monitoring and alerting strategy                              │
│                                                                        │
│  KEY TRADE-OFFS:                                                       │
│  • Queue time vs match quality (configurable ELO range)                │
│  • Strong vs eventual consistency (per-operation choice)               │
│  • Single leader vs parallel processing (leader per region)            │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

This capstone problem integrates all concepts from Week 5 of the System Design Mastery Series. Use this as a template for approaching similar interview problems involving distributed coordination.