Weeks 1-6 Capstone: Food Delivery Order System
A Complete System Design Interview Integrating All Concepts
The Interview Begins
You walk into the interview room at a food delivery company. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in today. We're going to work through a system design problem together. I'm interested in your thought process, so please think out loud. Feel free to ask questions — this is collaborative."
They write on the whiteboard:
╔══════════════════════════════════════════════════════════════════════╗
║ ║
║ Design a Food Delivery Order System ║
║ ║
║ We're building the core ordering platform for a food delivery ║
║ service like DoorDash or Uber Eats. ║
║ ║
║ Key capabilities needed: ║
║ - Customers browse restaurants and place orders ║
║ - Orders are routed to restaurants for preparation ║
║ - Drivers are assigned and tracked in real-time ║
║ - All parties receive status updates throughout ║
║ - Payments are processed reliably ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
Interviewer: "Take a moment to think about this. We have about 45 minutes. Where would you like to start?"
Phase 1: Requirements Clarification
Before diving in, you take a breath and start asking questions.
You: "Before I start designing, I'd like to clarify some requirements. First, what's our scale? How many orders per day are we handling?"
Interviewer: "We're one of the larger players. About 5 million orders per day, concentrated heavily during lunch and dinner peaks."
You: "So during peak hours — say 6-8 PM — we might see 3-4x the average rate. What's the geographic scope? Single country or global?"
Interviewer: "Focus on a single country for now, but the design should be extensible to multiple regions."
You: "For the order flow, once a customer places an order, what's the expected time until a driver is assigned?"
Interviewer: "We want driver assignment within 30 seconds of restaurant confirmation. The restaurant should see the order within 2 seconds of placement."
You: "What about payment? Do we handle payments ourselves or use a payment processor?"
Interviewer: "We use Stripe for payment processing, but we need to handle the complexity of: authorization at order time, capture when driver picks up, and potential refunds."
You: "Last question — how important is real-time tracking? Do customers need live driver location?"
Interviewer: "Yes, live tracking is essential. Customers expect to see driver location updating every few seconds."
You: "Perfect. Let me summarize the requirements."
Functional Requirements
1. ORDER PLACEMENT
- Customer browses restaurants (filtered by location, cuisine, rating)
- Customer adds items to cart and places order
- Order is validated (restaurant open, items available, address deliverable)
- Payment is authorized
2. ORDER PROCESSING
- Order sent to restaurant in real-time
- Restaurant confirms and provides prep time estimate
- System assigns optimal driver
- Driver accepts/rejects assignment
3. ORDER FULFILLMENT
- Track order status: placed → confirmed → preparing → ready → picked up → delivered (modeled as a state machine; see the sketch after this list)
- Real-time driver location tracking
- ETA updates throughout
4. NOTIFICATIONS
- Customer: order confirmed, driver assigned, driver arriving, delivered
- Restaurant: new order, driver arriving for pickup
- Driver: new assignment, navigation updates
5. PAYMENTS
- Authorize at order placement
- Capture when driver picks up
- Handle refunds for cancellations or issues
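The status flow in requirement 3 is effectively a small state machine. A minimal sketch of the legal transitions (adding "cancelled", which the compensation flow later in this design introduces):

# Sketch: order-status state machine from requirement 3. "cancelled" is an
# addition here; it comes from the saga compensation path described later.
VALID_TRANSITIONS = {
    "placed":    {"confirmed", "cancelled"},
    "confirmed": {"preparing", "cancelled"},
    "preparing": {"ready", "cancelled"},
    "ready":     {"picked_up"},
    "picked_up": {"delivered"},
}

def can_transition(current: str, new: str) -> bool:
    """Reject out-of-order updates (e.g. 'delivered' before 'picked_up')."""
    return new in VALID_TRANSITIONS.get(current, set())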
Non-Functional Requirements
1. SCALE
- 5M orders/day → ~60 orders/sec average
- Peak: 200-300 orders/sec during dinner rush
- 100K concurrent users browsing
- 50K active drivers at peak
2. LATENCY
- Restaurant sees order: < 2 seconds
- Driver assignment: < 30 seconds from restaurant confirmation
- Location updates: every 3-5 seconds
- Order status API: < 100ms p99
3. RELIABILITY
- Orders must never be lost (durability)
- Payments must be exactly-once (no double charges)
- Driver assignment must be atomic (no double assignments)
4. AVAILABILITY
- 99.9% uptime (8.7 hours downtime/year max)
- Graceful degradation during partial failures
Phase 2: Back-of-Envelope Estimation
You: "Let me work through the numbers to understand what we're building."
Traffic Estimation
ORDER TRAFFIC
Daily orders: 5,000,000
Average rate: 5M / 86,400 = ~58 orders/sec
Peak multiplier: 4x during dinner (6-8 PM)
Peak rate: ~230 orders/sec
Per order, we have:
- 1 order creation
- 3-5 status updates
- 10-20 location updates (30 min delivery, every 3 sec = 600 updates,
but batched to 10-20 meaningful ones)
- 5-10 notifications
Total writes at peak: 230 × 30 = ~7,000 writes/sec
READ TRAFFIC
Browsing: 100K concurrent users
Each browsing = 10 requests/min
100K × 10 / 60 = ~17,000 reads/sec
Order tracking: Active orders at any time ≈ 100K
Each polling every 10 sec
100K / 10 = 10,000 reads/sec
Total reads at peak: ~30,000 reads/sec
Storage Estimation
ORDER DATA
Per order:
- Order record: 2 KB (items, addresses, metadata)
- Status history: 500 bytes (10 status changes × 50 bytes)
- Location history: 2 KB (20 points × 100 bytes)
- Total per order: ~5 KB
Daily storage: 5M × 5 KB = 25 GB/day
Yearly storage: 25 GB × 365 = ~9 TB/year
With 3x replication: ~27 TB/year
ACTIVE DATA (hot)
- Last 24 hours orders: 5M × 5 KB = 25 GB
- Active orders (in-flight): 100K × 5 KB = 500 MB
- Driver locations: 50K × 100 bytes = 5 MB
Infrastructure Estimation
┌─────────────────────────────────────────────────────────────────────────┐
│ ESTIMATION SUMMARY │
│ │
│ TRAFFIC │
│ ├── Peak orders: 230/sec │
│ ├── Peak writes: 7,000/sec │
│ └── Peak reads: 30,000/sec │
│ │
│ STORAGE │
│ ├── Daily new data: 25 GB │
│ ├── Hot data (24h): 25 GB │
│ └── Yearly growth: 9 TB │
│ │
│ INFRASTRUCTURE (rough) │
│ ├── API servers: 20-30 instances │
│ ├── Database: Primary + 3 read replicas │
│ ├── Cache (Redis): 32 GB cluster │
│ └── Message queue: Kafka with 32 partitions │
│ │
└─────────────────────────────────────────────────────────────────────────┘
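These figures are easy to re-derive under different assumptions; a scratch script using the numbers above (decimal units, so 5 KB = 5,000 bytes):

# estimation_check.py
# Scratch script to re-derive the estimates above.
DAILY_ORDERS = 5_000_000
PEAK_MULTIPLIER = 4            # dinner rush vs daily average
WRITES_PER_ORDER = 30          # creation + status + batched locations + notifications
BYTES_PER_ORDER = 5_000        # ~5 KB: order record + status + location history

avg_orders_sec = DAILY_ORDERS / 86_400                      # ~58/sec
peak_orders_sec = avg_orders_sec * PEAK_MULTIPLIER          # ~230/sec
peak_writes_sec = peak_orders_sec * WRITES_PER_ORDER        # ~7,000/sec
daily_storage_gb = DAILY_ORDERS * BYTES_PER_ORDER / 1e9     # 25 GB/day
yearly_storage_tb = daily_storage_gb * 365 / 1e3            # ~9 TB/year

print(f"{avg_orders_sec:.0f}/sec avg, {peak_orders_sec:.0f}/sec peak, "
      f"{peak_writes_sec:,.0f} writes/sec, "
      f"{daily_storage_gb:.0f} GB/day, {yearly_storage_tb:.1f} TB/year")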
Phase 3: High-Level Design
You: "Now let me sketch out the high-level architecture."
System Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ HIGH-LEVEL ARCHITECTURE │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Customer │ │ Restaurant │ │ Driver │ │
│ │ App │ │ App │ │ App │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ CDN │ │
│ └──────┬──────┘ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ ALB │────▶│ Rate Limiter│ │
│ └──────┬──────┘ └─────────────┘ │
│ │ │
│ ┌────────────────┼────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
│ │ Order API │ │ Driver API │ │Restaurant API│ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Kafka │ │
│ │ (Events) │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────────┬──────────┼──────────┬──────────┐ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Order │ │Driver│ │Payment│ │Notif.│ │Track │ │
│ │Worker│ │Match │ │Worker │ │Worker│ │Worker│ │
│ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │ │ │
│ └─────────┴─────────┴────┬────┴─────────┘ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ S3/Blob │ │
│ │ (Orders) │ │ (Cache) │ │ (History) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Component Breakdown
You: "Let me walk through each component."
1. API Layer
Order API: Handles order placement, status queries, cancellations
- Validates orders (restaurant hours, delivery address, payment)
- Writes to transactional outbox (Week 3 pattern)
- Returns order ID immediately
Driver API: Handles driver location updates, assignment acceptance
- High-frequency location updates (every 3-5 seconds)
- Assignment accept/reject
- Status updates (picked up, delivered)
Restaurant API: Order receipt, confirmation, prep time updates
- Real-time order notification via WebSocket
- Prep time estimates
- Order ready signal
2. Event-Driven Core (Kafka)
Topics:
- orders.created - New orders for processing
- orders.status - Status change events
- driver.location - Location updates
- driver.assignment - Assignment events
- payments.process - Payment commands
- notifications.send - Notification requests
3. Workers
Order Worker: Orchestrates the order lifecycle via the saga pattern (Week 5)
Driver Matcher: Assigns the optimal driver based on location, rating, and load
Payment Worker: Handles Stripe integration with idempotency (Week 2)
Notification Worker: Multi-channel delivery (Week 6)
Tracking Worker: Processes location updates and calculates ETAs
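As a concrete sketch, here is what one of these workers looks like as a consumer-group member (assuming aiokafka; the broker address and handler body are illustrative):

# workers/order_worker.py
# Minimal consumer-group skeleton, assuming aiokafka. The handler is a
# placeholder for the saga kickoff shown in the deep dives below.
import asyncio
import json
from aiokafka import AIOKafkaConsumer

async def run_order_worker():
    consumer = AIOKafkaConsumer(
        "orders.created",
        bootstrap_servers="kafka:9092",     # assumed broker address
        group_id="order-workers",           # consumer group: scale by adding workers
        enable_auto_commit=False,           # commit only after successful handling
    )
    await consumer.start()
    try:
        async for msg in consumer:
            event = json.loads(msg.value)
            await handle_order_placed(event)    # must be idempotent (see Deep Dive 1)
            await consumer.commit()             # at-least-once delivery
    finally:
        await consumer.stop()

async def handle_order_placed(event: dict):
    ...  # start the order saga for event["order_id"]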
Phase 4: Deep Dives
Interviewer: "Great overview. Let's dive deeper. How do you ensure an order is never lost?"
Deep Dive 1: Reliable Order Processing (Week 3 - Transactional Outbox)
You: "This is critical. We use the transactional outbox pattern to ensure orders are never lost, even if Kafka is temporarily unavailable."
The Problem
WITHOUT OUTBOX PATTERN
Customer places order:
1. Write to database ✓ Success
2. Publish to Kafka ✗ Kafka down!
Result: Order in DB but never processed
Customer charged, no food delivered
This is unacceptable for a food delivery system.
The Solution
# services/order_service.py
# Applies: Week 3, Day 2 - Transactional Outbox
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import uuid
import json

class PaymentAuthorizationError(Exception):
    """Raised when payment authorization is declined."""
    pass
@dataclass
class Order:
order_id: str
customer_id: str
restaurant_id: str
items: list
delivery_address: dict
total_amount: float
status: str
created_at: datetime
@dataclass
class OutboxMessage:
message_id: str
aggregate_type: str
aggregate_id: str
event_type: str
payload: dict
created_at: datetime
class OrderService:
"""
Order service with transactional outbox pattern.
Guarantees:
- Order and outbox message written atomically
- Events eventually published to Kafka
- No order is ever lost
"""
def __init__(self, db_pool, payment_service):
self.db = db_pool
self.payments = payment_service
async def place_order(self, request: dict) -> Order:
"""
Place an order with guaranteed event publishing.
Uses single transaction for order + outbox.
"""
order_id = str(uuid.uuid4())
# Authorize payment first (can fail, no order created yet)
auth_result = await self.payments.authorize(
customer_id=request["customer_id"],
amount=request["total_amount"],
idempotency_key=f"order:{order_id}:auth"
)
if not auth_result.success:
raise PaymentAuthorizationError(auth_result.error)
# Single transaction: order + outbox
async with self.db.transaction() as tx:
# Insert order
order = Order(
order_id=order_id,
customer_id=request["customer_id"],
restaurant_id=request["restaurant_id"],
items=request["items"],
delivery_address=request["delivery_address"],
total_amount=request["total_amount"],
status="placed",
created_at=datetime.utcnow()
)
await tx.execute("""
INSERT INTO orders (
order_id, customer_id, restaurant_id, items,
delivery_address, total_amount, status, created_at,
payment_auth_id
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
""", order.order_id, order.customer_id, order.restaurant_id,
json.dumps(order.items), json.dumps(order.delivery_address),
order.total_amount, order.status, order.created_at,
auth_result.authorization_id)
# Insert outbox message (same transaction!)
outbox_message = OutboxMessage(
message_id=str(uuid.uuid4()),
aggregate_type="Order",
aggregate_id=order_id,
event_type="OrderPlaced",
payload={
"order_id": order_id,
"customer_id": order.customer_id,
"restaurant_id": order.restaurant_id,
"items": order.items,
"total_amount": order.total_amount,
"payment_auth_id": auth_result.authorization_id,
},
created_at=datetime.utcnow()
)
await tx.execute("""
INSERT INTO outbox (
message_id, aggregate_type, aggregate_id,
event_type, payload, created_at
) VALUES ($1, $2, $3, $4, $5, $6)
""", outbox_message.message_id, outbox_message.aggregate_type,
outbox_message.aggregate_id, outbox_message.event_type,
json.dumps(outbox_message.payload), outbox_message.created_at)
# Transaction committed - order is durable
# Outbox publisher will send to Kafka eventually
return order
class OutboxPublisher:
"""
Publishes outbox messages to Kafka.
Runs as background process, polling for unpublished messages.
"""
def __init__(self, db_pool, kafka_producer):
self.db = db_pool
self.kafka = kafka_producer
async def poll_and_publish(self, batch_size: int = 100):
"""Poll outbox and publish to Kafka."""
async with self.db.transaction() as tx:
# Lock and fetch unpublished messages
messages = await tx.fetch("""
SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY created_at
LIMIT $1
FOR UPDATE SKIP LOCKED
""", batch_size)
for msg in messages:
topic = self._get_topic(msg["event_type"])
# Publish to Kafka
await self.kafka.send(
topic=topic,
key=msg["aggregate_id"].encode(),
value=json.dumps(msg["payload"]).encode(),
headers=[("message_id", msg["message_id"].encode())]
)
# Mark as published
await tx.execute("""
UPDATE outbox
SET published_at = NOW()
WHERE message_id = $1
""", msg["message_id"])
def _get_topic(self, event_type: str) -> str:
return {
"OrderPlaced": "orders.created",
"OrderConfirmed": "orders.status",
"OrderReady": "orders.status",
"OrderDelivered": "orders.status",
}.get(event_type, "orders.events")
Interviewer: "What happens if the outbox publisher crashes?"
You: "Since we mark messages as published only after Kafka confirms receipt, a crash means the message stays in the outbox. Next poll picks it up. We might publish twice, which is why downstream consumers must be idempotent."
Deep Dive 2: Payment Processing (Week 2 - Idempotency)
Interviewer: "Speaking of payments — how do you ensure we never double-charge a customer?"
You: "This is where idempotency is critical. We use idempotency keys for every payment operation."
The Problem
DOUBLE CHARGE SCENARIO
1. Customer places order
2. Payment authorized
3. Driver picks up food
4. We call Stripe to capture payment
5. Network timeout — did it succeed?
6. We retry capture
7. Customer charged twice!
This WILL happen at scale. We need to prevent it.
The Solution
# services/payment_service.py
# Applies: Week 2, Day 2 - Idempotency
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta
from typing import Optional
from enum import Enum
import asyncio
import json

class StripeError(Exception):
    """Stand-in for the payment processor's error type."""
    pass

@dataclass
class AuthorizationResult:
    success: bool
    authorization_id: Optional[str] = None
    amount: Optional[float] = None
    error: Optional[str] = None

@dataclass
class CaptureResult:
    success: bool
    captured: bool = False
    error: Optional[str] = None
class PaymentStatus(Enum):
PENDING = "pending"
AUTHORIZED = "authorized"
CAPTURED = "captured"
REFUNDED = "refunded"
FAILED = "failed"
@dataclass
class PaymentRecord:
payment_id: str
order_id: str
amount: float
status: PaymentStatus
idempotency_key: str
stripe_payment_intent_id: Optional[str]
created_at: datetime
updated_at: datetime
class PaymentService:
"""
Payment service with idempotency guarantees.
Every operation uses idempotency keys to ensure:
- Authorization happens exactly once
- Capture happens exactly once
- Refund happens exactly once
"""
def __init__(self, db_pool, stripe_client, redis_client):
self.db = db_pool
self.stripe = stripe_client
self.redis = redis_client
# Idempotency window (Week 2 concept)
self.idempotency_ttl = timedelta(hours=24)
async def authorize(
self,
customer_id: str,
amount: float,
idempotency_key: str
) -> AuthorizationResult:
"""
Authorize payment with idempotency.
Same idempotency_key always returns same result.
"""
# Check if we've seen this key before
existing = await self._get_idempotent_result(idempotency_key)
if existing:
return existing
# Try to acquire lock for this key
lock_acquired = await self._acquire_idempotency_lock(idempotency_key)
if not lock_acquired:
# Another request is processing - wait and return their result
return await self._wait_for_result(idempotency_key)
try:
# Create payment intent with Stripe
# Stripe also accepts idempotency key!
intent = await self.stripe.create_payment_intent(
amount=int(amount * 100), # cents
currency="usd",
customer=customer_id,
capture_method="manual", # Authorize only
idempotency_key=idempotency_key
)
result = AuthorizationResult(
success=True,
authorization_id=intent.id,
amount=amount
)
# Store result for future idempotent requests
await self._store_idempotent_result(idempotency_key, result)
return result
except StripeError as e:
result = AuthorizationResult(
success=False,
error=str(e)
)
await self._store_idempotent_result(idempotency_key, result)
return result
finally:
await self._release_idempotency_lock(idempotency_key)
async def capture(
self,
order_id: str,
authorization_id: str,
amount: float
) -> CaptureResult:
"""
Capture authorized payment.
Idempotency key derived from order_id ensures exactly-once capture.
"""
idempotency_key = f"capture:{order_id}"
existing = await self._get_idempotent_result(idempotency_key)
if existing:
return existing
lock_acquired = await self._acquire_idempotency_lock(idempotency_key)
if not lock_acquired:
return await self._wait_for_result(idempotency_key)
try:
# Check current state in our database
payment = await self._get_payment_by_order(order_id)
if payment and payment.status == PaymentStatus.CAPTURED:
# Already captured - return success
return CaptureResult(success=True, captured=True)
# Capture with Stripe (Stripe idempotency handles their side)
intent = await self.stripe.capture_payment_intent(
authorization_id,
amount_to_capture=int(amount * 100),
idempotency_key=idempotency_key
)
# Update our database
await self._update_payment_status(
order_id,
PaymentStatus.CAPTURED
)
result = CaptureResult(success=True, captured=True)
await self._store_idempotent_result(idempotency_key, result)
return result
except StripeError as e:
result = CaptureResult(success=False, error=str(e))
await self._store_idempotent_result(idempotency_key, result)
return result
finally:
await self._release_idempotency_lock(idempotency_key)
async def _acquire_idempotency_lock(self, key: str) -> bool:
"""Acquire distributed lock for idempotency key."""
lock_key = f"idem_lock:{key}"
return await self.redis.set(
lock_key,
"1",
nx=True, # Only if not exists
ex=30 # 30 second timeout
)
async def _get_idempotent_result(self, key: str):
"""Get stored result for idempotency key."""
result_key = f"idem_result:{key}"
data = await self.redis.get(result_key)
if data:
return self._deserialize_result(data)
return None
async def _store_idempotent_result(self, key: str, result):
"""Store result for future idempotent requests."""
result_key = f"idem_result:{key}"
await self.redis.setex(
result_key,
int(self.idempotency_ttl.total_seconds()),
self._serialize_result(result)
        )

    async def _release_idempotency_lock(self, key: str):
        """Release the distributed lock for an idempotency key."""
        await self.redis.delete(f"idem_lock:{key}")

    async def _wait_for_result(self, key: str, timeout_seconds: float = 30.0):
        """Poll until the request holding the lock stores its result."""
        waited = 0.0
        while waited < timeout_seconds:
            result = await self._get_idempotent_result(key)
            if result is not None:
                return result
            await asyncio.sleep(0.1)
            waited += 0.1
        raise TimeoutError(f"Timed out waiting for idempotent result: {key}")

    def _serialize_result(self, result) -> str:
        """Store the result dataclass with a type tag so it can be rebuilt."""
        return json.dumps({"type": type(result).__name__, "data": asdict(result)})

    def _deserialize_result(self, raw):
        """Rebuild the dataclass stored by _serialize_result."""
        payload = json.loads(raw)
        cls = {
            "AuthorizationResult": AuthorizationResult,
            "CaptureResult": CaptureResult,
        }[payload["type"]]
        return cls(**payload["data"])

    async def _get_payment_by_order(self, order_id: str) -> Optional[PaymentRecord]:
        """Load our payment record for an order, if one exists."""
        row = await self.db.fetchrow(
            "SELECT * FROM payments WHERE order_id = $1", order_id
        )
        if not row:
            return None
        record = PaymentRecord(**dict(row))
        record.status = PaymentStatus(record.status)  # stored as a string
        return record

    async def _update_payment_status(self, order_id: str, status: PaymentStatus):
        """Persist a payment status change."""
        await self.db.execute(
            "UPDATE payments SET status = $1, updated_at = NOW() WHERE order_id = $2",
            status.value, order_id
        )
Deep Dive 3: Driver Assignment (Week 5 - Distributed Coordination)
Interviewer: "How do you assign drivers without double-assigning the same driver to multiple orders?"
You: "This requires distributed coordination. We can't have two orders grab the same driver simultaneously."
The Problem
DOUBLE ASSIGNMENT RACE CONDITION
Order A needs driver: Order B needs driver:
1. Query available drivers 1. Query available drivers
2. See Driver X is free 2. See Driver X is free
3. Assign Driver X to A 3. Assign Driver X to B
4. Driver X has two orders!
At 200 orders/sec during peak, this WILL happen.
The Solution
# services/driver_matcher.py
# Applies: Week 5, Day 5 - Leader Election & Coordination
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, List
import math
@dataclass
class Driver:
driver_id: str
location: tuple # (lat, lng)
status: str # available, assigned, busy
rating: float
current_order_id: Optional[str]
@dataclass
class AssignmentResult:
success: bool
driver_id: Optional[str] = None
error: Optional[str] = None
class DriverMatcher:
"""
Assigns drivers to orders with coordination guarantees.
Uses optimistic locking to prevent double assignment.
"""
def __init__(self, db_pool, redis_client, location_service):
self.db = db_pool
self.redis = redis_client
self.locations = location_service
async def assign_driver(
self,
order_id: str,
restaurant_location: tuple,
max_distance_km: float = 5.0
) -> AssignmentResult:
"""
Find and assign optimal driver for an order.
Uses optimistic locking to prevent race conditions.
"""
# Find candidate drivers near restaurant
candidates = await self._find_nearby_drivers(
restaurant_location,
max_distance_km
)
if not candidates:
return AssignmentResult(
success=False,
error="No available drivers nearby"
)
# Sort by score (distance, rating, etc.)
ranked = self._rank_drivers(candidates, restaurant_location)
# Try to assign, starting with best candidate
for driver in ranked:
result = await self._try_assign(order_id, driver)
if result.success:
return result
return AssignmentResult(
success=False,
error="All nearby drivers unavailable"
)
async def _try_assign(
self,
order_id: str,
driver: Driver
) -> AssignmentResult:
"""
Attempt to assign driver using optimistic locking.
Only succeeds if driver is still available.
"""
async with self.db.transaction() as tx:
# Lock the driver row and check status
current = await tx.fetchrow("""
SELECT driver_id, status, version
FROM drivers
WHERE driver_id = $1
FOR UPDATE
""", driver.driver_id)
if not current or current["status"] != "available":
# Driver no longer available
return AssignmentResult(success=False)
# Update driver status atomically
result = await tx.execute("""
UPDATE drivers
SET status = 'assigned',
current_order_id = $1,
version = version + 1,
assigned_at = NOW()
WHERE driver_id = $2
AND version = $3
AND status = 'available'
""", order_id, driver.driver_id, current["version"])
if result == "UPDATE 0":
# Concurrent modification - another order got them
return AssignmentResult(success=False)
# Create assignment record
await tx.execute("""
INSERT INTO assignments (
assignment_id, order_id, driver_id,
status, created_at
) VALUES (gen_random_uuid(), $1, $2, 'pending', NOW())
""", order_id, driver.driver_id)
# Successfully assigned!
return AssignmentResult(
success=True,
driver_id=driver.driver_id
)
async def _find_nearby_drivers(
self,
location: tuple,
max_distance_km: float
) -> List[Driver]:
"""Find available drivers within radius."""
# Use Redis GEO for fast spatial query
lat, lng = location
# Get driver IDs within radius
nearby_ids = await self.redis.georadius(
"driver_locations",
lng, lat,
max_distance_km,
unit="km",
count=50
)
if not nearby_ids:
return []
# Fetch driver details (only available ones)
drivers = await self.db.fetch("""
SELECT driver_id, status, rating
FROM drivers
WHERE driver_id = ANY($1)
AND status = 'available'
""", [d.decode() for d in nearby_ids])
return [
Driver(
driver_id=d["driver_id"],
location=await self.locations.get(d["driver_id"]),
status=d["status"],
rating=d["rating"],
current_order_id=None
)
for d in drivers
]
def _rank_drivers(
self,
drivers: List[Driver],
restaurant_location: tuple
) -> List[Driver]:
"""Rank drivers by assignment score."""
def score(driver: Driver) -> float:
distance = self._haversine(driver.location, restaurant_location)
# Lower distance is better, higher rating is better
return distance - (driver.rating * 0.5)
        return sorted(drivers, key=score)

    @staticmethod
    def _haversine(a: tuple, b: tuple) -> float:
        """Great-circle distance in km between two (lat, lng) points."""
        lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
        dlat, dlng = lat2 - lat1, lng2 - lng1
        h = (math.sin(dlat / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))
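One thing the matcher defers: the driver still has to accept. A thin wrapper can give them an acceptance window and fall back to re-matching (a sketch; wait_for_driver_accept is a hypothetical helper, and driver_service.release is the same call the saga compensation uses):

# Sketch: acceptance window around assignment.
ACCEPT_TIMEOUT_SEC = 15

async def assign_with_acceptance(matcher, driver_service, order_id, restaurant_loc):
    result = await matcher.assign_driver(order_id, restaurant_loc)
    if not result.success:
        return result
    # wait_for_driver_accept is hypothetical: resolves True/False when the
    # driver responds via the Driver API, or False on timeout.
    accepted = await wait_for_driver_accept(result.driver_id, ACCEPT_TIMEOUT_SEC)
    if accepted:
        return result
    await driver_service.release(result.driver_id)
    # A production matcher would exclude drivers who just rejected this order
    return await matcher.assign_driver(order_id, restaurant_loc)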
Deep Dive 4: Order Saga (Week 5 - Saga Pattern)
Interviewer: "Walk me through what happens if something fails mid-order — say the restaurant rejects the order after payment is authorized?"
You: "This is a classic saga pattern problem. We need compensation logic for each step."
The Order Saga
# services/order_saga.py
# Applies: Week 5, Day 2-3 - Saga Pattern
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import logging
logger = logging.getLogger(__name__)

class SagaStepError(Exception):
    """Raised when a saga step fails and compensation must run."""
    pass
class SagaStep(Enum):
VALIDATE_ORDER = "validate"
AUTHORIZE_PAYMENT = "authorize_payment"
NOTIFY_RESTAURANT = "notify_restaurant"
AWAIT_CONFIRMATION = "await_confirmation"
ASSIGN_DRIVER = "assign_driver"
CAPTURE_PAYMENT = "capture_payment"
COMPLETE = "complete"
class SagaStatus(Enum):
RUNNING = "running"
COMPLETED = "completed"
COMPENSATING = "compensating"
FAILED = "failed"
@dataclass
class OrderSagaState:
saga_id: str
order_id: str
current_step: SagaStep
status: SagaStatus
payment_auth_id: Optional[str] = None
driver_id: Optional[str] = None
    compensation_reason: Optional[str] = None
    customer_id: Optional[str] = None      # filled in during validation
    order_amount: Optional[float] = None   # filled in during validation
class OrderSagaOrchestrator:
"""
Orchestrates order lifecycle as a saga.
Each step has a corresponding compensation action.
If any step fails, we run compensations in reverse order.
"""
def __init__(
self,
order_service,
payment_service,
restaurant_service,
driver_service,
notification_service,
saga_repo
):
self.orders = order_service
self.payments = payment_service
self.restaurants = restaurant_service
self.drivers = driver_service
self.notifications = notification_service
self.sagas = saga_repo
async def execute(self, order_id: str) -> OrderSagaState:
"""Execute order saga from start to completion."""
state = OrderSagaState(
saga_id=f"saga:{order_id}",
order_id=order_id,
current_step=SagaStep.VALIDATE_ORDER,
status=SagaStatus.RUNNING
)
await self.sagas.save(state)
try:
# Step 1: Validate order
state = await self._validate_order(state)
# Step 2: Authorize payment
state = await self._authorize_payment(state)
# Step 3: Notify restaurant
state = await self._notify_restaurant(state)
# Step 4: Wait for restaurant confirmation
# (This happens async via webhook/event)
state.current_step = SagaStep.AWAIT_CONFIRMATION
await self.sagas.save(state)
return state
except SagaStepError as e:
logger.error(f"Saga {state.saga_id} failed at {state.current_step}: {e}")
return await self._compensate(state, str(e))
async def on_restaurant_confirmed(self, order_id: str, prep_time_minutes: int):
"""Called when restaurant confirms order."""
state = await self.sagas.get_by_order(order_id)
if not state or state.status != SagaStatus.RUNNING:
return
try:
# Step 5: Assign driver
state = await self._assign_driver(state)
# Update order with driver info
await self.orders.update_status(
order_id,
"driver_assigned",
driver_id=state.driver_id
)
# Notify customer
await self.notifications.send(
user_id=state.customer_id,
type="driver_assigned",
template="driver_assigned",
variables={"driver_id": state.driver_id}
)
state.current_step = SagaStep.COMPLETE
state.status = SagaStatus.COMPLETED
await self.sagas.save(state)
except SagaStepError as e:
await self._compensate(state, str(e))
async def on_restaurant_rejected(self, order_id: str, reason: str):
"""Called when restaurant rejects order."""
state = await self.sagas.get_by_order(order_id)
if not state:
return
await self._compensate(state, f"Restaurant rejected: {reason}")
async def on_driver_picked_up(self, order_id: str):
"""Called when driver picks up order — capture payment."""
state = await self.sagas.get_by_order(order_id)
if not state or not state.payment_auth_id:
return
# Capture payment (idempotent)
result = await self.payments.capture(
order_id=order_id,
authorization_id=state.payment_auth_id,
amount=state.order_amount
)
if not result.success:
logger.error(f"Payment capture failed for {order_id}: {result.error}")
# Don't compensate here — food is already with driver
# Flag for manual review instead
async def _compensate(
self,
state: OrderSagaState,
reason: str
) -> OrderSagaState:
"""
Run compensation actions in reverse order.
Compensation order:
1. Release driver (if assigned)
2. Void payment authorization (if authorized)
3. Update order status to cancelled
4. Notify customer
"""
state.status = SagaStatus.COMPENSATING
state.compensation_reason = reason
await self.sagas.save(state)
logger.info(f"Starting compensation for saga {state.saga_id}")
# Compensate driver assignment
if state.driver_id:
try:
await self.drivers.release(state.driver_id)
logger.info(f"Released driver {state.driver_id}")
except Exception as e:
logger.error(f"Failed to release driver: {e}")
# Compensate payment authorization
if state.payment_auth_id:
try:
await self.payments.void_authorization(
authorization_id=state.payment_auth_id,
idempotency_key=f"void:{state.order_id}"
)
logger.info(f"Voided payment auth {state.payment_auth_id}")
except Exception as e:
logger.error(f"Failed to void payment: {e}")
# Update order status
await self.orders.update_status(
state.order_id,
"cancelled",
reason=reason
)
# Notify customer
await self.notifications.send(
user_id=state.customer_id,
type="order_cancelled",
template="order_cancelled",
variables={"reason": reason}
)
state.status = SagaStatus.FAILED
await self.sagas.save(state)
return state
async def _validate_order(self, state: OrderSagaState) -> OrderSagaState:
"""Validate order details."""
order = await self.orders.get(state.order_id)
# Check restaurant is open
restaurant = await self.restaurants.get(order.restaurant_id)
if not restaurant.is_open():
raise SagaStepError("Restaurant is closed")
# Check delivery address is in range
if not restaurant.delivers_to(order.delivery_address):
raise SagaStepError("Delivery address out of range")
state.current_step = SagaStep.AUTHORIZE_PAYMENT
state.customer_id = order.customer_id
state.order_amount = order.total_amount
await self.sagas.save(state)
return state
async def _authorize_payment(self, state: OrderSagaState) -> OrderSagaState:
"""Authorize payment."""
result = await self.payments.authorize(
customer_id=state.customer_id,
amount=state.order_amount,
idempotency_key=f"order:{state.order_id}:auth"
)
if not result.success:
raise SagaStepError(f"Payment failed: {result.error}")
state.payment_auth_id = result.authorization_id
state.current_step = SagaStep.NOTIFY_RESTAURANT
await self.sagas.save(state)
return state
async def _notify_restaurant(self, state: OrderSagaState) -> OrderSagaState:
"""Send order to restaurant."""
order = await self.orders.get(state.order_id)
await self.restaurants.send_order(
restaurant_id=order.restaurant_id,
order=order
)
state.current_step = SagaStep.AWAIT_CONFIRMATION
await self.sagas.save(state)
return state
async def _assign_driver(self, state: OrderSagaState) -> OrderSagaState:
"""Assign driver to order."""
order = await self.orders.get(state.order_id)
restaurant = await self.restaurants.get(order.restaurant_id)
result = await self.drivers.assign(
order_id=state.order_id,
restaurant_location=restaurant.location
)
if not result.success:
raise SagaStepError(f"Driver assignment failed: {result.error}")
state.driver_id = result.driver_id
state.current_step = SagaStep.COMPLETE
await self.sagas.save(state)
return state
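The on_* callbacks above are driven by events from the restaurant and driver services. A minimal dispatch sketch (the event_type values are assumptions about those payloads):

# workers/saga_dispatcher.py
# Sketch: route incoming status events to the orchestrator's callbacks.
async def dispatch_order_event(orchestrator: OrderSagaOrchestrator, event: dict):
    event_type = event["event_type"]      # assumed field names
    order_id = event["order_id"]
    if event_type == "RestaurantConfirmed":
        await orchestrator.on_restaurant_confirmed(
            order_id, event["prep_time_minutes"])
    elif event_type == "RestaurantRejected":
        await orchestrator.on_restaurant_rejected(order_id, event["reason"])
    elif event_type == "DriverPickedUp":
        await orchestrator.on_driver_picked_up(order_id)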
Deep Dive 5: Real-Time Location & Caching (Week 4)
Interviewer: "With 50K active drivers updating location every 3 seconds, how do you handle that load?"
You: "This is a perfect caching problem. We use Redis GEO for spatial queries and smart cache strategies."
# services/location_service.py
# Applies: Week 4 - Caching, Week 1 - Hot Keys
from datetime import datetime, timedelta
from typing import Optional, List, Tuple
import asyncio
import json
class LocationService:
"""
High-frequency location tracking with caching.
Scale: 50K drivers × 1 update/3sec = 17K updates/sec
Strategy:
- Redis GEO for current locations (hot data)
- Batch writes to PostgreSQL for history (cold data)
- Cache customer's driver location (read-heavy)
"""
def __init__(self, redis_client, db_pool):
self.redis = redis_client
self.db = db_pool
self._history_buffer = []
self._buffer_lock = asyncio.Lock()
async def update_location(
self,
driver_id: str,
lat: float,
lng: float,
timestamp: datetime
):
"""
Update driver location.
Hot path: Redis GEO (immediate)
Cold path: Batched PostgreSQL writes (async)
"""
# Update Redis GEO (immediate, for spatial queries)
await self.redis.geoadd(
"driver_locations",
lng, lat, driver_id
)
# Store current location with timestamp
await self.redis.hset(
f"driver:{driver_id}:location",
mapping={
"lat": str(lat),
"lng": str(lng),
"timestamp": timestamp.isoformat()
}
)
# Buffer for batch history write
async with self._buffer_lock:
self._history_buffer.append({
"driver_id": driver_id,
"lat": lat,
"lng": lng,
"timestamp": timestamp
})
# Flush if buffer is large
if len(self._history_buffer) >= 1000:
await self._flush_history()
async def get_driver_location(
self,
driver_id: str
) -> Optional[Tuple[float, float, datetime]]:
"""Get current location for a driver."""
data = await self.redis.hgetall(f"driver:{driver_id}:location")
if not data:
return None
return (
float(data[b"lat"]),
float(data[b"lng"]),
datetime.fromisoformat(data[b"timestamp"].decode())
)
async def get_location_for_customer(
self,
order_id: str,
driver_id: str
) -> Optional[dict]:
"""
Get driver location for customer tracking.
Uses cache to reduce Redis hits for popular orders.
"""
cache_key = f"tracking:{order_id}:location"
# Check cache first (5 second TTL)
cached = await self.redis.get(cache_key)
if cached:
return json.loads(cached)
# Get fresh location
location = await self.get_driver_location(driver_id)
if not location:
return None
lat, lng, timestamp = location
# Calculate ETA
order = await self._get_order(order_id)
eta = await self._calculate_eta(
(lat, lng),
order.delivery_address
)
result = {
"lat": lat,
"lng": lng,
"timestamp": timestamp.isoformat(),
"eta_minutes": eta
}
# Cache for 5 seconds
await self.redis.setex(
cache_key,
5,
json.dumps(result)
)
return result
async def find_nearby_drivers(
self,
lat: float,
lng: float,
radius_km: float,
limit: int = 50
) -> List[str]:
"""Find drivers within radius using Redis GEO."""
results = await self.redis.georadius(
"driver_locations",
lng, lat,
radius_km,
unit="km",
count=limit,
sort="ASC" # Closest first
)
return [r.decode() for r in results]
async def _flush_history(self):
"""Batch write location history to PostgreSQL."""
if not self._history_buffer:
return
buffer = self._history_buffer
self._history_buffer = []
# Batch insert
await self.db.executemany("""
INSERT INTO driver_location_history
(driver_id, lat, lng, recorded_at)
VALUES ($1, $2, $3, $4)
""", [
(h["driver_id"], h["lat"], h["lng"], h["timestamp"])
for h in buffer
])
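One gap worth calling out: the history buffer only flushes once it reaches 1,000 entries, so at low traffic entries could sit unflushed indefinitely. A small time-based flush task closes that gap (sketch):

# Sketch: periodic flush so location history is durable even at low traffic.
async def run_periodic_flush(location_service: LocationService,
                             interval_sec: float = 5.0):
    while True:
        await asyncio.sleep(interval_sec)
        # Take the same lock the update path uses before flushing
        async with location_service._buffer_lock:
            await location_service._flush_history()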
Deep Dive 6: Rate Limiting & Backpressure (Week 1 & 3)
Interviewer: "What if someone tries to abuse the system — placing hundreds of orders?"
You: "We implement rate limiting at multiple layers and handle backpressure gracefully."
# services/rate_limiter.py
# Applies: Week 1, Day 3 - Rate Limiting; Week 3, Day 3 - Backpressure
from datetime import datetime
from typing import Tuple
import uuid
class MultiLayerRateLimiter:
"""
Rate limiting at multiple layers.
Layers:
1. Per-user: 10 orders/hour
2. Per-IP: 100 requests/minute
3. Per-restaurant: 500 orders/hour (prevent overwhelming kitchen)
4. Global: System capacity protection
"""
def __init__(self, redis_client):
self.redis = redis_client
self.limits = {
"user_orders": {"count": 10, "window": 3600}, # 10/hour
"ip_requests": {"count": 100, "window": 60}, # 100/min
"restaurant_orders": {"count": 500, "window": 3600}, # 500/hour
"global_orders": {"count": 1000, "window": 1}, # 1000/sec
}
async def check_order_allowed(
self,
user_id: str,
ip_address: str,
restaurant_id: str
) -> Tuple[bool, str]:
"""
Check if order is allowed by all rate limits.
Returns (allowed, reason).
"""
# Check user limit
if not await self._check_limit(f"ratelimit:user:{user_id}", "user_orders"):
return (False, "Too many orders. Please try again later.")
# Check IP limit
if not await self._check_limit(f"ratelimit:ip:{ip_address}", "ip_requests"):
return (False, "Too many requests. Please slow down.")
# Check restaurant limit
if not await self._check_limit(f"ratelimit:restaurant:{restaurant_id}", "restaurant_orders"):
return (False, "Restaurant is very busy. Please try again soon.")
# Check global limit
if not await self._check_limit("ratelimit:global:orders", "global_orders"):
return (False, "System is busy. Please try again in a moment.")
return (True, "")
async def _check_limit(self, key: str, limit_type: str) -> bool:
"""Check if under rate limit using sliding window."""
config = self.limits[limit_type]
now = datetime.utcnow().timestamp()
window_start = now - config["window"]
# Sliding window counter with Redis
pipe = self.redis.pipeline()
pipe.zremrangebyscore(key, 0, window_start)
pipe.zcard(key)
        # Unique member per request: str(now) alone can collide under load
        pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
pipe.expire(key, config["window"])
results = await pipe.execute()
current_count = results[1]
return current_count < config["count"]
class BackpressureManager:
"""
Handles system backpressure during overload.
Applies: Week 3, Day 3 - Backpressure
"""
def __init__(self, redis_client, kafka_admin):
self.redis = redis_client
self.kafka = kafka_admin
async def get_system_pressure(self) -> float:
"""
Calculate current system pressure (0-1).
Based on:
- Kafka consumer lag
- Worker queue depth
- Database connection usage
"""
# Get Kafka lag
lag = await self._get_kafka_lag()
max_acceptable_lag = 100000
lag_pressure = min(1.0, lag / max_acceptable_lag)
# Get worker queue depth
queue_depth = await self.redis.llen("order_processing_queue")
max_queue = 10000
queue_pressure = min(1.0, queue_depth / max_queue)
# Weighted average
return (lag_pressure * 0.6) + (queue_pressure * 0.4)
async def should_accept_order(self) -> Tuple[bool, str]:
"""
Determine if we should accept new orders.
Implements graceful degradation:
- pressure < 0.7: Accept all
- pressure 0.7-0.9: Reject new customers
- pressure > 0.9: Emergency mode
"""
pressure = await self.get_system_pressure()
if pressure < 0.7:
return (True, "normal")
if pressure < 0.9:
return (True, "degraded") # Maybe skip some features
return (False, "System at capacity. Please try again shortly.")
Phase 5: Scaling and Edge Cases
Interviewer: "How would this system scale to 10x the current load?"
Scaling Strategy
You: "Let me analyze what breaks at 10x and how we'd address it."
SCALING TO 10X (50M ORDERS/DAY)
Current: 5M orders/day, 230/sec peak
Target: 50M orders/day, 2,300/sec peak
BOTTLENECK ANALYSIS
Component │ Current Limit │ 10x Solution
─────────────────┼──────────────────┼────────────────────────────
API Servers │ 30 instances │ 300 instances (horizontal)
PostgreSQL Write │ 5K writes/sec │ Shard by region/restaurant
PostgreSQL Read │ 30K reads/sec │ More read replicas + cache
Redis │ 100K ops/sec │ Redis Cluster (10 shards)
Kafka │ 100K msgs/sec │ More partitions (320)
Location Updates │ 17K/sec │ Batch + Redis Cluster
SHARDING STRATEGY
Orders: Shard by city/region
- NYC orders → Shard 1
- LA orders → Shard 2
- etc.
Drivers: Shard by operating region
Restaurants: Shard by city
Cross-shard queries minimized:
- Orders are local to city
- Drivers operate in one city
- Restaurants are in one city
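Routing then becomes a deterministic lookup from city to shard. A minimal sketch (the mapping and shard count are illustrative):

# Sketch: city-based shard routing. The pinned mapping is illustrative.
import hashlib

CITY_TO_SHARD = {"nyc": 1, "la": 2, "chicago": 3}

def shard_for_city(city: str, num_shards: int = 16) -> int:
    """Pinned shard for known cities; stable hash for everything else."""
    if city in CITY_TO_SHARD:
        return CITY_TO_SHARD[city]
    digest = hashlib.md5(city.encode()).hexdigest()
    return int(digest, 16) % num_shards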
Edge Cases
Interviewer: "What edge cases do you need to handle?"
EDGE CASES
1. DRIVER GOES OFFLINE MID-DELIVERY
- Detect via missing location updates (>5 min)
- Auto-reassign to new driver
- Notify customer with new ETA
2. RESTAURANT CLOSES AFTER ORDER PLACED
- Saga compensation: void payment, notify customer
- Offer alternatives or credit
3. CUSTOMER CANCELS AFTER FOOD PREPARED
- If before pickup: full refund
- If after pickup: partial refund, driver keeps food
- Track cancellation rate per customer
4. PAYMENT CAPTURE FAILS AFTER PICKUP
- Food is already with driver — can't undo
- Queue for retry with exponential backoff (see the sketch after this list)
- After N failures: flag for manual review
- Don't block delivery
5. DOUBLE ORDER (USER CLICKS TWICE)
- Idempotency key prevents duplicate charges
- Dedup window: 5 minutes
6. DRIVER AT WRONG LOCATION
- Compare GPS with restaurant/customer address
- Alert if >500m off
- Allow manual override
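For edge case 4, retrying capture is safe precisely because capture is idempotent per order (Deep Dive 2). A minimal backoff sketch (flag_for_manual_review is a hypothetical escalation hook):

# Sketch: retry payment capture with exponential backoff and jitter.
import asyncio
import random

async def capture_with_retry(payments, order_id, auth_id, amount,
                             max_attempts: int = 5, base_delay: float = 2.0):
    for attempt in range(max_attempts):
        result = await payments.capture(order_id, auth_id, amount)
        if result.success:
            return result
        # 2s, 4s, 8s, ... plus jitter to avoid synchronized retry storms
        await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    await flag_for_manual_review(order_id)  # hypothetical escalation hook
    return result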
Failure Scenarios
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| PostgreSQL primary | Health check | Write failures | Promote replica |
| Redis down | Connection errors | Cache miss, no rate limit | Fall back to DB |
| Kafka broker | Producer errors | Event delay | Buffer in memory, retry |
| Payment service | Timeouts | Can't place orders | Show "payment unavailable" |
| Location service | Missing updates | Stale ETAs | Show "last known" |
Phase 6: Monitoring and Operations
Interviewer: "How would you monitor this system in production?"
Key Metrics
BUSINESS METRICS
├── Orders placed/minute: Target 4,000, Alert if < 3,000
├── Order success rate: Target 98%, Alert if < 95%
├── Avg delivery time: Target 35 min, Alert if > 45 min
└── Driver utilization: Target 70%, Alert if < 50%
SYSTEM METRICS
├── API latency p99: Target 100ms, Alert if > 500ms
├── Kafka consumer lag: Target < 1000, Alert if > 10,000
├── Database connections: Target < 80%, Alert if > 90%
└── Payment success rate: Target 99%, Alert if < 97%
Monitoring Dashboard
┌─────────────────────────────────────────────────────────────────────────┐
│ FOOD DELIVERY OPERATIONS │
│ │
│ ORDER HEALTH │
│ ├── Orders/min: [████████░░] 3,847 │
│ ├── Success rate: [█████████░] 98.2% │
│ └── Avg delivery time: [███████░░░] 32 min │
│ │
│ SYSTEM HEALTH │
│ ├── API p99 latency: [███░░░░░░░] 87ms │
│ ├── Kafka lag: [█░░░░░░░░░] 342 │
│ └── DB connections: [██████░░░░] 62% │
│ │
│ PAYMENTS │
│ ├── Auth success: [█████████░] 99.1% │
│ ├── Capture success: [█████████░] 99.8% │
│ └── Pending captures: 127 │
│ │
│ DRIVER FLEET (50,234 active) │
│ ├── Available: [████████░░] 18,421 │
│ ├── On delivery: [███████░░░] 31,813 │
│ └── Avg orders/driver: 2.3 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Alerting Strategy
CRITICAL (Page immediately):
- Order success rate < 90%
- Payment failures > 5%
- All drivers in city unavailable
- Database primary down
WARNING (Slack, investigate):
- Order success rate < 95%
- Kafka lag > 10,000
- Driver assignment > 60 sec avg
- Payment auth latency > 2s
INFO (Dashboard only):
- Peak traffic approaching
- New restaurant onboarding
- Scheduled maintenance window
Runbook: High Order Failure Rate
RUNBOOK: Order Success Rate Dropped
SYMPTOMS:
- Order success rate alert triggered
- Customer complaints in support queue
DIAGNOSIS:
1. Check by failure reason:
SELECT reason, COUNT(*) FROM order_failures
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY reason;
2. Check payment health:
- Stripe status page
- Payment auth latency metrics
3. Check restaurant connectivity:
- WebSocket connection count
- Restaurant app errors
RESOLUTION:
- If payment issue: Enable backup processor
- If restaurant app: Check for app store update issues
- If driver shortage: Expand driver radius
- If system overload: Enable degraded mode
Interview Conclusion
Interviewer: "Excellent work. You've covered a lot of ground. Let me ask one final question — if you had to build this from scratch, what would you build first?"
You: "I'd prioritize in this order:
1. Core order flow with transactional outbox — this is the money path
2. Payment integration with idempotency — can't lose money
3. Restaurant notification — orders need to get to kitchens
4. Basic driver assignment — start with simple nearest-driver
5. Customer notifications — keep users informed
The saga orchestration, real-time tracking, and advanced driver matching can come in phase 2. The key is getting orders reliably from customers to restaurants first."
Interviewer: "Great prioritization. Any questions for me?"
You: "Yes — what's been the biggest operational challenge you've faced at scale? I'm curious what breaks that you don't expect."
Summary: Weeks 1-6 Concepts Applied
Week 1: Data at Scale
| Concept | Application |
|---|---|
| Partitioning | Orders sharded by region, drivers by city |
| Replication | PostgreSQL read replicas for order queries |
| Rate Limiting | Multi-layer: user, IP, restaurant, global |
| Hot Keys | Driver locations in Redis GEO |
Week 2: Failure-First Design
| Concept | Application |
|---|---|
| Timeouts | Restaurant confirmation timeout (5 min) |
| Idempotency | Payment auth/capture with idempotency keys |
| Circuit Breakers | Payment service circuit breaker |
| Retry Strategies | Exponential backoff for payment capture |
Week 3: Messaging and Async
| Concept | Application |
|---|---|
| Transactional Outbox | Order creation → Kafka publishing |
| Consumer Groups | Order workers, notification workers |
| Backpressure | System pressure monitoring |
| Dead Letter Queue | Failed payment captures |
Week 4: Caching
| Concept | Application |
|---|---|
| Cache-Aside | Driver location caching |
| Write-Behind | Location history batching |
| Cache Invalidation | TTL-based for tracking cache |
| Multi-tier | Redis for hot, PostgreSQL for cold |
Week 5: Consistency and Coordination
| Concept | Application |
|---|---|
| Saga Pattern | Order lifecycle orchestration |
| Compensation | Payment void, driver release on failure |
| Optimistic Locking | Driver assignment race prevention |
| Distributed Locks | Idempotency key locking |
Week 6: Notification Platform
| Concept | Application |
|---|---|
| Multi-channel | Push + SMS for order updates |
| Priority | Critical (payment) vs normal (status) |
| User Preferences | Notification opt-in/out |
| Real-time | WebSocket for restaurant orders |
Self-Assessment Checklist
After studying this capstone, verify you can:
- Design a transactional outbox for reliable event publishing
- Implement idempotency for payment operations
- Use optimistic locking to prevent race conditions
- Design a saga with compensation for multi-step workflows
- Implement rate limiting at multiple layers
- Handle backpressure during system overload
- Use Redis GEO for spatial queries
- Design caching strategies for high-frequency updates
- Estimate storage and traffic for a large-scale system
- Create monitoring dashboards and alerting strategies
- Analyze failure scenarios and recovery procedures
- Prioritize features for MVP vs later phases
This capstone integrates all concepts from Weeks 1-6 of the System Design Mastery Series. Use this as a template for approaching similar interview problems involving transactional workflows, real-time updates, and distributed coordination.