Week 9 Capstone: Design a Multi-Tenant Enterprise SaaS Platform
🎯 A Complete System Design Interview Integrating Everything You've Learned
The Interview Begins
You walk into the interview room at a well-known B2B SaaS company. The interviewer, a Staff Engineer, greets you warmly.
Interviewer: "Thanks for coming in today. We're going to work through a system design problem that reflects challenges we face here. I want to see how you think through complex multi-tenant systems. Feel free to ask questions β this is collaborative."
They turn to the whiteboard and write:
──────────────────────────────────────────────────────────────────────────
  Design: Enterprise Document Management Platform

  You're building a B2B SaaS platform where companies can:

  • Store and organize documents (contracts, policies, reports)
  • Search across all their documents with full-text search
  • Collaborate on documents with comments and version history
  • Set granular permissions (who can view/edit/delete)
  • Generate audit trails for compliance
  • Export their data for compliance or migration

  Customer profile:
  • 500 enterprise customers (tenants)
  • Mix of US, EU, and APAC customers
  • Some customers in regulated industries (healthcare, finance)
  • Largest customer has 50,000 users and 10M documents
  • Smallest customers have 50 users and 10K documents

  Key concerns from sales:
  • EU customers asking about GDPR and data residency
  • Healthcare customers need HIPAA compliance
  • Enterprise customers want SSO and audit logs
  • Several deals lost because "search was too slow"
──────────────────────────────────────────────────────────────────────────
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
You take a breath and start asking questions.
You: "Before I dive into the design, I'd like to clarify some requirements. Let me start with scale and then move to compliance."
Clarifying Questions
You: "For scale β you mentioned 500 tenants with the largest having 10M documents. What's the average document size, and what's our total storage footprint?"
Interviewer: "Average document is about 500KB. Mix of PDFs, Word docs, and some larger files like presentations. Total storage across all tenants is about 50TB, growing 20% annually."
You: "What's our request volume? How many searches, uploads, downloads per day?"
Interviewer: "Peak hours see about 1,000 requests per second across all tenants. Searches are the most common operation β maybe 60% of traffic. Uploads are bursty, especially end of quarter when companies finalize contracts."
You: "For the EU data residency requirement β do EU customers need all their data in the EU, or just personal data?"
Interviewer: "Great question. Our legal team says all document content and metadata for EU customers must stay in EU. Some operational data like aggregate metrics can be global."
You: "For the regulated industries β are we targeting HIPAA certification, or just 'HIPAA-ready' architecture?"
Interviewer: "We want the architecture to support HIPAA. Full certification is a business decision, but the technical foundation must be there. Same for SOC 2."
You: "One more β for the largest customers, do they get dedicated infrastructure, or is everyone on shared infrastructure with isolation?"
Interviewer: "We want to offer both. Most customers on shared infrastructure with strong isolation. Enterprise tier customers can opt for dedicated resources at premium pricing."
You: "Perfect. Let me summarize the requirements."
Functional Requirements
1. DOCUMENT MANAGEMENT
   ├── Upload documents (PDF, Word, Excel, images)
   ├── Organize in folders/hierarchies
   ├── Version history with rollback
   ├── Preview and download
   └── Bulk operations (move, delete, export)

2. SEARCH
   ├── Full-text search across document content
   ├── Metadata search (author, date, tags)
   ├── Filters and facets
   ├── Search within folders
   └── < 500ms p99 latency for search

3. COLLABORATION
   ├── Comments on documents
   ├── @mentions and notifications
   ├── Sharing with granular permissions
   ├── Activity feed per document
   └── Real-time presence (who's viewing)

4. ACCESS CONTROL
   ├── Role-based permissions (viewer, editor, admin)
   ├── Folder-level and document-level permissions
   ├── External sharing with expiring links
   ├── SSO integration (SAML, OIDC)
   └── MFA support

5. COMPLIANCE
   ├── Complete audit trail (who did what, when)
   ├── Data export (GDPR portability)
   ├── Data deletion (GDPR right to erasure)
   ├── Retention policies
   └── Legal hold capability
Non-Functional Requirements
1. SCALE
   ├── 500 tenants, growing to 2,000
   ├── 50TB storage, growing 20% annually
   ├── 1,000 requests/second peak
   ├── Largest tenant: 50K users, 10M documents
   └── Support 100K concurrent users globally

2. LATENCY
   ├── Search: < 500ms p99
   ├── Document preview: < 2s p99
   ├── Upload (10MB): < 5s p99
   └── API calls: < 200ms p99

3. AVAILABILITY
   ├── 99.9% uptime SLA
   ├── No single point of failure
   ├── Graceful degradation
   └── < 4 hours RTO, < 1 hour RPO

4. COMPLIANCE
   ├── EU data residency for EU customers
   ├── HIPAA-ready architecture
   ├── SOC 2 Type II controls
   └── GDPR compliance (consent, deletion, portability)

5. SECURITY
   ├── Encryption at rest and in transit
   ├── Tenant isolation (data and resources)
   ├── Zero trust architecture
   └── Regular security audits
Phase 2: Back-of-the-Envelope Estimation (5 minutes)
You: "Let me work through the numbers to validate our architecture decisions."
Storage Estimation
DOCUMENT STORAGE

Current state:
  Total documents: ~100M (across all tenants)
  Average document size: 500KB
  Total storage: 100M × 500KB = 50TB
  With metadata overhead: 50TB × 1.2 = 60TB
  With 3x replication: 60TB × 3 = 180TB raw storage

Growth projection (20% annually):
  Year 1: 60TB
  Year 2: 72TB
  Year 3: 86TB
  Year 5: ~124TB

Per-tenant storage:
  Average tenant: 100GB (200K documents)
  Largest tenant: 5TB (10M documents)
  Smallest tenant: 5GB (10K documents)
Traffic Estimation
REQUEST TRAFFIC

Peak requests: 1,000/second
Daily requests: ~50M (assuming 8-hour peak)

Breakdown by operation:
  ├── Search: 600/sec (60%)
  ├── Read/download: 250/sec (25%)
  ├── Write/upload: 100/sec (10%)
  └── Other (permissions, etc.): 50/sec (5%)

Search index size:
  Average document text: 10KB extracted text
  Total index size: 100M × 10KB = 1TB
  With Elasticsearch overhead: ~3TB
Infrastructure Estimation
──────────────────────────────────────────────────────────────────────────
                        INFRASTRUCTURE ESTIMATE

  COMPUTE
  ├── API servers: 20 instances (m5.xlarge)
  ├── Search cluster: 9 nodes (r5.2xlarge), 3 per region
  ├── Background workers: 10 instances (m5.large)
  └── Total: ~40 instances

  STORAGE
  ├── PostgreSQL: Multi-AZ, 2TB per region
  ├── Elasticsearch: 3TB per region
  ├── S3: 180TB (with replication)
  └── Redis: 50GB cluster

  REGIONS
  ├── US (us-east-1): Primary for US customers
  ├── EU (eu-central-1): Primary for EU customers
  └── APAC (ap-southeast-1): Primary for APAC customers
──────────────────────────────────────────────────────────────────────────
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture. Given our multi-region, multi-tenant requirements, I'll design for regional data isolation with a global control plane."
System Architecture
──────────────────────────────────────────────────────────────────────────
                        HIGH-LEVEL ARCHITECTURE

                      ┌──────────────────────┐
                      │ Global Control Plane │
                      │ ├── Tenant Config    │
                      │ ├── Routing Rules    │
                      │ └── Feature Flags    │
                      └──────────┬───────────┘
                ┌────────────────┼────────────────┐
                ▼                ▼                ▼
          ┌───────────┐    ┌───────────┐    ┌─────────────┐
          │ US Region │    │ EU Region │    │ APAC Region │
          └───────────┘    └───────────┘    └─────────────┘

  Each region runs an identical, fully isolated stack:

      CDN → WAF → ALB → API Cluster
                           ├── PostgreSQL (metadata)
                           ├── Elasticsearch (search index)
                           └── S3 bucket (document storage)

      US DATA ONLY    │    EU DATA ONLY    │    APAC DATA ONLY

             NO CROSS-REGION DATA FLOW FOR TENANT DATA
──────────────────────────────────────────────────────────────────────────
Component Breakdown
You: "Let me walk through each major component and the key design decisions."
1. Global Control Plane
Purpose: Manage tenant configuration and routing without storing personal data.
Global Control Plane contains:
  ├── Tenant registry (which region each tenant is in)
  ├── Feature flags (which features enabled per tenant)
  ├── Plan/quota configuration
  ├── Routing rules
  └── Global admin interface

Does NOT contain:
  ├── User data
  ├── Document content
  ├── Any personal information
  └── Audit logs (those stay regional)
Technology: Single PostgreSQL with read replicas in each region for low-latency lookups.
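A minimal sketch of what that registry lookup might look like, assuming an asyncpg-style pool against the local read replica and a short in-process cache; the `tenants` table and column names are illustrative, not the platform's actual schema:

# tenant_registry.py - illustrative sketch; table/column names are assumptions
import time
from typing import Optional

class TenantRegistry:
    """Resolves a tenant slug to its routing record via the regional read replica."""

    def __init__(self, replica_pool, cache_ttl_seconds: int = 60):
        self.db = replica_pool
        self.ttl = cache_ttl_seconds
        self._cache: dict[str, tuple[float, Optional[dict]]] = {}

    async def get_tenant(self, slug: str) -> Optional[dict]:
        cached = self._cache.get(slug)
        if cached and time.monotonic() - cached[0] < self.ttl:
            return cached[1]
        row = await self.db.fetchrow(
            "SELECT id, slug, region, plan, isolation_level, encryption_key_id "
            "FROM tenants WHERE slug = $1 AND status = 'active'",
            slug,
        )
        record = dict(row) if row else None
        self._cache[slug] = (time.monotonic(), record)
        return record

The short cache TTL keeps routing lookups off the hot path while still letting control-plane changes (plan upgrades, region moves) propagate within a minute.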
2. Regional Data Plane
Purpose: Store and process all tenant data within the region.
Each region has independent:
  ├── API cluster (stateless, horizontally scaled)
  ├── PostgreSQL (users, document metadata, permissions)
  ├── Elasticsearch (search index)
  ├── S3 (document storage)
  ├── Redis (cache, sessions, rate limiting)
  └── Kafka (event streaming, audit logs)
Key Decision: Complete data isolation. EU tenant data never leaves EU region.
3. Document Processing Pipeline
Document Upload Flow:

User uploads document
        │
        ▼
┌───────────────────┐
│    API Gateway    │  ← validates, rate limits
└─────────┬─────────┘
          ▼
┌───────────────────┐
│  Upload Service   │  ← generates presigned S3 URL
└─────────┬─────────┘
    ┌─────┴──────────────┐
    ▼                    ▼
┌────────────────┐  ┌────────────────┐
│ S3 (encrypted) │  │   PostgreSQL   │
│ store document │  │ store metadata │
└───────┬────────┘  └────────────────┘
        ▼
┌───────────────────┐
│ Processing Queue  │  ← async processing
└─────────┬─────────┘
     ┌────┴─────┐
     ▼          ▼
┌─────────┐ ┌───────────┐
│  Text   │ │ Thumbnail │
│ extract │ │ generate  │
└────┬────┘ └───────────┘
     ▼
┌───────────────────┐
│  Elasticsearch    │  ← index for search
│  update index     │
└───────────────────┘
4. Search Architecture
Search Flow:

User searches "contract 2024"
        │
        ▼
┌───────────────────┐
│    API Gateway    │
└─────────┬─────────┘
          ▼
┌───────────────────────────────────────┐
│             Search Service            │
│  ├── add tenant_id filter (CRITICAL!) │
│  ├── add permission filter            │
│  └── build ES query                   │
└─────────┬─────────────────────────────┘
          ▼
┌───────────────────────────────────────┐
│             Elasticsearch             │
│  ├── tenant-specific index, OR        │
│  └── filtered query with tenant_id    │
└─────────┬─────────────────────────────┘
          ▼
┌───────────────────┐
│   Results with    │
│ permission check  │
└───────────────────┘
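A sketch of how the search service might inject those filters, assuming an elasticsearch-py-style async client; index names, field names, and the `tenant_scoped_search` helper are illustrative:

# search_service.py - sketch of tenant filter injection; names are assumptions
async def tenant_scoped_search(
    es, tenant_id: str, plan: str, user_query: dict, accessible_doc_ids: list[str]
):
    # Enterprise tenants get a dedicated index; the tenant_id filter is applied
    # regardless, so even an index-routing bug cannot widen the result set
    index = f"documents_{tenant_id}" if plan == "enterprise" else "documents_shared"
    return await es.search(
        index=index,
        body={
            "query": {
                "bool": {
                    "must": [user_query],
                    "filter": [
                        {"term": {"tenant_id": tenant_id}},      # NEVER omitted
                        {"terms": {"_id": accessible_doc_ids}},  # permission pre-filter
                    ],
                }
            }
        },
    )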
Phase 4: Deep Dives (20 minutes)
Interviewer: "Great high-level design. Let's dive deeper into some areas. First, tell me about tenant isolation. How do you ensure one tenant can never see another's data?"
Deep Dive 1: Tenant Isolation (Week 9, Day 1)
You: "Tenant isolation is the most critical aspect of this system. I'll implement defense in depth with multiple isolation layers."
The Problem
WITHOUT PROPER ISOLATION:

Tenant A searches for "confidential"
        │
        ▼
SELECT * FROM documents
WHERE content LIKE '%confidential%'
        │
        ▼
Returns documents from ALL tenants,
including Tenant B's confidential contracts.
This is a catastrophic data breach.
Multi-Layer Isolation Strategy
──────────────────────────────────────────────────────────────────────────
                       TENANT ISOLATION LAYERS

  LAYER 1: REQUEST ROUTING
  ├── Tenant determined from subdomain (acme.docplatform.com)
  ├── Or from JWT token claims
  ├── Set in request context at API gateway
  └── CANNOT be overridden by request parameters

  LAYER 2: APPLICATION ENFORCEMENT
  ├── TenantContext set for every request
  ├── Repository layer auto-filters by tenant_id
  ├── Service layer validates tenant ownership
  └── Audit log records tenant context

  LAYER 3: DATABASE ENFORCEMENT (RLS)
  ├── Row Level Security policies on all tables
  ├── Database session has current_tenant set
  ├── Even raw SQL queries are filtered
  └── DBA cannot accidentally see cross-tenant data

  LAYER 4: SEARCH INDEX ISOLATION
  ├── Option A: Index per tenant (strong isolation)
  ├── Option B: Shared index with tenant_id filter
  ├── We use Option A for enterprise, B for standard
  └── Search service enforces tenant filter

  LAYER 5: STORAGE ISOLATION
  ├── S3 path includes tenant_id
  ├── IAM policies restrict cross-tenant access
  ├── Presigned URLs scoped to tenant prefix
  └── Encryption keys per tenant (enterprise)
──────────────────────────────────────────────────────────────────────────
Implementation
# tenant_isolation.py - Core tenant context management
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional
from enum import Enum

from starlette.responses import JSONResponse
# NOTE: decode_jwt (used below) is assumed to be a project helper that
# verifies the token signature and returns its claims
# Thread-safe tenant context
_tenant_context: ContextVar[Optional['TenantContext']] = ContextVar(
'tenant_context', default=None
)
class IsolationLevel(Enum):
"""Tenant isolation levels by plan."""
SHARED = "shared" # Shared tables, RLS
SCHEMA = "schema" # Schema per tenant
DATABASE = "database" # Database per tenant
DEDICATED = "dedicated" # Dedicated infrastructure
@dataclass(frozen=True)
class TenantContext:
"""Immutable tenant context for request processing."""
tenant_id: str
tenant_name: str
region: str
plan: str
isolation_level: IsolationLevel
encryption_key_id: str
features: frozenset
class TenantContextError(RuntimeError):
    """Raised when code runs outside an established tenant context."""

def get_current_tenant() -> TenantContext:
    """Get current tenant context. Raises if not set."""
    ctx = _tenant_context.get()
    if ctx is None:
        raise TenantContextError("No tenant context set")
    return ctx
class TenantMiddleware:
"""
Middleware that establishes tenant context for every request.
Tenant is determined from:
    1. Subdomain (acme.docplatform.com → acme)
2. JWT token claims
3. API key lookup
NEVER from request body or query parameters.
"""
async def __call__(self, request, call_next):
# Extract tenant from subdomain
host = request.headers.get("host", "")
subdomain = host.split(".")[0]
# Or from JWT
if not subdomain or subdomain in ["www", "api"]:
token = request.headers.get("authorization", "").replace("Bearer ", "")
claims = decode_jwt(token)
subdomain = claims.get("tenant_id")
if not subdomain:
return JSONResponse({"error": "Tenant not identified"}, status_code=400)
# Load tenant configuration
tenant_config = await self.tenant_service.get_tenant(subdomain)
if not tenant_config:
return JSONResponse({"error": "Tenant not found"}, status_code=404)
# Set immutable context
context = TenantContext(
tenant_id=tenant_config.id,
tenant_name=tenant_config.name,
region=tenant_config.region,
plan=tenant_config.plan,
isolation_level=IsolationLevel(tenant_config.isolation_level),
encryption_key_id=tenant_config.encryption_key_id,
features=frozenset(tenant_config.features)
)
# Set context for this request
token = _tenant_context.set(context)
try:
response = await call_next(request)
return response
finally:
_tenant_context.reset(token)
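To make the wiring concrete, here is a minimal sketch of how this middleware might be mounted, assuming a FastAPI/Starlette application; `TenantService` and `db_pool` are illustrative stand-ins, not names from the original design:

# app_wiring.py - illustrative sketch; TenantService and db_pool are assumed
from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware

app = FastAPI()
app.add_middleware(
    BaseHTTPMiddleware,
    dispatch=TenantMiddleware(tenant_service=TenantService()),
)

@app.get("/documents")
async def list_documents():
    # Tenant context is already established; no tenant_id parameter is accepted
    repo = TenantAwareRepository(db_pool)
    return await repo.find_all("documents")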
class TenantAwareRepository:
"""
Repository base class that enforces tenant isolation.
ALL database access goes through this class.
"""
def __init__(self, db_pool):
self.db = db_pool
    async def find_by_id(self, table: str, id: str):
        """Find record by ID within current tenant."""
        tenant = get_current_tenant()
        # ALWAYS filter by tenant_id. `table` is an internal constant, never
        # user input, so the f-string interpolation is safe here.
        return await self.db.fetchone(
            f"SELECT * FROM {table} WHERE id = $1 AND tenant_id = $2",
            id, tenant.tenant_id
        )
async def find_all(self, table: str, filters: dict = None):
"""Find all records within current tenant."""
tenant = get_current_tenant()
query = f"SELECT * FROM {table} WHERE tenant_id = $1"
params = [tenant.tenant_id]
if filters:
for i, (key, value) in enumerate(filters.items(), start=2):
query += f" AND {key} = ${i}"
params.append(value)
return await self.db.fetch(query, *params)
async def create(self, table: str, data: dict):
"""Create record with automatic tenant_id."""
tenant = get_current_tenant()
# Force tenant_id - cannot be overridden
data["tenant_id"] = tenant.tenant_id
columns = ", ".join(data.keys())
placeholders = ", ".join(f"${i+1}" for i in range(len(data)))
return await self.db.fetchone(
f"INSERT INTO {table} ({columns}) VALUES ({placeholders}) RETURNING *",
*data.values()
)
Database Row-Level Security
-- PostgreSQL Row-Level Security Setup
-- Enable RLS on all tenant tables
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
ALTER TABLE folders ENABLE ROW LEVEL SECURITY;
ALTER TABLE comments ENABLE ROW LEVEL SECURITY;
ALTER TABLE audit_logs ENABLE ROW LEVEL SECURITY;
-- Create policy that filters by current tenant
CREATE POLICY tenant_isolation_policy ON documents
FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
CREATE POLICY tenant_isolation_policy ON users
FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
-- Similar for all tables...
-- Application sets tenant before any query
-- SET app.current_tenant_id = 'tenant-uuid-here';
-- Now even this query is filtered:
-- SELECT * FROM documents;
-- Only returns current tenant's documents!
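Tying the two layers together: the application has to bind the tenant to the database session before any query runs. A minimal sketch, assuming asyncpg and the `app.current_tenant_id` setting used by the policies above:

# rls_session.py - sketch: bind RLS to the current tenant per transaction
from contextlib import asynccontextmanager
from tenant_isolation import get_current_tenant

@asynccontextmanager
async def tenant_transaction(pool):
    """Acquire a connection whose RLS context is the current tenant."""
    tenant = get_current_tenant()
    async with pool.acquire() as conn:
        async with conn.transaction():
            # set_config(..., is_local=true) is equivalent to SET LOCAL: the
            # setting dies with the transaction, so pooled connections never
            # leak tenant context between requests
            await conn.execute(
                "SELECT set_config('app.current_tenant_id', $1, true)",
                tenant.tenant_id,
            )
            # NOTE: the application role must be a non-superuser without
            # BYPASSRLS, or the policies are silently skipped
            yield conn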
Interviewer: "What about the large enterprise tenant with 10M documents? Don't they need stronger isolation?"
You: "Absolutely. For enterprise customers, we offer dedicated isolation."
Enterprise Tier Isolation
# enterprise_isolation.py
class EnterpriseTenantManager:
"""
Manages dedicated resources for enterprise tenants.
"""
async def provision_enterprise_tenant(
self,
tenant_id: str,
config: EnterpriseConfig
):
"""
Provision dedicated resources for enterprise tenant.
Options:
- Dedicated database
- Dedicated Elasticsearch index
- Dedicated encryption keys
- Dedicated compute (optional)
"""
# 1. Create dedicated database
if config.dedicated_database:
await self._create_tenant_database(tenant_id)
# 2. Create dedicated search index
await self._create_tenant_search_index(tenant_id)
# 3. Create tenant-specific KMS key
key_id = await self._create_tenant_kms_key(tenant_id)
# 4. Update tenant config
await self.tenant_service.update_tenant(
tenant_id,
isolation_level=IsolationLevel.DATABASE,
encryption_key_id=key_id,
            database_name=f"tenant_{tenant_id.replace('-', '_')}",
search_index=f"documents_{tenant_id}"
)
async def _create_tenant_database(self, tenant_id: str):
"""Create dedicated PostgreSQL database for tenant."""
safe_name = tenant_id.replace("-", "_")
await self.admin_db.execute(f"""
CREATE DATABASE tenant_{safe_name};
""")
# Run migrations on new database
await self._run_migrations(f"tenant_{safe_name}")
async def _create_tenant_kms_key(self, tenant_id: str) -> str:
"""Create dedicated KMS key for tenant encryption."""
response = await self.kms.create_key(
Description=f"Encryption key for tenant {tenant_id}",
KeyUsage="ENCRYPT_DECRYPT",
Tags=[
{"TagKey": "tenant_id", "TagValue": tenant_id},
{"TagKey": "purpose", "TagValue": "document_encryption"}
]
)
return response["KeyMetadata"]["KeyId"]
Deep Dive 2: Noisy Neighbor Prevention (Week 9, Day 2)
Interviewer: "You mentioned the largest tenant has 10M documents. How do you prevent them from impacting smaller tenants?"
You: "This is the noisy neighbor problem. I'd implement quotas and fair scheduling at multiple levels."
Resource Quota System
──────────────────────────────────────────────────────────────────────────
                         QUOTA SYSTEM BY PLAN

  Resource               │ Standard  │ Professional │ Enterprise
  ───────────────────────┼───────────┼──────────────┼────────────
  API calls/minute       │ 1,000     │ 10,000       │ 100,000
  Search calls/minute    │ 100       │ 1,000        │ 10,000
  Storage (GB)           │ 100       │ 1,000        │ 10,000
  Documents              │ 100,000   │ 1,000,000    │ Unlimited*
  Users                  │ 100       │ 1,000        │ Unlimited
  Concurrent uploads     │ 5         │ 20           │ 100
  Max file size (MB)     │ 50        │ 200          │ 500
  Search query timeout   │ 10s       │ 30s          │ 60s
  Export rate (docs/hr)  │ 10,000    │ 100,000      │ 1,000,000

  * "Unlimited" = 100M with fair use policy
──────────────────────────────────────────────────────────────────────────
Implementation
# noisy_neighbor/rate_limiter.py
class TenantResourceManager:
"""
Manages and enforces resource quotas per tenant.
"""
def __init__(self, redis, quota_config):
self.redis = redis
self.quotas = quota_config
async def check_rate_limit(
self,
tenant_id: str,
resource: str,
cost: int = 1
) -> RateLimitResult:
"""
Check if request is within rate limits.
        Uses a fixed-window counter with per-tenant keys (atomic via Lua).
"""
tenant = await self.get_tenant_config(tenant_id)
limit = self.quotas[tenant.plan][resource]
key = f"ratelimit:{tenant_id}:{resource}"
# Lua script for atomic check-and-decrement
script = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local cost = tonumber(ARGV[2])
local window = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
-- Get current count
local current = tonumber(redis.call('GET', key) or '0')
if current + cost <= limit then
redis.call('INCRBY', key, cost)
redis.call('EXPIRE', key, window)
return {1, limit - current - cost}
else
local ttl = redis.call('TTL', key)
return {0, ttl}
end
"""
result = await self.redis.eval(
script,
keys=[key],
args=[limit, cost, 60, time.time()]
)
allowed = result[0] == 1
if not allowed:
await self._record_throttle(tenant_id, resource)
return RateLimitResult(
allowed=allowed,
remaining=result[1] if allowed else 0,
retry_after=result[1] if not allowed else None
)
async def check_concurrent_limit(
self,
tenant_id: str,
operation: str
) -> bool:
"""
Check concurrent operation limit.
Prevents one tenant from using all worker capacity.
"""
tenant = await self.get_tenant_config(tenant_id)
limit = self.quotas[tenant.plan][f"concurrent_{operation}"]
key = f"concurrent:{tenant_id}:{operation}"
current = await self.redis.get(key) or 0
return int(current) < limit
async def acquire_concurrent_slot(
self,
tenant_id: str,
operation: str,
operation_id: str,
ttl: int = 300
) -> bool:
"""Acquire a concurrent operation slot."""
if not await self.check_concurrent_limit(tenant_id, operation):
return False
key = f"concurrent:{tenant_id}:{operation}"
await self.redis.sadd(key, operation_id)
await self.redis.expire(key, ttl)
return True
async def release_concurrent_slot(
self,
tenant_id: str,
operation: str,
operation_id: str
):
"""Release a concurrent operation slot."""
key = f"concurrent:{tenant_id}:{operation}"
await self.redis.srem(key, operation_id)
class SearchQueryGuard:
"""
Guards search queries against noisy neighbors.
"""
async def guard_search(
self,
tenant_id: str,
query: SearchQuery
) -> tuple[bool, Optional[int]]:
"""
Check if search query should be allowed.
Returns (allowed, timeout_seconds).
"""
tenant = await self.tenant_service.get_tenant(tenant_id)
# Get tenant's search timeout
timeout = self.quotas[tenant.plan]["search_timeout"]
# Estimate query cost
cost = self._estimate_query_cost(query)
if cost > 100: # High cost query
# Check if tenant can run expensive queries
if tenant.plan == "standard":
return False, None
# Use longer timeout for expensive queries
timeout = min(timeout * 2, 120)
return True, timeout
def _estimate_query_cost(self, query: SearchQuery) -> int:
"""Estimate query cost based on complexity."""
cost = 1
# Wildcard queries are expensive
if "*" in query.text or "?" in query.text:
cost += 10
# Regex queries are very expensive
if query.use_regex:
cost += 50
# Large result sets
if query.limit > 100:
cost += query.limit // 100
return cost
Fair Scheduling for Background Jobs
# noisy_neighbor/fair_scheduler.py
class TenantFairQueue:
"""
Fair queue that prevents one tenant from monopolizing workers.
Uses weighted fair queuing based on tenant plan.
"""
PLAN_WEIGHTS = {
"standard": 1,
"professional": 3,
"enterprise": 10
}
async def enqueue(
self,
tenant_id: str,
job: Job
):
"""Add job to tenant's queue."""
tenant = await self.tenant_service.get_tenant(tenant_id)
# Check queue depth limit
current_depth = await self.get_queue_depth(tenant_id)
max_depth = self.quotas[tenant.plan]["max_queue_depth"]
if current_depth >= max_depth:
raise QueueFullError(
f"Queue full for tenant {tenant_id}. "
f"Current: {current_depth}, Max: {max_depth}"
)
# Add to tenant's queue
await self.redis.lpush(
f"queue:{tenant_id}",
job.serialize()
)
async def dequeue(self) -> Optional[tuple[str, Job]]:
"""
Dequeue next job using weighted fair scheduling.
Higher-weight tenants get proportionally more slots.
"""
# Get all tenant queues with pending jobs
tenant_queues = await self._get_active_queues()
if not tenant_queues:
return None
# Calculate weighted selection
weighted_tenants = []
for tenant_id, depth in tenant_queues.items():
tenant = await self.tenant_service.get_tenant(tenant_id)
weight = self.PLAN_WEIGHTS.get(tenant.plan, 1)
# Weight decreases as queue depth increases (fairness)
adjusted_weight = weight / (1 + depth * 0.1)
weighted_tenants.append((tenant_id, adjusted_weight))
# Select tenant based on weights
selected_tenant = self._weighted_random_choice(weighted_tenants)
# Pop job from selected tenant's queue
job_data = await self.redis.rpop(f"queue:{selected_tenant}")
if job_data:
return selected_tenant, Job.deserialize(job_data)
return None
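The `_weighted_random_choice` helper referenced above can be a straightforward roulette-wheel selection; a minimal sketch:

# sketch of the _weighted_random_choice helper used by dequeue()
import random

def _weighted_random_choice(self, weighted: list[tuple[str, float]]) -> str:
    """Pick a tenant_id with probability proportional to its adjusted weight."""
    total = sum(weight for _, weight in weighted)
    pick = random.uniform(0, total)
    cumulative = 0.0
    for tenant_id, weight in weighted:
        cumulative += weight
        if pick <= cumulative:
            return tenant_id
    return weighted[-1][0]  # guard against floating-point rounding

The standard library's random.choices with a weights argument would do the same in one call; the explicit loop just makes the roulette-wheel mechanics visible.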
Deep Dive 3: Data Residency and GDPR (Week 9, Day 3)
Interviewer: "You mentioned EU customers need data in EU. How exactly does that work, and what about GDPR compliance?"
You: "Data residency and GDPR are handled through regional isolation and comprehensive data management. Let me walk through both."
Regional Data Routing
# data_residency/router.py
class RegionalDataRouter:
"""
Routes data operations to the correct region based on tenant.
EU tenant data NEVER touches US infrastructure.
"""
REGION_CONFIGS = {
"us": {
"database": "postgres://db-us.internal:5432/app",
"elasticsearch": "https://es-us.internal:9200",
"s3_bucket": "documents-us-east-1",
"redis": "redis://cache-us.internal:6379"
},
"eu": {
"database": "postgres://db-eu.internal:5432/app",
"elasticsearch": "https://es-eu.internal:9200",
"s3_bucket": "documents-eu-central-1",
"redis": "redis://cache-eu.internal:6379"
},
"apac": {
"database": "postgres://db-apac.internal:5432/app",
"elasticsearch": "https://es-apac.internal:9200",
"s3_bucket": "documents-ap-southeast-1",
"redis": "redis://cache-apac.internal:6379"
}
}
async def get_database_connection(self, tenant_id: str):
"""Get database connection for tenant's region."""
tenant = await self.tenant_service.get_tenant(tenant_id)
config = self.REGION_CONFIGS[tenant.region]
return await self.connection_pools[tenant.region].acquire()
async def get_storage_bucket(self, tenant_id: str) -> str:
"""Get S3 bucket for tenant's region."""
tenant = await self.tenant_service.get_tenant(tenant_id)
return self.REGION_CONFIGS[tenant.region]["s3_bucket"]
async def upload_document(
self,
tenant_id: str,
document_id: str,
content: bytes
) -> str:
"""Upload document to tenant's regional storage."""
tenant = await self.tenant_service.get_tenant(tenant_id)
bucket = self.REGION_CONFIGS[tenant.region]["s3_bucket"]
# Path includes tenant for isolation
key = f"tenants/{tenant_id}/documents/{document_id}"
# Encrypt with tenant's key before upload
encrypted = await self.encrypt_for_tenant(tenant_id, content)
await self.s3.put_object(
Bucket=bucket,
Key=key,
Body=encrypted,
ServerSideEncryption="aws:kms",
SSEKMSKeyId=tenant.encryption_key_id
)
return f"s3://{bucket}/{key}"
GDPR Consent Management
# gdpr/consent.py
class GDPRConsentService:
"""
Manages user consent for GDPR compliance.
"""
CONSENT_PURPOSES = [
"service_delivery", # Required for service
"analytics", # Product analytics
"marketing_email", # Marketing communications
"third_party_sharing", # Sharing with partners
]
async def record_consent(
self,
user_id: str,
tenant_id: str,
purpose: str,
granted: bool,
ip_address: str,
consent_text: str
) -> ConsentRecord:
"""
Record a consent decision.
Creates immutable audit record.
"""
record = ConsentRecord(
id=str(uuid.uuid4()),
user_id=user_id,
tenant_id=tenant_id,
purpose=purpose,
status="granted" if granted else "denied",
granted_at=datetime.utcnow() if granted else None,
ip_address=ip_address,
consent_text=consent_text,
consent_version="2024-01"
)
# Store in regional database
await self.db.execute(
"""
INSERT INTO consent_records
(id, user_id, tenant_id, purpose, status, granted_at,
ip_address, consent_text, consent_version, created_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
""",
record.id, record.user_id, record.tenant_id,
record.purpose, record.status, record.granted_at,
record.ip_address, record.consent_text,
record.consent_version, datetime.utcnow()
)
# Publish event for downstream systems
await self.events.publish("consent", {
"type": "consent.recorded",
"user_id": user_id,
"purpose": purpose,
"granted": granted
})
return record
async def check_consent(
self,
user_id: str,
tenant_id: str,
purpose: str
) -> bool:
"""Check if user has consented to a purpose."""
result = await self.db.fetchone(
"""
SELECT status FROM consent_records
WHERE user_id = $1 AND tenant_id = $2 AND purpose = $3
ORDER BY created_at DESC
LIMIT 1
""",
user_id, tenant_id, purpose
)
        return result is not None and result["status"] == "granted"
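Downstream services then gate all non-essential processing on this check. A small sketch of the consumer side; `AnalyticsTracker` and its collaborators are illustrative names:

# sketch: consent-gated analytics; class and collaborator names are assumed
class AnalyticsTracker:
    def __init__(self, consent_service, sink):
        self.consent = consent_service
        self.sink = sink

    async def track(self, user_id: str, tenant_id: str, event: dict):
        # No recorded "analytics" consent means no lawful basis: drop the event
        if not await self.consent.check_consent(user_id, tenant_id, "analytics"):
            return
        await self.sink.emit({**event, "user_id": user_id, "tenant_id": tenant_id})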
Data Export (Right to Portability)
# gdpr/export.py
class GDPRDataExporter:
"""
Exports user data for GDPR portability requests.
"""
async def export_user_data(
self,
user_id: str,
tenant_id: str
) -> DataExportResult:
"""
Export all user's personal data.
GDPR Article 20: Right to data portability
"""
export_id = str(uuid.uuid4())
# Collect data from all sources
data = {
"export_metadata": {
"export_id": export_id,
"exported_at": datetime.utcnow().isoformat(),
"user_id": user_id,
"tenant_id": tenant_id
},
"profile": await self._export_profile(user_id, tenant_id),
"documents": await self._export_documents(user_id, tenant_id),
"comments": await self._export_comments(user_id, tenant_id),
"activity_history": await self._export_activity(user_id, tenant_id),
"consent_records": await self._export_consent(user_id, tenant_id),
}
# Package as JSON
export_json = json.dumps(data, indent=2, default=str)
# Also create ZIP with actual files
zip_buffer = await self._create_export_zip(user_id, tenant_id, data)
# Upload to tenant's regional storage
bucket = await self.router.get_storage_bucket(tenant_id)
export_key = f"exports/{tenant_id}/{user_id}/{export_id}.zip"
await self.s3.put_object(
Bucket=bucket,
Key=export_key,
Body=zip_buffer.getvalue()
)
# Generate download link (expires in 7 days)
download_url = await self.s3.generate_presigned_url(
"get_object",
Params={"Bucket": bucket, "Key": export_key},
ExpiresIn=604800 # 7 days
)
return DataExportResult(
export_id=export_id,
download_url=download_url,
expires_at=datetime.utcnow() + timedelta(days=7),
size_bytes=len(zip_buffer.getvalue())
)
Deep Dive 4: Right to Deletion (Week 9, Day 4)
Interviewer: "When a user requests deletion, how do you ensure all their data is removed from all those systems?"
You: "Deletion is one of the hardest compliance requirements. I'd implement a coordinated deletion workflow with verification."
Deletion Orchestration
# gdpr/deletion.py
class UserDeletionService:
"""
Orchestrates user data deletion across all systems.
"""
# Systems in deletion order (dependencies first)
DELETION_TARGETS = [
("cache", "redis", 1), # Clear cache first
("search", "elasticsearch", 2), # Remove from search
("storage", "s3", 3), # Delete files
("analytics", "bigquery", 4), # Remove from analytics
("database", "postgresql", 10), # Primary DB last
]
async def process_deletion_request(
self,
user_id: str,
tenant_id: str,
requested_by: str
) -> DeletionRequest:
"""
Process a GDPR deletion request.
Must complete within 30 days per GDPR.
"""
request = DeletionRequest(
id=str(uuid.uuid4()),
user_id=user_id,
tenant_id=tenant_id,
requested_at=datetime.utcnow(),
requested_by=requested_by,
deadline=datetime.utcnow() + timedelta(days=30),
status="pending"
)
# Store request
await self._save_request(request)
# Execute deletion workflow
try:
await self._execute_deletion(request)
# Verify deletion
verification = await self._verify_deletion(request)
if verification.all_verified:
request.status = "completed"
request.completed_at = datetime.utcnow()
else:
request.status = "partial"
request.issues = verification.issues
except Exception as e:
request.status = "failed"
request.error = str(e)
raise
finally:
await self._save_request(request)
await self._notify_user(request)
return request
async def _execute_deletion(self, request: DeletionRequest):
"""Execute deletion across all systems."""
for system_type, system_name, priority in sorted(
self.DELETION_TARGETS, key=lambda x: x[2]
):
target = DeletionTarget(
system_name=system_name,
system_type=system_type,
status="pending"
)
try:
executor = self.executors[system_name]
result = await executor.delete_user_data(
request.user_id,
request.tenant_id
)
target.status = "completed"
target.records_deleted = result.get("records_deleted", 0)
await self.audit.log(
action="deletion_executed",
system=system_name,
user_id=request.user_id,
records_deleted=target.records_deleted
)
            except Exception as e:
                target.status = "failed"
                target.error = str(e)
                raise
            finally:
                # Record every attempted target (including failures) so that
                # verification and the audit trail see the full picture
                request.targets.append(target)
async def _verify_deletion(
self,
request: DeletionRequest
) -> VerificationResult:
"""Verify that deletion was successful."""
issues = []
for target in request.targets:
executor = self.executors[target.system_name]
still_exists = await executor.check_user_exists(
request.user_id,
request.tenant_id
)
if still_exists:
issues.append(f"Data still exists in {target.system_name}")
return VerificationResult(
all_verified=len(issues) == 0,
issues=issues
)
class PostgreSQLDeletionExecutor:
"""
Deletes user data from PostgreSQL.
"""
async def delete_user_data(
self,
user_id: str,
tenant_id: str
) -> dict:
"""Delete user and related data."""
records_deleted = 0
async with self.db.transaction():
# Delete from leaf tables first
# Comments (anonymize, keep content)
result = await self.db.execute(
"""
UPDATE comments
SET user_id = NULL, author_name = 'Deleted User'
WHERE user_id = $1 AND tenant_id = $2
""",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
# Activity logs (anonymize)
result = await self.db.execute(
"""
UPDATE activity_logs
SET user_id = 'DELETED', ip_address = 'DELETED'
WHERE user_id = $1 AND tenant_id = $2
""",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
# Documents (reassign ownership to admin or delete)
result = await self.db.execute(
"""
UPDATE documents
SET owner_id = (
SELECT id FROM users
WHERE tenant_id = $2 AND 'admin' = ANY(roles)
LIMIT 1
)
WHERE owner_id = $1 AND tenant_id = $2
""",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
# Consent records (keep anonymized for audit)
result = await self.db.execute(
"""
UPDATE consent_records
SET user_id = 'DELETED', ip_address = 'DELETED'
WHERE user_id = $1 AND tenant_id = $2
""",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
# Finally, delete user
result = await self.db.execute(
"DELETE FROM users WHERE id = $1 AND tenant_id = $2",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
return {"records_deleted": records_deleted}
Deep Dive 5: Security Architecture (Week 9, Day 5)
Interviewer: "Let's talk security. How do you protect this system, especially with multiple tenants?"
You: "Security is defense in depth with zero trust principles. Let me walk through the layers."
Security Layers
──────────────────────────────────────────────────────────────────────────
                        SECURITY ARCHITECTURE

  LAYER 1: EDGE SECURITY
    CloudFront (CDN) → WAF → Rate Limiting → DDoS Protection
    ├── OWASP Top 10 rules
    ├── Bot detection
    ├── IP reputation
    └── Geo-blocking (optional per tenant)

  LAYER 2: NETWORK SECURITY
    VPC isolation:
    ├── Public subnet: ALB only
    ├── Private subnet: App servers
    ├── Isolated subnet: Databases
    └── Security groups: Explicit allow only

  LAYER 3: APPLICATION SECURITY
    ├── Authentication (JWT + MFA)
    ├── Authorization (RBAC + tenant isolation)
    ├── Input validation (Pydantic schemas)
    ├── Output encoding
    └── CSRF protection

  LAYER 4: DATA SECURITY
    ├── Encryption in transit (TLS 1.3)
    ├── Encryption at rest (AES-256)
    ├── Per-tenant encryption keys (enterprise)
    ├── Secrets in Vault
    └── Data classification and handling

  LAYER 5: MONITORING & DETECTION
    ├── Audit logging (all access)
    ├── Anomaly detection
    ├── SIEM integration
    └── Incident response automation
──────────────────────────────────────────────────────────────────────────
Authentication and Authorization
# security/auth.py
class AuthenticationService:
"""
Multi-tenant authentication service.
"""
async def authenticate(
self,
email: str,
password: str,
tenant_id: str,
ip_address: str
) -> AuthResult:
"""
Authenticate user within their tenant.
"""
# Rate limiting per IP + tenant
if await self._is_rate_limited(ip_address, tenant_id):
raise AuthError("Too many attempts")
# Find user in tenant
user = await self.db.fetchone(
"""
SELECT id, email, password_hash, roles, mfa_enabled, status
FROM users
WHERE email = $1 AND tenant_id = $2
""",
email.lower(), tenant_id
)
if not user:
await self._record_failed_attempt(ip_address, tenant_id)
raise AuthError("Invalid credentials")
if user["status"] != "active":
raise AuthError("Account disabled")
# Verify password
if not bcrypt.checkpw(password.encode(), user["password_hash"].encode()):
await self._record_failed_attempt(ip_address, tenant_id)
raise AuthError("Invalid credentials")
# Create session
session_id = secrets.token_urlsafe(32)
result = AuthResult(
user_id=user["id"],
tenant_id=tenant_id,
roles=user["roles"],
mfa_required=user["mfa_enabled"],
session_id=session_id
)
# Audit log
await self.audit.log(
action="login_success",
user_id=user["id"],
tenant_id=tenant_id,
ip_address=ip_address
)
return result
async def create_jwt(self, auth_result: AuthResult) -> str:
"""Create JWT with tenant claims."""
signing_key = await self.secrets.get_secret("jwt/signing_key")
payload = {
"sub": auth_result.user_id,
"tenant_id": auth_result.tenant_id,
"roles": auth_result.roles,
"session_id": auth_result.session_id,
"iat": datetime.utcnow(),
"exp": datetime.utcnow() + timedelta(minutes=15)
}
return jwt.encode(payload, signing_key.value, algorithm="RS256")
class AuthorizationService:
"""
Multi-tenant authorization with RBAC.
"""
async def check_document_access(
self,
user_id: str,
tenant_id: str,
document_id: str,
required_permission: str
) -> bool:
"""
Check if user can access document.
Enforces:
1. Tenant isolation (user's tenant == document's tenant)
2. Role-based permission
3. Document-level permission
"""
# Get document
document = await self.db.fetchone(
"""
SELECT tenant_id, owner_id, permissions
FROM documents
WHERE id = $1
""",
document_id
)
if not document:
return False
# CRITICAL: Tenant isolation check
if document["tenant_id"] != tenant_id:
await self.audit.log(
action="access_denied",
reason="tenant_mismatch",
user_id=user_id,
document_id=document_id
)
return False
# Check ownership
if document["owner_id"] == user_id:
return True
# Check document permissions
permissions = document["permissions"] or {}
user_permission = permissions.get(user_id)
if user_permission:
return self._has_permission(user_permission, required_permission)
# Check folder permissions (inherited)
# ... folder permission check logic ...
return False
Comprehensive Audit Logging
# security/audit.py
class AuditService:
"""
Comprehensive audit logging for compliance.
"""
async def log(
self,
action: str,
**context
):
"""
Log an audit event.
All access, modifications, and security events are logged.
"""
        tenant = _tenant_context.get()  # may be None for system-initiated events
event = AuditEvent(
id=str(uuid.uuid4()),
timestamp=datetime.utcnow(),
tenant_id=tenant.tenant_id if tenant else context.get("tenant_id"),
action=action,
actor_id=context.get("user_id"),
actor_type=context.get("actor_type", "user"),
resource_type=context.get("resource_type"),
resource_id=context.get("resource_id"),
ip_address=context.get("ip_address"),
user_agent=context.get("user_agent"),
details=context
)
# Write to regional audit log (immutable)
await self.db.execute(
"""
INSERT INTO audit_logs
(id, timestamp, tenant_id, action, actor_id, actor_type,
resource_type, resource_id, ip_address, user_agent, details)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
""",
event.id, event.timestamp, event.tenant_id, event.action,
event.actor_id, event.actor_type, event.resource_type,
event.resource_id, event.ip_address, event.user_agent,
json.dumps(event.details)
)
# Also stream to Kafka for real-time monitoring
await self.kafka.produce(
topic="audit_events",
key=event.tenant_id,
value=event.to_dict()
)
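For the "immutable" property to actually hold, the application role should have no code path that can modify existing rows. A minimal hardening sketch in standard PostgreSQL, assuming an `app_rw` application role (the role name is an assumption):

# audit_hardening.py - sketch; 'app_rw' is an assumed application role name
HARDENING_SQL = """
    REVOKE UPDATE, DELETE, TRUNCATE ON audit_logs FROM app_rw;
    GRANT INSERT, SELECT ON audit_logs TO app_rw;
"""

async def harden_audit_table(admin_conn):
    # With only INSERT/SELECT granted, even a compromised app credential
    # cannot rewrite history. Retention is enforced by an admin role dropping
    # old time-based partitions, never by row deletes.
    await admin_conn.execute(HARDENING_SQL)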
Phase 5: Scaling and Edge Cases (5 minutes)
Interviewer: "How would this system scale if we went from 500 to 5,000 tenants?"
Scaling Strategy
You: "The architecture is designed to scale horizontally. Here's how each component scales:"
──────────────────────────────────────────────────────────────────────────
                          SCALING STRATEGY

  Component      │ Current         │ 10x Scale       │ How
  ───────────────┼─────────────────┼─────────────────┼────────────────────
  API Servers    │ 20 instances    │ 200 instances   │ Auto-scaling group
  PostgreSQL     │ 2TB per region  │ Sharded         │ By tenant_id
  Elasticsearch  │ 3TB per region  │ 30TB cluster    │ Add nodes
  S3             │ 180TB           │ 1.8PB           │ Automatic
  Redis          │ 50GB cluster    │ 500GB cluster   │ Add shards
  Workers        │ 10 instances    │ 100 instances   │ Queue-based scaling

  KEY SCALING DECISIONS:

  1. Database sharding by tenant_id
     ├── Keeps tenant data together
     ├── Enables tenant-level backup/restore
     └── Large tenants can get dedicated shards

  2. Search index per tenant (for large tenants)
     ├── Avoids hot spots
     ├── Enables tenant-specific tuning
     └── Easier to delete/migrate

  3. Add more regions as needed
     ├── Japan region for Japanese customers
     ├── Australia region for AU/NZ
     └── Each region is independent
──────────────────────────────────────────────────────────────────────────
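For the sharding decision, a directory-based router (a lookup table rather than hash-mod placement) keeps tenant moves cheap. A minimal sketch with illustrative names (`tenant_shards` table, `ShardRouter`):

# shard_router.py - sketch of directory-based shard routing; names assumed
class ShardRouter:
    def __init__(self, control_plane_db, pools_by_shard: dict):
        self.control_db = control_plane_db
        self.pools = pools_by_shard

    async def pool_for_tenant(self, tenant_id: str):
        row = await self.control_db.fetchrow(
            "SELECT shard_id FROM tenant_shards WHERE tenant_id = $1", tenant_id
        )
        # A directory lookup beats hash(tenant_id) % N here: moving one large
        # tenant to a dedicated shard is a single row update plus a data copy,
        # not a cluster-wide reshuffle
        return self.pools[row["shard_id"]]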
Edge Cases
Interviewer: "What edge cases should we handle?"
You: "Several important edge cases:"
EDGE CASES

1. TENANT OFFBOARDING
   ├── Customer cancels subscription
   ├── 30-day grace period (can reactivate)
   ├── Export data to customer
   ├── Then complete deletion
   └── Retain anonymized audit logs

2. LARGE FILE UPLOADS
   ├── Customer uploads 500MB presentation
   ├── Direct-to-S3 upload (presigned URL)
   ├── Chunked upload for resume
   ├── Async processing in background
   └── Progress tracking

3. SEARCH INDEX CORRUPTION
   ├── Elasticsearch index gets corrupted
   ├── Detection: scheduled consistency checks
   ├── Recovery: rebuild from PostgreSQL
   ├── Tenant isolated: only one tenant affected
   └── Automated healing with alerting

4. CROSS-TENANT SHARING (external links)
   ├── User shares document externally
   ├── Generate unique, expiring token
   ├── Token tied to document, not tenant context
   ├── Audit log records external access
   └── Owner can revoke anytime

5. SSO PROVIDER OUTAGE
   ├── Customer's SSO is down
   ├── Fallback to email/password
   ├── Requires pre-configured backup auth
   ├── Audit log notes SSO bypass
   └── Notify tenant admin

6. REGULATORY HOLD
   ├── Legal hold prevents deletion
   ├── Mark documents as "held"
   ├── Deletion requests queued
   ├── User notified of delay
   └── Release when hold lifted
Phase 6: Monitoring and Operations (5 minutes)
Interviewer: "How would you monitor this system in production?"
Key Metrics
──────────────────────────────────────────────────────────────────────────
                        MONITORING DASHBOARD

  BUSINESS METRICS (per tenant)
  ├── Active users                  [████████░░]  8,234 / 10,000
  ├── Documents stored              [████████░░]  4.2M / 5M
  ├── Storage used                  [███████░░░]  720GB / 1TB
  └── API calls today               [████░░░░░░]  42K / 100K

  SYSTEM HEALTH (per region)
  ├── API latency p99               180ms   (target < 200ms)
  ├── Search latency p99            320ms   (target < 500ms)
  ├── Error rate                    0.02%   (target < 0.1%)
  └── Throughput                    800 req/s

  SECURITY METRICS
  ├── Failed login attempts/hour    234
  ├── Cross-tenant access attempts  0   (should ALWAYS be 0!)
  ├── WAF blocked requests          1,234/hour
  └── MFA adoption                  62%

  COMPLIANCE METRICS
  ├── Pending deletion requests     3
  ├── Avg deletion completion time  4.2 days
  ├── Data export requests/week     12
  └── Audit log retention           7 years
──────────────────────────────────────────────────────────────────────────
Alerting Strategy
CRITICAL (PagerDuty, immediate response):
├── Cross-tenant data access detected
├── Database replication lag > 1 minute
├── Error rate > 1%
├── Any region unreachable
└── Security incident detected

WARNING (Slack, business hours):
├── Tenant approaching quota limits
├── Search latency > 400ms p99
├── Deletion request approaching deadline
├── Failed login spike for tenant
└── Certificate expiring < 30 days

INFO (dashboard only):
├── New tenant onboarded
├── Large file upload completed
├── Scheduled maintenance
└── Feature flag changed
Interview Conclusion
Interviewer: "This is a comprehensive design. You've covered multi-tenancy, compliance, and security thoroughly. Any final thoughts?"
You: "A few things I'd prioritize for implementation:
- Start with tenant isolation: get this right first; it's the foundation
- Build compliance into the architecture: retrofitting GDPR is painful
- Invest in observability early: per-tenant metrics from day one
- Plan for enterprise features: dedicated resources, custom encryption
- Security as code: infrastructure as code, security policies as code
The key insight is that multi-tenant SaaS is harder than single-tenant because every feature needs to consider isolation, fairness, and compliance from the start."
Interviewer: "Excellent. Thanks for walking through this with me."
Summary: Week 9 Concepts Applied
Concepts by Day
| Day | Topic | Application in Design |
|---|---|---|
| Day 1 | Tenant Isolation | Multi-layer isolation (app, DB RLS, storage paths), enterprise dedicated resources |
| Day 2 | Noisy Neighbor | Per-tenant quotas, rate limiting, fair scheduling, query guards |
| Day 3 | Data Residency | Regional data planes, no cross-region data flow, consent management |
| Day 4 | Right to Deletion | Deletion orchestration, verification, anonymization vs delete |
| Day 5 | Security | Defense in depth, zero trust, encryption layers, audit logging |
Code Patterns Demonstrated
1. TENANT CONTEXT MANAGEMENT
   ├── Immutable TenantContext dataclass
   ├── ContextVar for thread-safe propagation
   ├── Middleware sets context from JWT/subdomain
   └── All services use get_current_tenant()

2. REPOSITORY PATTERN WITH ISOLATION
   ├── TenantAwareRepository base class
   ├── Auto-adds tenant_id to all queries
   ├── RLS as database-level backup
   └── No raw SQL without tenant filter

3. REGIONAL DATA ROUTING
   ├── RegionalDataRouter for all data ops
   ├── Tenant config determines region
   ├── Each region has full stack
   └── Global control plane for metadata only

4. DELETION ORCHESTRATION
   ├── DeletionService coordinates
   ├── System-specific executors
   ├── Verification confirms deletion
   └── Audit trail survives deletion

5. DEFENSE IN DEPTH SECURITY
   ├── Edge → Network → App → Data layers
   ├── Each layer assumes others might fail
   ├── Zero trust between services
   └── Comprehensive audit logging
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Design multi-tenant systems with proper data isolation
- Implement Row-Level Security in PostgreSQL
- Build per-tenant rate limiting and quota systems
- Architect for data residency requirements
- Handle GDPR consent, export, and deletion
- Design defense-in-depth security architecture
- Implement comprehensive audit logging
- Scale multi-tenant systems horizontally
- Handle edge cases like tenant offboarding
- Monitor multi-tenant systems with per-tenant metrics
This capstone integrates all concepts from Week 9: Multi-Tenancy, Security, and Compliance. Use this as a template for approaching enterprise SaaS system design interviews.