Week 9 Capstone: Design a Multi-Tenant Enterprise SaaS Platform
🎯 A Complete System Design Interview Integrating Everything You've Learned
The Interview Begins
You walk into the interview room at a well-known B2B SaaS company. The interviewer, a Staff Engineer, greets you warmly.
Interviewer: "Thanks for coming in today. We're going to work through a system design problem that reflects challenges we face here. I want to see how you think through complex multi-tenant systems. Feel free to ask questions β this is collaborative."
They turn to the whiteboard and write:
──────────────────────────────────────────────────────────────────────────
  Design: Enterprise Document Management Platform

  You're building a B2B SaaS platform where companies can:

  • Store and organize documents (contracts, policies, reports)
  • Search across all their documents with full-text search
  • Collaborate on documents with comments and version history
  • Set granular permissions (who can view/edit/delete)
  • Generate audit trails for compliance
  • Export their data for compliance or migration

  Customer profile:
  • 500 enterprise customers (tenants)
  • Mix of US, EU, and APAC customers
  • Some customers in regulated industries (healthcare, finance)
  • Largest customer has 50,000 users and 10M documents
  • Smallest customers have 50 users and 10K documents

  Key concerns from sales:
  • EU customers asking about GDPR and data residency
  • Healthcare customers need HIPAA compliance
  • Enterprise customers want SSO and audit logs
  • Several deals lost because "search was too slow"
──────────────────────────────────────────────────────────────────────────
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
You take a breath and start asking questions.
You: "Before I dive into the design, I'd like to clarify some requirements. Let me start with scale and then move to compliance."
Clarifying Questions
You: "For scale β you mentioned 500 tenants with the largest having 10M documents. What's the average document size, and what's our total storage footprint?"
Interviewer: "Average document is about 500KB. Mix of PDFs, Word docs, and some larger files like presentations. Total storage across all tenants is about 50TB, growing 20% annually."
You: "What's our request volume? How many searches, uploads, downloads per day?"
Interviewer: "Peak hours see about 1,000 requests per second across all tenants. Searches are the most common operation β maybe 60% of traffic. Uploads are bursty, especially end of quarter when companies finalize contracts."
You: "For the EU data residency requirement β do EU customers need all their data in the EU, or just personal data?"
Interviewer: "Great question. Our legal team says all document content and metadata for EU customers must stay in EU. Some operational data like aggregate metrics can be global."
You: "For the regulated industries β are we targeting HIPAA certification, or just 'HIPAA-ready' architecture?"
Interviewer: "We want the architecture to support HIPAA. Full certification is a business decision, but the technical foundation must be there. Same for SOC 2."
You: "One more β for the largest customers, do they get dedicated infrastructure, or is everyone on shared infrastructure with isolation?"
Interviewer: "We want to offer both. Most customers on shared infrastructure with strong isolation. Enterprise tier customers can opt for dedicated resources at premium pricing."
You: "Perfect. Let me summarize the requirements."
Functional Requirements
1. DOCUMENT MANAGEMENT
   ├── Upload documents (PDF, Word, Excel, images)
   ├── Organize in folders/hierarchies
   ├── Version history with rollback
   ├── Preview and download
   └── Bulk operations (move, delete, export)

2. SEARCH
   ├── Full-text search across document content
   ├── Metadata search (author, date, tags)
   ├── Filters and facets
   ├── Search within folders
   └── < 500ms p99 latency for search

3. COLLABORATION
   ├── Comments on documents
   ├── @mentions and notifications
   ├── Sharing with granular permissions
   ├── Activity feed per document
   └── Real-time presence (who's viewing)

4. ACCESS CONTROL
   ├── Role-based permissions (viewer, editor, admin)
   ├── Folder-level and document-level permissions
   ├── External sharing with expiring links
   ├── SSO integration (SAML, OIDC)
   └── MFA support

5. COMPLIANCE
   ├── Complete audit trail (who did what, when)
   ├── Data export (GDPR portability)
   ├── Data deletion (GDPR right to erasure)
   ├── Retention policies
   └── Legal hold capability
Non-Functional Requirements
1. SCALE
   ├── 500 tenants, growing to 2,000
   ├── 50TB storage, growing 20% annually
   ├── 1,000 requests/second peak
   ├── Largest tenant: 50K users, 10M documents
   └── Support 100K concurrent users globally

2. LATENCY
   ├── Search: < 500ms p99
   ├── Document preview: < 2s p99
   ├── Upload (10MB): < 5s p99
   └── API calls: < 200ms p99

3. AVAILABILITY
   ├── 99.9% uptime SLA
   ├── No single point of failure
   ├── Graceful degradation
   └── < 4 hours RTO, < 1 hour RPO

4. COMPLIANCE
   ├── EU data residency for EU customers
   ├── HIPAA-ready architecture
   ├── SOC 2 Type II controls
   └── GDPR compliance (consent, deletion, portability)

5. SECURITY
   ├── Encryption at rest and in transit
   ├── Tenant isolation (data and resources)
   ├── Zero trust architecture
   └── Regular security audits
Phase 2: Back-of-the-Envelope Estimation (5 minutes)
You: "Let me work through the numbers to validate our architecture decisions."
Storage Estimation
DOCUMENT STORAGE

Current state:
  Total documents: ~100M (across all tenants)
  Average document size: 500KB
  Total storage: 100M × 500KB = 50TB
  With metadata overhead: 50TB × 1.2 = 60TB
  With 3x replication: 60TB × 3 = 180TB raw storage

Growth projection (20% annually):
  Year 1: 60TB
  Year 2: 72TB
  Year 3: 86TB
  Year 5: ~124TB

Per-tenant storage:
  Average tenant: 100GB (200K documents)
  Largest tenant: 5TB (10M documents)
  Smallest tenant: 5GB (10K documents)
Traffic Estimation
REQUEST TRAFFIC

Peak requests: 1,000/second
Daily requests: ~50M (assuming 8-hour peak)

Breakdown by operation:
  ├── Search: 600/sec (60%)
  ├── Read/download: 250/sec (25%)
  ├── Write/upload: 100/sec (10%)
  └── Other (permissions, etc.): 50/sec (5%)

Search index size:
  Average document text: 10KB extracted text
  Total index size: 100M × 10KB = 1TB
  With Elasticsearch overhead: ~3TB
Infrastructure Estimation
──────────────────────────────────────────────────────────────────────────
                        INFRASTRUCTURE ESTIMATE

  COMPUTE
  ├── API servers: 20 instances (m5.xlarge)
  ├── Search cluster: 9 nodes (r5.2xlarge), 3 per region
  ├── Background workers: 10 instances (m5.large)
  └── Total: ~40 instances

  STORAGE
  ├── PostgreSQL: Multi-AZ, 2TB per region
  ├── Elasticsearch: 3TB per region
  ├── S3: 180TB (with replication)
  └── Redis: 50GB cluster

  REGIONS
  ├── US (us-east-1): Primary for US customers
  ├── EU (eu-central-1): Primary for EU customers
  └── APAC (ap-southeast-1): Primary for APAC customers
──────────────────────────────────────────────────────────────────────────
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture. Given our multi-region, multi-tenant requirements, I'll design for regional data isolation with a global control plane."
System Architecture
──────────────────────────────────────────────────────────────────────────
                        HIGH-LEVEL ARCHITECTURE

                      ┌──────────────────────┐
                      │ Global Control Plane │
                      │ ├── Tenant Config    │
                      │ ├── Routing Rules    │
                      │ └── Feature Flags    │
                      └──────────┬───────────┘
                ┌────────────────┼────────────────┐
                ▼                ▼                ▼
          ┌───────────┐    ┌───────────┐    ┌─────────────┐
          │ US Region │    │ EU Region │    │ APAC Region │
          └───────────┘    └───────────┘    └─────────────┘

  Each region runs an identical, fully isolated stack:

      CDN → WAF → ALB → API Cluster
                           ├── PostgreSQL (metadata)
                           ├── Elasticsearch (search index)
                           └── S3 bucket (document storage)

      US DATA ONLY    │    EU DATA ONLY    │    APAC DATA ONLY

             NO CROSS-REGION DATA FLOW FOR TENANT DATA
──────────────────────────────────────────────────────────────────────────
Component Breakdown
You: "Let me walk through each major component and the key design decisions."
1. Global Control Plane
Purpose: Manage tenant configuration and routing without storing personal data.
Global Control Plane contains:
  ├── Tenant registry (which region each tenant is in)
  ├── Feature flags (which features enabled per tenant)
  ├── Plan/quota configuration
  ├── Routing rules
  └── Global admin interface

Does NOT contain:
  ├── User data
  ├── Document content
  ├── Any personal information
  └── Audit logs (those stay regional)
Technology: Single PostgreSQL with read replicas in each region for low-latency lookups.
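A minimal sketch of what that registry lookup might look like, assuming an asyncpg-style pool against the local read replica and a short in-process cache; the `tenants` table and column names are illustrative, not the platform's actual schema:

# tenant_registry.py - illustrative sketch; table/column names are assumptions
import time
from typing import Optional

class TenantRegistry:
    """Resolves a tenant slug to its routing record via the regional read replica."""

    def __init__(self, replica_pool, cache_ttl_seconds: int = 60):
        self.db = replica_pool
        self.ttl = cache_ttl_seconds
        self._cache: dict[str, tuple[float, Optional[dict]]] = {}

    async def get_tenant(self, slug: str) -> Optional[dict]:
        cached = self._cache.get(slug)
        if cached and time.monotonic() - cached[0] < self.ttl:
            return cached[1]
        row = await self.db.fetchrow(
            "SELECT id, slug, region, plan, isolation_level, encryption_key_id "
            "FROM tenants WHERE slug = $1 AND status = 'active'",
            slug,
        )
        record = dict(row) if row else None
        self._cache[slug] = (time.monotonic(), record)
        return record

The short cache TTL keeps routing lookups off the hot path while still letting control-plane changes (plan upgrades, region moves) propagate within a minute.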
2. Regional Data Plane
Purpose: Store and process all tenant data within the region.
Each region has independent:
  ├── API cluster (stateless, horizontally scaled)
  ├── PostgreSQL (users, document metadata, permissions)
  ├── Elasticsearch (search index)
  ├── S3 (document storage)
  ├── Redis (cache, sessions, rate limiting)
  └── Kafka (event streaming, audit logs)
Key Decision: Complete data isolation. EU tenant data never leaves EU region.
3. Document Processing Pipeline
Document Upload Flow:

User uploads document
        │
        ▼
┌───────────────────┐
│    API Gateway    │  ← validates, rate limits
└─────────┬─────────┘
          ▼
┌───────────────────┐
│  Upload Service   │  ← generates presigned S3 URL
└─────────┬─────────┘
    ┌─────┴──────────────┐
    ▼                    ▼
┌────────────────┐  ┌────────────────┐
│ S3 (encrypted) │  │   PostgreSQL   │
│ store document │  │ store metadata │
└───────┬────────┘  └────────────────┘
        ▼
┌───────────────────┐
│ Processing Queue  │  ← async processing
└─────────┬─────────┘
     ┌────┴─────┐
     ▼          ▼
┌─────────┐ ┌───────────┐
│  Text   │ │ Thumbnail │
│ extract │ │ generate  │
└────┬────┘ └───────────┘
     ▼
┌───────────────────┐
│  Elasticsearch    │  ← index for search
│  update index     │
└───────────────────┘
4. Search Architecture
Search Flow:

User searches "contract 2024"
        │
        ▼
┌───────────────────┐
│    API Gateway    │
└─────────┬─────────┘
          ▼
┌───────────────────────────────────────┐
│             Search Service            │
│  ├── add tenant_id filter (CRITICAL!) │
│  ├── add permission filter            │
│  └── build ES query                   │
└─────────┬─────────────────────────────┘
          ▼
┌───────────────────────────────────────┐
│             Elasticsearch             │
│  ├── tenant-specific index, OR        │
│  └── filtered query with tenant_id    │
└─────────┬─────────────────────────────┘
          ▼
┌───────────────────┐
│   Results with    │
│ permission check  │
└───────────────────┘
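A sketch of how the search service might inject those filters, assuming an elasticsearch-py-style async client; index names, field names, and the `tenant_scoped_search` helper are illustrative:

# search_service.py - sketch of tenant filter injection; names are assumptions
async def tenant_scoped_search(
    es, tenant_id: str, plan: str, user_query: dict, accessible_doc_ids: list[str]
):
    # Enterprise tenants get a dedicated index; the tenant_id filter is applied
    # regardless, so even an index-routing bug cannot widen the result set
    index = f"documents_{tenant_id}" if plan == "enterprise" else "documents_shared"
    return await es.search(
        index=index,
        body={
            "query": {
                "bool": {
                    "must": [user_query],
                    "filter": [
                        {"term": {"tenant_id": tenant_id}},      # NEVER omitted
                        {"terms": {"_id": accessible_doc_ids}},  # permission pre-filter
                    ],
                }
            }
        },
    )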
Phase 4: Deep Dives (20 minutes)
Interviewer: "Great high-level design. Let's dive deeper into some areas. First, tell me about tenant isolation. How do you ensure one tenant can never see another's data?"
Deep Dive 1: Tenant Isolation (Week 9, Day 1)
You: "Tenant isolation is the most critical aspect of this system. I'll implement defense in depth with multiple isolation layers."
The Problem
WITHOUT PROPER ISOLATION:

Tenant A searches for "confidential"
        │
        ▼
SELECT * FROM documents
WHERE content LIKE '%confidential%'
        │
        ▼
Returns documents from ALL tenants,
including Tenant B's confidential contracts.
This is a catastrophic data breach.
Multi-Layer Isolation Strategy
──────────────────────────────────────────────────────────────────────────
                       TENANT ISOLATION LAYERS

  LAYER 1: REQUEST ROUTING
  ├── Tenant determined from subdomain (acme.docplatform.com)
  ├── Or from JWT token claims
  ├── Set in request context at API gateway
  └── CANNOT be overridden by request parameters

  LAYER 2: APPLICATION ENFORCEMENT
  ├── TenantContext set for every request
  ├── Repository layer auto-filters by tenant_id
  ├── Service layer validates tenant ownership
  └── Audit log records tenant context

  LAYER 3: DATABASE ENFORCEMENT (RLS)
  ├── Row Level Security policies on all tables
  ├── Database session has current_tenant set
  ├── Even raw SQL queries are filtered
  └── DBA cannot accidentally see cross-tenant data

  LAYER 4: SEARCH INDEX ISOLATION
  ├── Option A: Index per tenant (strong isolation)
  ├── Option B: Shared index with tenant_id filter
  ├── We use Option A for enterprise, B for standard
  └── Search service enforces tenant filter

  LAYER 5: STORAGE ISOLATION
  ├── S3 path includes tenant_id
  ├── IAM policies restrict cross-tenant access
  ├── Presigned URLs scoped to tenant prefix
  └── Encryption keys per tenant (enterprise)
──────────────────────────────────────────────────────────────────────────
Implementation
# tenant_isolation.py - Core tenant context management
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional
from enum import Enum

from starlette.responses import JSONResponse
# NOTE: decode_jwt (used below) is assumed to be a project helper that
# verifies the token signature and returns its claims
# Thread-safe tenant context
_tenant_context: ContextVar[Optional['TenantContext']] = ContextVar(
'tenant_context', default=None
)
class IsolationLevel(Enum):
"""Tenant isolation levels by plan."""
SHARED = "shared" # Shared tables, RLS
SCHEMA = "schema" # Schema per tenant
DATABASE = "database" # Database per tenant
DEDICATED = "dedicated" # Dedicated infrastructure
@dataclass(frozen=True)
class TenantContext:
"""Immutable tenant context for request processing."""
tenant_id: str
tenant_name: str
region: str
plan: str
isolation_level: IsolationLevel
encryption_key_id: str
features: frozenset
class TenantContextError(RuntimeError):
    """Raised when code runs outside an established tenant context."""

def get_current_tenant() -> TenantContext:
    """Get current tenant context. Raises if not set."""
    ctx = _tenant_context.get()
    if ctx is None:
        raise TenantContextError("No tenant context set")
    return ctx
class TenantMiddleware:
"""
Middleware that establishes tenant context for every request.
Tenant is determined from:
    1. Subdomain (acme.docplatform.com → acme)
2. JWT token claims
3. API key lookup
NEVER from request body or query parameters.
"""
async def __call__(self, request, call_next):
# Extract tenant from subdomain
host = request.headers.get("host", "")
subdomain = host.split(".")[0]
# Or from JWT
if not subdomain or subdomain in ["www", "api"]:
token = request.headers.get("authorization", "").replace("Bearer ", "")
claims = decode_jwt(token)
subdomain = claims.get("tenant_id")
if not subdomain:
return JSONResponse({"error": "Tenant not identified"}, status_code=400)
# Load tenant configuration
tenant_config = await self.tenant_service.get_tenant(subdomain)
if not tenant_config:
return JSONResponse({"error": "Tenant not found"}, status_code=404)
# Set immutable context
context = TenantContext(
tenant_id=tenant_config.id,
tenant_name=tenant_config.name,
region=tenant_config.region,
plan=tenant_config.plan,
isolation_level=IsolationLevel(tenant_config.isolation_level),
encryption_key_id=tenant_config.encryption_key_id,
features=frozenset(tenant_config.features)
)
# Set context for this request
token = _tenant_context.set(context)
try:
response = await call_next(request)
return response
finally:
_tenant_context.reset(token)
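To make the wiring concrete, here is a minimal sketch of how this middleware might be mounted, assuming a FastAPI/Starlette application; `TenantService` and `db_pool` are illustrative stand-ins, not names from the original design:

# app_wiring.py - illustrative sketch; TenantService and db_pool are assumed
from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware

app = FastAPI()
app.add_middleware(
    BaseHTTPMiddleware,
    dispatch=TenantMiddleware(tenant_service=TenantService()),
)

@app.get("/documents")
async def list_documents():
    # Tenant context is already established; no tenant_id parameter is accepted
    repo = TenantAwareRepository(db_pool)
    return await repo.find_all("documents")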
class TenantAwareRepository:
"""
Repository base class that enforces tenant isolation.
ALL database access goes through this class.
"""
def __init__(self, db_pool):
self.db = db_pool
    async def find_by_id(self, table: str, id: str):
        """Find record by ID within current tenant."""
        tenant = get_current_tenant()
        # ALWAYS filter by tenant_id. `table` is an internal constant, never
        # user input, so the f-string interpolation is safe here.
        return await self.db.fetchone(
            f"SELECT * FROM {table} WHERE id = $1 AND tenant_id = $2",
            id, tenant.tenant_id
        )
async def find_all(self, table: str, filters: dict = None):
"""Find all records within current tenant."""
tenant = get_current_tenant()
query = f"SELECT * FROM {table} WHERE tenant_id = $1"
params = [tenant.tenant_id]
if filters:
for i, (key, value) in enumerate(filters.items(), start=2):
query += f" AND {key} = ${i}"
params.append(value)
return await self.db.fetch(query, *params)
async def create(self, table: str, data: dict):
"""Create record with automatic tenant_id."""
tenant = get_current_tenant()
# Force tenant_id - cannot be overridden
data["tenant_id"] = tenant.tenant_id
columns = ", ".join(data.keys())
placeholders = ", ".join(f"${i+1}" for i in range(len(data)))
return await self.db.fetchone(
f"INSERT INTO {table} ({columns}) VALUES ({placeholders}) RETURNING *",
*data.values()
)
Database Row-Level Security
-- PostgreSQL Row-Level Security Setup
-- Enable RLS on all tenant tables
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
ALTER TABLE folders ENABLE ROW LEVEL SECURITY;
ALTER TABLE comments ENABLE ROW LEVEL SECURITY;
ALTER TABLE audit_logs ENABLE ROW LEVEL SECURITY;
-- Create policy that filters by current tenant
CREATE POLICY tenant_isolation_policy ON documents
FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
CREATE POLICY tenant_isolation_policy ON users
FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
-- Similar for all tables...
-- Application sets tenant before any query
-- SET app.current_tenant_id = 'tenant-uuid-here';
-- Now even this query is filtered:
-- SELECT * FROM documents;
-- Only returns current tenant's documents!
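Tying the two layers together: the application has to bind the tenant to the database session before any query runs. A minimal sketch, assuming asyncpg and the `app.current_tenant_id` setting used by the policies above:

# rls_session.py - sketch: bind RLS to the current tenant per transaction
from contextlib import asynccontextmanager
from tenant_isolation import get_current_tenant

@asynccontextmanager
async def tenant_transaction(pool):
    """Acquire a connection whose RLS context is the current tenant."""
    tenant = get_current_tenant()
    async with pool.acquire() as conn:
        async with conn.transaction():
            # set_config(..., is_local=true) is equivalent to SET LOCAL: the
            # setting dies with the transaction, so pooled connections never
            # leak tenant context between requests
            await conn.execute(
                "SELECT set_config('app.current_tenant_id', $1, true)",
                tenant.tenant_id,
            )
            # NOTE: the application role must be a non-superuser without
            # BYPASSRLS, or the policies are silently skipped
            yield conn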
Interviewer: "What about the large enterprise tenant with 10M documents? Don't they need stronger isolation?"
You: "Absolutely. For enterprise customers, we offer dedicated isolation."
Enterprise Tier Isolation
# enterprise_isolation.py
class EnterpriseTenantManager:
"""
Manages dedicated resources for enterprise tenants.
"""
async def provision_enterprise_tenant(
self,
tenant_id: str,
config: EnterpriseConfig
):
"""
Provision dedicated resources for enterprise tenant.
Options:
- Dedicated database
- Dedicated Elasticsearch index
- Dedicated encryption keys
- Dedicated compute (optional)
"""
# 1. Create dedicated database
if config.dedicated_database:
await self._create_tenant_database(tenant_id)
# 2. Create dedicated search index
await self._create_tenant_search_index(tenant_id)
# 3. Create tenant-specific KMS key
key_id = await self._create_tenant_kms_key(tenant_id)
# 4. Update tenant config
await self.tenant_service.update_tenant(
tenant_id,
isolation_level=IsolationLevel.DATABASE,
encryption_key_id=key_id,
            database_name=f"tenant_{tenant_id.replace('-', '_')}",
search_index=f"documents_{tenant_id}"
)
async def _create_tenant_database(self, tenant_id: str):
"""Create dedicated PostgreSQL database for tenant."""
safe_name = tenant_id.replace("-", "_")
await self.admin_db.execute(f"""
CREATE DATABASE tenant_{safe_name};
""")
# Run migrations on new database
await self._run_migrations(f"tenant_{safe_name}")
async def _create_tenant_kms_key(self, tenant_id: str) -> str:
"""Create dedicated KMS key for tenant encryption."""
response = await self.kms.create_key(
Description=f"Encryption key for tenant {tenant_id}",
KeyUsage="ENCRYPT_DECRYPT",
Tags=[
{"TagKey": "tenant_id", "TagValue": tenant_id},
{"TagKey": "purpose", "TagValue": "document_encryption"}
]
)
return response["KeyMetadata"]["KeyId"]
Deep Dive 2: Noisy Neighbor Prevention (Week 9, Day 2)
Interviewer: "You mentioned the largest tenant has 10M documents. How do you prevent them from impacting smaller tenants?"
You: "This is the noisy neighbor problem. I'd implement quotas and fair scheduling at multiple levels."
Resource Quota System
──────────────────────────────────────────────────────────────────────────
                         QUOTA SYSTEM BY PLAN

  Resource               │ Standard  │ Professional │ Enterprise
  ───────────────────────┼───────────┼──────────────┼────────────
  API calls/minute       │ 1,000     │ 10,000       │ 100,000
  Search calls/minute    │ 100       │ 1,000        │ 10,000
  Storage (GB)           │ 100       │ 1,000        │ 10,000
  Documents              │ 100,000   │ 1,000,000    │ Unlimited*
  Users                  │ 100       │ 1,000        │ Unlimited
  Concurrent uploads     │ 5         │ 20           │ 100
  Max file size (MB)     │ 50        │ 200          │ 500
  Search query timeout   │ 10s       │ 30s          │ 60s
  Export rate (docs/hr)  │ 10,000    │ 100,000      │ 1,000,000

  * "Unlimited" = 100M with fair use policy
──────────────────────────────────────────────────────────────────────────
Implementation
# noisy_neighbor/rate_limiter.py
class TenantResourceManager:
"""
Manages and enforces resource quotas per tenant.
"""
def __init__(self, redis, quota_config):
self.redis = redis
self.quotas = quota_config
async def check_rate_limit(
self,
tenant_id: str,
resource: str,
cost: int = 1
) -> RateLimitResult:
"""
Check if request is within rate limits.
        Uses a fixed-window counter with per-tenant keys (atomic via Lua).
"""
tenant = await self.get_tenant_config(tenant_id)
limit = self.quotas[tenant.plan][resource]
key = f"ratelimit:{tenant_id}:{resource}"
# Lua script for atomic check-and-decrement
script = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local cost = tonumber(ARGV[2])
local window = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
-- Get current count
local current = tonumber(redis.call('GET', key) or '0')
if current + cost <= limit then
redis.call('INCRBY', key, cost)
redis.call('EXPIRE', key, window)
return {1, limit - current - cost}
else
local ttl = redis.call('TTL', key)
return {0, ttl}
end
"""
result = await self.redis.eval(
script,
keys=[key],
args=[limit, cost, 60, time.time()]
)
allowed = result[0] == 1
if not allowed:
await self._record_throttle(tenant_id, resource)
return RateLimitResult(
allowed=allowed,
remaining=result[1] if allowed else 0,
retry_after=result[1] if not allowed else None
)
async def check_concurrent_limit(
self,
tenant_id: str,
operation: str
) -> bool:
"""
Check concurrent operation limit.
Prevents one tenant from using all worker capacity.
"""
tenant = await self.get_tenant_config(tenant_id)
limit = self.quotas[tenant.plan][f"concurrent_{operation}"]
key = f"concurrent:{tenant_id}:{operation}"
current = await self.redis.get(key) or 0
return int(current) < limit
async def acquire_concurrent_slot(
self,
tenant_id: str,
operation: str,
operation_id: str,
ttl: int = 300
) -> bool:
"""Acquire a concurrent operation slot."""
if not await self.check_concurrent_limit(tenant_id, operation):
return False
key = f"concurrent:{tenant_id}:{operation}"
await self.redis.sadd(key, operation_id)
await self.redis.expire(key, ttl)
return True
async def release_concurrent_slot(
self,
tenant_id: str,
operation: str,
operation_id: str
):
"""Release a concurrent operation slot."""
key = f"concurrent:{tenant_id}:{operation}"
await self.redis.srem(key, operation_id)
class SearchQueryGuard:
"""
Guards search queries against noisy neighbors.
"""
async def guard_search(
self,
tenant_id: str,
query: SearchQuery
) -> tuple[bool, Optional[int]]:
"""
Check if search query should be allowed.
Returns (allowed, timeout_seconds).
"""
tenant = await self.tenant_service.get_tenant(tenant_id)
# Get tenant's search timeout
timeout = self.quotas[tenant.plan]["search_timeout"]
# Estimate query cost
cost = self._estimate_query_cost(query)
if cost > 100: # High cost query
# Check if tenant can run expensive queries
if tenant.plan == "standard":
return False, None
# Use longer timeout for expensive queries
timeout = min(timeout * 2, 120)
return True, timeout
def _estimate_query_cost(self, query: SearchQuery) -> int:
"""Estimate query cost based on complexity."""
cost = 1
# Wildcard queries are expensive
if "*" in query.text or "?" in query.text:
cost += 10
# Regex queries are very expensive
if query.use_regex:
cost += 50
# Large result sets
if query.limit > 100:
cost += query.limit // 100
return cost
Fair Scheduling for Background Jobs
# noisy_neighbor/fair_scheduler.py
class TenantFairQueue:
"""
Fair queue that prevents one tenant from monopolizing workers.
Uses weighted fair queuing based on tenant plan.
"""
PLAN_WEIGHTS = {
"standard": 1,
"professional": 3,
"enterprise": 10
}
async def enqueue(
self,
tenant_id: str,
job: Job
):
"""Add job to tenant's queue."""
tenant = await self.tenant_service.get_tenant(tenant_id)
# Check queue depth limit
current_depth = await self.get_queue_depth(tenant_id)
max_depth = self.quotas[tenant.plan]["max_queue_depth"]
if current_depth >= max_depth:
raise QueueFullError(
f"Queue full for tenant {tenant_id}. "
f"Current: {current_depth}, Max: {max_depth}"
)
# Add to tenant's queue
await self.redis.lpush(
f"queue:{tenant_id}",
job.serialize()
)
async def dequeue(self) -> Optional[tuple[str, Job]]:
"""
Dequeue next job using weighted fair scheduling.
Higher-weight tenants get proportionally more slots.
"""
# Get all tenant queues with pending jobs
tenant_queues = await self._get_active_queues()
if not tenant_queues:
return None
# Calculate weighted selection
weighted_tenants = []
for tenant_id, depth in tenant_queues.items():
tenant = await self.tenant_service.get_tenant(tenant_id)
weight = self.PLAN_WEIGHTS.get(tenant.plan, 1)
# Weight decreases as queue depth increases (fairness)
adjusted_weight = weight / (1 + depth * 0.1)
weighted_tenants.append((tenant_id, adjusted_weight))
# Select tenant based on weights
selected_tenant = self._weighted_random_choice(weighted_tenants)
# Pop job from selected tenant's queue
job_data = await self.redis.rpop(f"queue:{selected_tenant}")
if job_data:
return selected_tenant, Job.deserialize(job_data)
return None
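The `_weighted_random_choice` helper referenced above can be a straightforward roulette-wheel selection; a minimal sketch:

# sketch of the _weighted_random_choice helper used by dequeue()
import random

def _weighted_random_choice(self, weighted: list[tuple[str, float]]) -> str:
    """Pick a tenant_id with probability proportional to its adjusted weight."""
    total = sum(weight for _, weight in weighted)
    pick = random.uniform(0, total)
    cumulative = 0.0
    for tenant_id, weight in weighted:
        cumulative += weight
        if pick <= cumulative:
            return tenant_id
    return weighted[-1][0]  # guard against floating-point rounding

The standard library's random.choices with a weights argument would do the same in one call; the explicit loop just makes the roulette-wheel mechanics visible.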
Deep Dive 3: Data Residency and GDPR (Week 9, Day 3)
Interviewer: "You mentioned EU customers need data in EU. How exactly does that work, and what about GDPR compliance?"
You: "Data residency and GDPR are handled through regional isolation and comprehensive data management. Let me walk through both."
Regional Data Routing
# data_residency/router.py
class RegionalDataRouter:
"""
Routes data operations to the correct region based on tenant.
EU tenant data NEVER touches US infrastructure.
"""
REGION_CONFIGS = {
"us": {
"database": "postgres://db-us.internal:5432/app",
"elasticsearch": "https://es-us.internal:9200",
"s3_bucket": "documents-us-east-1",
"redis": "redis://cache-us.internal:6379"
},
"eu": {
"database": "postgres://db-eu.internal:5432/app",
"elasticsearch": "https://es-eu.internal:9200",
"s3_bucket": "documents-eu-central-1",
"redis": "redis://cache-eu.internal:6379"
},
"apac": {
"database": "postgres://db-apac.internal:5432/app",
"elasticsearch": "https://es-apac.internal:9200",
"s3_bucket": "documents-ap-southeast-1",
"redis": "redis://cache-apac.internal:6379"
}
}
async def get_database_connection(self, tenant_id: str):
"""Get database connection for tenant's region."""
tenant = await self.tenant_service.get_tenant(tenant_id)
config = self.REGION_CONFIGS[tenant.region]
return await self.connection_pools[tenant.region].acquire()
async def get_storage_bucket(self, tenant_id: str) -> str:
"""Get S3 bucket for tenant's region."""
tenant = await self.tenant_service.get_tenant(tenant_id)
return self.REGION_CONFIGS[tenant.region]["s3_bucket"]
async def upload_document(
self,
tenant_id: str,
document_id: str,
content: bytes
) -> str:
"""Upload document to tenant's regional storage."""
tenant = await self.tenant_service.get_tenant(tenant_id)
bucket = self.REGION_CONFIGS[tenant.region]["s3_bucket"]
# Path includes tenant for isolation
key = f"tenants/{tenant_id}/documents/{document_id}"
# Encrypt with tenant's key before upload
encrypted = await self.encrypt_for_tenant(tenant_id, content)
await self.s3.put_object(
Bucket=bucket,
Key=key,
Body=encrypted,
ServerSideEncryption="aws:kms",
SSEKMSKeyId=tenant.encryption_key_id
)
return f"s3://{bucket}/{key}"
GDPR Consent Management
# gdpr/consent.py
class GDPRConsentService:
"""
Manages user consent for GDPR compliance.
"""
CONSENT_PURPOSES = [
"service_delivery", # Required for service
"analytics", # Product analytics
"marketing_email", # Marketing communications
"third_party_sharing", # Sharing with partners
]
async def record_consent(
self,
user_id: str,
tenant_id: str,
purpose: str,
granted: bool,
ip_address: str,
consent_text: str
) -> ConsentRecord:
"""
Record a consent decision.
Creates immutable audit record.
"""
record = ConsentRecord(
id=str(uuid.uuid4()),
user_id=user_id,
tenant_id=tenant_id,
purpose=purpose,
status="granted" if granted else "denied",
granted_at=datetime.utcnow() if granted else None,
ip_address=ip_address,
consent_text=consent_text,
consent_version="2024-01"
)
# Store in regional database
await self.db.execute(
"""
INSERT INTO consent_records
(id, user_id, tenant_id, purpose, status, granted_at,
ip_address, consent_text, consent_version, created_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
""",
record.id, record.user_id, record.tenant_id,
record.purpose, record.status, record.granted_at,
record.ip_address, record.consent_text,
record.consent_version, datetime.utcnow()
)
# Publish event for downstream systems
await self.events.publish("consent", {
"type": "consent.recorded",
"user_id": user_id,
"purpose": purpose,
"granted": granted
})
return record
async def check_consent(
self,
user_id: str,
tenant_id: str,
purpose: str
) -> bool:
"""Check if user has consented to a purpose."""
result = await self.db.fetchone(
"""
SELECT status FROM consent_records
WHERE user_id = $1 AND tenant_id = $2 AND purpose = $3
ORDER BY created_at DESC
LIMIT 1
""",
user_id, tenant_id, purpose
)
        return result is not None and result["status"] == "granted"
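Downstream services then gate all non-essential processing on this check. A small sketch of the consumer side; `AnalyticsTracker` and its collaborators are illustrative names:

# sketch: consent-gated analytics; class and collaborator names are assumed
class AnalyticsTracker:
    def __init__(self, consent_service, sink):
        self.consent = consent_service
        self.sink = sink

    async def track(self, user_id: str, tenant_id: str, event: dict):
        # No recorded "analytics" consent means no lawful basis: drop the event
        if not await self.consent.check_consent(user_id, tenant_id, "analytics"):
            return
        await self.sink.emit({**event, "user_id": user_id, "tenant_id": tenant_id})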
Data Export (Right to Portability)
# gdpr/export.py
class GDPRDataExporter:
"""
Exports user data for GDPR portability requests.
"""
async def export_user_data(
self,
user_id: str,
tenant_id: str
) -> DataExportResult:
"""
Export all user's personal data.
GDPR Article 20: Right to data portability
"""
export_id = str(uuid.uuid4())
# Collect data from all sources
data = {
"export_metadata": {
"export_id": export_id,
"exported_at": datetime.utcnow().isoformat(),
"user_id": user_id,
"tenant_id": tenant_id
},
"profile": await self._export_profile(user_id, tenant_id),
"documents": await self._export_documents(user_id, tenant_id),
"comments": await self._export_comments(user_id, tenant_id),
"activity_history": await self._export_activity(user_id, tenant_id),
"consent_records": await self._export_consent(user_id, tenant_id),
}
# Package as JSON
export_json = json.dumps(data, indent=2, default=str)
# Also create ZIP with actual files
zip_buffer = await self._create_export_zip(user_id, tenant_id, data)
# Upload to tenant's regional storage
bucket = await self.router.get_storage_bucket(tenant_id)
export_key = f"exports/{tenant_id}/{user_id}/{export_id}.zip"
await self.s3.put_object(
Bucket=bucket,
Key=export_key,
Body=zip_buffer.getvalue()
)
# Generate download link (expires in 7 days)
download_url = await self.s3.generate_presigned_url(
"get_object",
Params={"Bucket": bucket, "Key": export_key},
ExpiresIn=604800 # 7 days
)
return DataExportResult(
export_id=export_id,
download_url=download_url,
expires_at=datetime.utcnow() + timedelta(days=7),
size_bytes=len(zip_buffer.getvalue())
)
Deep Dive 4: Right to Deletion (Week 9, Day 4)
Interviewer: "When a user requests deletion, how do you ensure all their data is removed from all those systems?"
You: "Deletion is one of the hardest compliance requirements. I'd implement a coordinated deletion workflow with verification."
Deletion Orchestration
# gdpr/deletion.py
class UserDeletionService:
"""
Orchestrates user data deletion across all systems.
"""
# Systems in deletion order (dependencies first)
DELETION_TARGETS = [
("cache", "redis", 1), # Clear cache first
("search", "elasticsearch", 2), # Remove from search
("storage", "s3", 3), # Delete files
("analytics", "bigquery", 4), # Remove from analytics
("database", "postgresql", 10), # Primary DB last
]
async def process_deletion_request(
self,
user_id: str,
tenant_id: str,
requested_by: str
) -> DeletionRequest:
"""
Process a GDPR deletion request.
Must complete within 30 days per GDPR.
"""
request = DeletionRequest(
id=str(uuid.uuid4()),
user_id=user_id,
tenant_id=tenant_id,
requested_at=datetime.utcnow(),
requested_by=requested_by,
deadline=datetime.utcnow() + timedelta(days=30),
status="pending"
)
# Store request
await self._save_request(request)
# Execute deletion workflow
try:
await self._execute_deletion(request)
# Verify deletion
verification = await self._verify_deletion(request)
if verification.all_verified:
request.status = "completed"
request.completed_at = datetime.utcnow()
else:
request.status = "partial"
request.issues = verification.issues
except Exception as e:
request.status = "failed"
request.error = str(e)
raise
finally:
await self._save_request(request)
await self._notify_user(request)
return request
async def _execute_deletion(self, request: DeletionRequest):
"""Execute deletion across all systems."""
for system_type, system_name, priority in sorted(
self.DELETION_TARGETS, key=lambda x: x[2]
):
target = DeletionTarget(
system_name=system_name,
system_type=system_type,
status="pending"
)
try:
executor = self.executors[system_name]
result = await executor.delete_user_data(
request.user_id,
request.tenant_id
)
target.status = "completed"
target.records_deleted = result.get("records_deleted", 0)
await self.audit.log(
action="deletion_executed",
system=system_name,
user_id=request.user_id,
records_deleted=target.records_deleted
)
            except Exception as e:
                target.status = "failed"
                target.error = str(e)
                raise
            finally:
                # Record every attempted target (including failures) so that
                # verification and the audit trail see the full picture
                request.targets.append(target)
async def _verify_deletion(
self,
request: DeletionRequest
) -> VerificationResult:
"""Verify that deletion was successful."""
issues = []
for target in request.targets:
executor = self.executors[target.system_name]
still_exists = await executor.check_user_exists(
request.user_id,
request.tenant_id
)
if still_exists:
issues.append(f"Data still exists in {target.system_name}")
return VerificationResult(
all_verified=len(issues) == 0,
issues=issues
)
class PostgreSQLDeletionExecutor:
"""
Deletes user data from PostgreSQL.
"""
async def delete_user_data(
self,
user_id: str,
tenant_id: str
) -> dict:
"""Delete user and related data."""
records_deleted = 0
async with self.db.transaction():
# Delete from leaf tables first
# Comments (anonymize, keep content)
result = await self.db.execute(
"""
UPDATE comments
SET user_id = NULL, author_name = 'Deleted User'
WHERE user_id = $1 AND tenant_id = $2
""",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
# Activity logs (anonymize)
result = await self.db.execute(
"""
UPDATE activity_logs
SET user_id = 'DELETED', ip_address = 'DELETED'
WHERE user_id = $1 AND tenant_id = $2
""",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
# Documents (reassign ownership to admin or delete)
result = await self.db.execute(
"""
UPDATE documents
SET owner_id = (
SELECT id FROM users
WHERE tenant_id = $2 AND 'admin' = ANY(roles)
LIMIT 1
)
WHERE owner_id = $1 AND tenant_id = $2
""",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
# Consent records (keep anonymized for audit)
result = await self.db.execute(
"""
UPDATE consent_records
SET user_id = 'DELETED', ip_address = 'DELETED'
WHERE user_id = $1 AND tenant_id = $2
""",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
# Finally, delete user
result = await self.db.execute(
"DELETE FROM users WHERE id = $1 AND tenant_id = $2",
user_id, tenant_id
)
records_deleted += int(result.split()[-1])
return {"records_deleted": records_deleted}
Deep Dive 5: Security Architecture (Week 9, Day 5)
Interviewer: "Let's talk security. How do you protect this system, especially with multiple tenants?"
You: "Security is defense in depth with zero trust principles. Let me walk through the layers."
Security Layers
──────────────────────────────────────────────────────────────────────────
                        SECURITY ARCHITECTURE

  LAYER 1: EDGE SECURITY
    CloudFront (CDN) → WAF → Rate Limiting → DDoS Protection
    ├── OWASP Top 10 rules
    ├── Bot detection
    ├── IP reputation
    └── Geo-blocking (optional per tenant)

  LAYER 2: NETWORK SECURITY
    VPC isolation:
    ├── Public subnet: ALB only
    ├── Private subnet: App servers
    ├── Isolated subnet: Databases
    └── Security groups: Explicit allow only

  LAYER 3: APPLICATION SECURITY
    ├── Authentication (JWT + MFA)
    ├── Authorization (RBAC + tenant isolation)
    ├── Input validation (Pydantic schemas)
    ├── Output encoding
    └── CSRF protection

  LAYER 4: DATA SECURITY
    ├── Encryption in transit (TLS 1.3)
    ├── Encryption at rest (AES-256)
    ├── Per-tenant encryption keys (enterprise)
    ├── Secrets in Vault
    └── Data classification and handling

  LAYER 5: MONITORING & DETECTION
    ├── Audit logging (all access)
    ├── Anomaly detection
    ├── SIEM integration
    └── Incident response automation
──────────────────────────────────────────────────────────────────────────
Authentication and Authorization
# security/auth.py
class AuthenticationService:
"""
Multi-tenant authentication service.
"""
async def authenticate(
self,
email: str,
password: str,
tenant_id: str,
ip_address: str
) -> AuthResult:
"""
Authenticate user within their tenant.
"""
# Rate limiting per IP + tenant
if await self._is_rate_limited(ip_address, tenant_id):
raise AuthError("Too many attempts")
# Find user in tenant
user = await self.db.fetchone(
"""
SELECT id, email, password_hash, roles, mfa_enabled, status
FROM users
WHERE email = $1 AND tenant_id = $2
""",
email.lower(), tenant_id
)
if not user:
await self._record_failed_attempt(ip_address, tenant_id)
raise AuthError("Invalid credentials")
if user["status"] != "active":
raise AuthError("Account disabled")
# Verify password
if not bcrypt.checkpw(password.encode(), user["password_hash"].encode()):
await self._record_failed_attempt(ip_address, tenant_id)
raise AuthError("Invalid credentials")
# Create session
session_id = secrets.token_urlsafe(32)
result = AuthResult(
user_id=user["id"],
tenant_id=tenant_id,
roles=user["roles"],
mfa_required=user["mfa_enabled"],
session_id=session_id
)
# Audit log
await self.audit.log(
action="login_success",
user_id=user["id"],
tenant_id=tenant_id,
ip_address=ip_address
)
return result
async def create_jwt(self, auth_result: AuthResult) -> str:
"""Create JWT with tenant claims."""
signing_key = await self.secrets.get_secret("jwt/signing_key")
payload = {
"sub": auth_result.user_id,
"tenant_id": auth_result.tenant_id,
"roles": auth_result.roles,
"session_id": auth_result.session_id,
"iat": datetime.utcnow(),
"exp": datetime.utcnow() + timedelta(minutes=15)
}
return jwt.encode(payload, signing_key.value, algorithm="RS256")
class AuthorizationService:
"""
Multi-tenant authorization with RBAC.
"""
async def check_document_access(
self,
user_id: str,
tenant_id: str,
document_id: str,
required_permission: str
) -> bool:
"""
Check if user can access document.
Enforces:
1. Tenant isolation (user's tenant == document's tenant)
2. Role-based permission
3. Document-level permission
"""
# Get document
document = await self.db.fetchone(
"""
SELECT tenant_id, owner_id, permissions
FROM documents
WHERE id = $1
""",
document_id
)
if not document:
return False
# CRITICAL: Tenant isolation check
if document["tenant_id"] != tenant_id:
await self.audit.log(
action="access_denied",
reason="tenant_mismatch",
user_id=user_id,
document_id=document_id
)
return False
# Check ownership
if document["owner_id"] == user_id:
return True
# Check document permissions
permissions = document["permissions"] or {}
user_permission = permissions.get(user_id)
if user_permission:
return self._has_permission(user_permission, required_permission)
# Check folder permissions (inherited)
# ... folder permission check logic ...
return False
Comprehensive Audit Logging
# security/audit.py
class AuditService:
"""
Comprehensive audit logging for compliance.
"""
async def log(
self,
action: str,
**context
):
"""
Log an audit event.
All access, modifications, and security events are logged.
"""
        tenant = _tenant_context.get()  # may be None for system-initiated events
event = AuditEvent(
id=str(uuid.uuid4()),
timestamp=datetime.utcnow(),
tenant_id=tenant.tenant_id if tenant else context.get("tenant_id"),
action=action,
actor_id=context.get("user_id"),
actor_type=context.get("actor_type", "user"),
resource_type=context.get("resource_type"),
resource_id=context.get("resource_id"),
ip_address=context.get("ip_address"),
user_agent=context.get("user_agent"),
details=context
)
# Write to regional audit log (immutable)
await self.db.execute(
"""
INSERT INTO audit_logs
(id, timestamp, tenant_id, action, actor_id, actor_type,
resource_type, resource_id, ip_address, user_agent, details)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
""",
event.id, event.timestamp, event.tenant_id, event.action,
event.actor_id, event.actor_type, event.resource_type,
event.resource_id, event.ip_address, event.user_agent,
json.dumps(event.details)
)
# Also stream to Kafka for real-time monitoring
await self.kafka.produce(
topic="audit_events",
key=event.tenant_id,
value=event.to_dict()
)
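For the "immutable" property to actually hold, the application role should have no code path that can modify existing rows. A minimal hardening sketch in standard PostgreSQL, assuming an `app_rw` application role (the role name is an assumption):

# audit_hardening.py - sketch; 'app_rw' is an assumed application role name
HARDENING_SQL = """
    REVOKE UPDATE, DELETE, TRUNCATE ON audit_logs FROM app_rw;
    GRANT INSERT, SELECT ON audit_logs TO app_rw;
"""

async def harden_audit_table(admin_conn):
    # With only INSERT/SELECT granted, even a compromised app credential
    # cannot rewrite history. Retention is enforced by an admin role dropping
    # old time-based partitions, never by row deletes.
    await admin_conn.execute(HARDENING_SQL)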
Phase 5: Scaling and Edge Cases (5 minutes)
Interviewer: "How would this system scale if we went from 500 to 5,000 tenants?"
Scaling Strategy
You: "The architecture is designed to scale horizontally. Here's how each component scales:"
──────────────────────────────────────────────────────────────────────────
                          SCALING STRATEGY

  Component      │ Current         │ 10x Scale       │ How
  ───────────────┼─────────────────┼─────────────────┼────────────────────
  API Servers    │ 20 instances    │ 200 instances   │ Auto-scaling group
  PostgreSQL     │ 2TB per region  │ Sharded         │ By tenant_id
  Elasticsearch  │ 3TB per region  │ 30TB cluster    │ Add nodes
  S3             │ 180TB           │ 1.8PB           │ Automatic
  Redis          │ 50GB cluster    │ 500GB cluster   │ Add shards
  Workers        │ 10 instances    │ 100 instances   │ Queue-based scaling

  KEY SCALING DECISIONS:

  1. Database sharding by tenant_id
     ├── Keeps tenant data together
     ├── Enables tenant-level backup/restore
     └── Large tenants can get dedicated shards

  2. Search index per tenant (for large tenants)
     ├── Avoids hot spots
     ├── Enables tenant-specific tuning
     └── Easier to delete/migrate

  3. Add more regions as needed
     ├── Japan region for Japanese customers
     ├── Australia region for AU/NZ
     └── Each region is independent
──────────────────────────────────────────────────────────────────────────
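For the sharding decision, a directory-based router (a lookup table rather than hash-mod placement) keeps tenant moves cheap. A minimal sketch with illustrative names (`tenant_shards` table, `ShardRouter`):

# shard_router.py - sketch of directory-based shard routing; names assumed
class ShardRouter:
    def __init__(self, control_plane_db, pools_by_shard: dict):
        self.control_db = control_plane_db
        self.pools = pools_by_shard

    async def pool_for_tenant(self, tenant_id: str):
        row = await self.control_db.fetchrow(
            "SELECT shard_id FROM tenant_shards WHERE tenant_id = $1", tenant_id
        )
        # A directory lookup beats hash(tenant_id) % N here: moving one large
        # tenant to a dedicated shard is a single row update plus a data copy,
        # not a cluster-wide reshuffle
        return self.pools[row["shard_id"]]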
Edge Cases
Interviewer: "What edge cases should we handle?"
You: "Several important edge cases:"
EDGE CASES

1. TENANT OFFBOARDING
   ├── Customer cancels subscription
   ├── 30-day grace period (can reactivate)
   ├── Export data to customer
   ├── Then complete deletion
   └── Retain anonymized audit logs

2. LARGE FILE UPLOADS
   ├── Customer uploads 500MB presentation
   ├── Direct-to-S3 upload (presigned URL)
   ├── Chunked upload for resume
   ├── Async processing in background
   └── Progress tracking

3. SEARCH INDEX CORRUPTION
   ├── Elasticsearch index gets corrupted
   ├── Detection: scheduled consistency checks
   ├── Recovery: rebuild from PostgreSQL
   ├── Tenant isolated: only one tenant affected
   └── Automated healing with alerting

4. CROSS-TENANT SHARING (external links)
   ├── User shares document externally
   ├── Generate unique, expiring token
   ├── Token tied to document, not tenant context
   ├── Audit log records external access
   └── Owner can revoke anytime

5. SSO PROVIDER OUTAGE
   ├── Customer's SSO is down
   ├── Fallback to email/password
   ├── Requires pre-configured backup auth
   ├── Audit log notes SSO bypass
   └── Notify tenant admin

6. REGULATORY HOLD
   ├── Legal hold prevents deletion
   ├── Mark documents as "held"
   ├── Deletion requests queued
   ├── User notified of delay
   └── Release when hold lifted
Phase 6: Monitoring and Operations (5 minutes)
Interviewer: "How would you monitor this system in production?"
Key Metrics
──────────────────────────────────────────────────────────────────────────
                        MONITORING DASHBOARD

  BUSINESS METRICS (per tenant)
  ├── Active users                  [████████░░]  8,234 / 10,000
  ├── Documents stored              [████████░░]  4.2M / 5M
  ├── Storage used                  [███████░░░]  720GB / 1TB
  └── API calls today               [████░░░░░░]  42K / 100K

  SYSTEM HEALTH (per region)
  ├── API latency p99               180ms   (target < 200ms)
  ├── Search latency p99            320ms   (target < 500ms)
  ├── Error rate                    0.02%   (target < 0.1%)
  └── Throughput                    800 req/s

  SECURITY METRICS
  ├── Failed login attempts/hour    234
  ├── Cross-tenant access attempts  0   (should ALWAYS be 0!)
  ├── WAF blocked requests          1,234/hour
  └── MFA adoption                  62%

  COMPLIANCE METRICS
  ├── Pending deletion requests     3
  ├── Avg deletion completion time  4.2 days
  ├── Data export requests/week     12
  └── Audit log retention           7 years
──────────────────────────────────────────────────────────────────────────
Alerting Strategy
CRITICAL (PagerDuty, immediate response):
├── Cross-tenant data access detected
├── Database replication lag > 1 minute
├── Error rate > 1%
├── Any region unreachable
└── Security incident detected

WARNING (Slack, business hours):
├── Tenant approaching quota limits
├── Search latency > 400ms p99
├── Deletion request approaching deadline
├── Failed login spike for tenant
└── Certificate expiring < 30 days

INFO (dashboard only):
├── New tenant onboarded
├── Large file upload completed
├── Scheduled maintenance
└── Feature flag changed
Interview Conclusion
Interviewer: "This is a comprehensive design. You've covered multi-tenancy, compliance, and security thoroughly. Any final thoughts?"
You: "A few things I'd prioritize for implementation:
- Start with tenant isolation: get this right first; it's the foundation
- Build compliance into the architecture: retrofitting GDPR is painful
- Invest in observability early: per-tenant metrics from day one
- Plan for enterprise features: dedicated resources, custom encryption
- Security as code: infrastructure as code, security policies as code
The key insight is that multi-tenant SaaS is harder than single-tenant because every feature needs to consider isolation, fairness, and compliance from the start."
Interviewer: "Excellent. Thanks for walking through this with me."
Summary: Week 9 Concepts Applied
Concepts by Day
| Day | Topic | Application in Design |
|---|---|---|
| Day 1 | Tenant Isolation | Multi-layer isolation (app, DB RLS, storage paths), enterprise dedicated resources |
| Day 2 | Noisy Neighbor | Per-tenant quotas, rate limiting, fair scheduling, query guards |
| Day 3 | Data Residency | Regional data planes, no cross-region data flow, consent management |
| Day 4 | Right to Deletion | Deletion orchestration, verification, anonymization vs delete |
| Day 5 | Security | Defense in depth, zero trust, encryption layers, audit logging |
Code Patterns Demonstrated
1. TENANT CONTEXT MANAGEMENT
   ├── Immutable TenantContext dataclass
   ├── ContextVar for thread-safe propagation
   ├── Middleware sets context from JWT/subdomain
   └── All services use get_current_tenant()

2. REPOSITORY PATTERN WITH ISOLATION
   ├── TenantAwareRepository base class
   ├── Auto-adds tenant_id to all queries
   ├── RLS as database-level backup
   └── No raw SQL without tenant filter

3. REGIONAL DATA ROUTING
   ├── RegionalDataRouter for all data ops
   ├── Tenant config determines region
   ├── Each region has full stack
   └── Global control plane for metadata only

4. DELETION ORCHESTRATION
   ├── DeletionService coordinates
   ├── System-specific executors
   ├── Verification confirms deletion
   └── Audit trail survives deletion

5. DEFENSE IN DEPTH SECURITY
   ├── Edge → Network → App → Data layers
   ├── Each layer assumes others might fail
   ├── Zero trust between services
   └── Comprehensive audit logging
Self-Assessment Checklist
After studying this capstone, you should be able to:
- Design multi-tenant systems with proper data isolation
- Implement Row-Level Security in PostgreSQL
- Build per-tenant rate limiting and quota systems
- Architect for data residency requirements
- Handle GDPR consent, export, and deletion
- Design defense-in-depth security architecture
- Implement comprehensive audit logging
- Scale multi-tenant systems horizontally
- Handle edge cases like tenant offboarding
- Monitor multi-tenant systems with per-tenant metrics
This capstone integrates all concepts from Week 9: Multi-Tenancy, Security, and Compliance. Use this as a template for approaching enterprise SaaS system design interviews.