Himanshu Kukreja

Week 9 Capstone: Design a Multi-Tenant Enterprise SaaS Platform

🎯 A Complete System Design Interview Integrating Everything You've Learned


The Interview Begins

You walk into the interview room at a well-known B2B SaaS company. The interviewer, a Staff Engineer, greets you warmly.

Interviewer: "Thanks for coming in today. We're going to work through a system design problem that reflects challenges we face here. I want to see how you think through complex multi-tenant systems. Feel free to ask questions β€” this is collaborative."

They turn to the whiteboard and write:

╔══════════════════════════════════════════════════════════════════════════╗
║                                                                          ║
║           Design: Enterprise Document Management Platform                ║
║                                                                          ║
║   You're building a B2B SaaS platform where companies can:               ║
║                                                                          ║
║   • Store and organize documents (contracts, policies, reports)          ║
║   • Search across all their documents with full-text search              ║
║   • Collaborate on documents with comments and version history           ║
║   • Set granular permissions (who can view/edit/delete)                  ║
║   • Generate audit trails for compliance                                 ║
║   • Export their data for compliance or migration                        ║
║                                                                          ║
║   Customer profile:                                                      ║
║   • 500 enterprise customers (tenants)                                   ║
║   • Mix of US, EU, and APAC customers                                    ║
║   • Some customers in regulated industries (healthcare, finance)         ║
║   • Largest customer has 50,000 users and 10M documents                  ║
║   • Smallest customers have 50 users and 10K documents                   ║
║                                                                          ║
║   Key concerns from sales:                                               ║
║   • EU customers asking about GDPR and data residency                    ║
║   • Healthcare customers need HIPAA compliance                           ║
║   • Enterprise customers want SSO and audit logs                         ║
║   • Several deals lost because "search was too slow"                     ║
║                                                                          ║
╚══════════════════════════════════════════════════════════════════════════╝

Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."


Phase 1: Requirements Clarification (5 minutes)

You take a breath and start asking questions.

You: "Before I dive into the design, I'd like to clarify some requirements. Let me start with scale and then move to compliance."

Clarifying Questions

You: "For scale β€” you mentioned 500 tenants with the largest having 10M documents. What's the average document size, and what's our total storage footprint?"

Interviewer: "Average document is about 500KB. Mix of PDFs, Word docs, and some larger files like presentations. Total storage across all tenants is about 50TB, growing 20% annually."

You: "What's our request volume? How many searches, uploads, downloads per day?"

Interviewer: "Peak hours see about 1,000 requests per second across all tenants. Searches are the most common operation β€” maybe 60% of traffic. Uploads are bursty, especially end of quarter when companies finalize contracts."

You: "For the EU data residency requirement β€” do EU customers need all their data in the EU, or just personal data?"

Interviewer: "Great question. Our legal team says all document content and metadata for EU customers must stay in EU. Some operational data like aggregate metrics can be global."

You: "For the regulated industries β€” are we targeting HIPAA certification, or just 'HIPAA-ready' architecture?"

Interviewer: "We want the architecture to support HIPAA. Full certification is a business decision, but the technical foundation must be there. Same for SOC 2."

You: "One more β€” for the largest customers, do they get dedicated infrastructure, or is everyone on shared infrastructure with isolation?"

Interviewer: "We want to offer both. Most customers on shared infrastructure with strong isolation. Enterprise tier customers can opt for dedicated resources at premium pricing."

You: "Perfect. Let me summarize the requirements."

Functional Requirements

1. DOCUMENT MANAGEMENT
   ├── Upload documents (PDF, Word, Excel, images)
   ├── Organize in folders/hierarchies
   ├── Version history with rollback
   ├── Preview and download
   └── Bulk operations (move, delete, export)

2. SEARCH
   ├── Full-text search across document content
   ├── Metadata search (author, date, tags)
   ├── Filters and facets
   ├── Search within folders
   └── < 500ms p99 latency for search

3. COLLABORATION
   ├── Comments on documents
   ├── @mentions and notifications
   ├── Sharing with granular permissions
   ├── Activity feed per document
   └── Real-time presence (who's viewing)

4. ACCESS CONTROL
   ├── Role-based permissions (viewer, editor, admin)
   ├── Folder-level and document-level permissions
   ├── External sharing with expiring links
   ├── SSO integration (SAML, OIDC)
   └── MFA support

5. COMPLIANCE
   ├── Complete audit trail (who did what, when)
   ├── Data export (GDPR portability)
   ├── Data deletion (GDPR right to erasure)
   ├── Retention policies
   └── Legal hold capability

Non-Functional Requirements

1. SCALE
   ├── 500 tenants, growing to 2,000
   ├── 50TB storage, growing 20% annually
   ├── 1,000 requests/second peak
   ├── Largest tenant: 50K users, 10M documents
   └── Support 100K concurrent users globally

2. LATENCY
   ├── Search: < 500ms p99
   ├── Document preview: < 2s p99
   ├── Upload (10MB): < 5s p99
   └── API calls: < 200ms p99

3. AVAILABILITY
   ├── 99.9% uptime SLA
   ├── No single point of failure
   ├── Graceful degradation
   └── < 4 hours RTO, < 1 hour RPO

4. COMPLIANCE
   ├── EU data residency for EU customers
   ├── HIPAA-ready architecture
   ├── SOC 2 Type II controls
   └── GDPR compliance (consent, deletion, portability)

5. SECURITY
   ├── Encryption at rest and in transit
   ├── Tenant isolation (data and resources)
   ├── Zero trust architecture
   └── Regular security audits

Phase 2: Back of the Envelope Estimation (5 minutes)

You: "Let me work through the numbers to validate our architecture decisions."

Storage Estimation

DOCUMENT STORAGE

Current state:
  Total documents:           ~100M (across all tenants)
  Average document size:     500KB
  Total storage:             100M × 500KB = 50TB

  With metadata overhead:    50TB × 1.2 = 60TB
  With 3x replication:       60TB × 3 = 180TB raw storage

Growth projection (20% annually):
  Year 1:                    60TB
  Year 2:                    72TB
  Year 3:                    86TB
  Year 5:                    ~124TB

Per-tenant storage:
  Average tenant:            100GB (200K documents)
  Largest tenant:            5TB (10M documents)
  Smallest tenant:           5GB (10K documents)

Traffic Estimation

REQUEST TRAFFIC

Peak requests:               1,000/second
Daily requests:              ~50M (8 peak hours at 1,000/sec ≈ 29M, plus off-peak traffic)

Breakdown by operation:
├── Search:                  600/sec (60%)
├── Read/download:           250/sec (25%)
├── Write/upload:            100/sec (10%)
└── Other (permissions, etc): 50/sec (5%)

Search index size:
  Average document text:     10KB extracted text
  Total index size:          100M × 10KB = 1TB
  With Elasticsearch overhead: ~3TB
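
These estimates are easy to sanity-check in code. A quick sketch (decimal units, 1TB = 10^12 bytes, with all inputs taken from the numbers above):

# estimation_check.py - sanity-check the back-of-the-envelope math

DOCS = 100e6                 # total documents across all tenants
AVG_DOC_BYTES = 500e3        # 500KB average document
TEXT_BYTES = 10e3            # 10KB extracted text per document

raw_tb = DOCS * AVG_DOC_BYTES / 1e12        # 50.0 TB
with_meta_tb = raw_tb * 1.2                 # 60.0 TB
replicated_tb = with_meta_tb * 3            # 180.0 TB raw storage

# 20% annual growth from the 60TB baseline
for year in range(1, 6):
    print(f"Year {year}: {with_meta_tb * 1.2 ** (year - 1):.0f}TB")

index_tb = DOCS * TEXT_BYTES / 1e12         # 1.0 TB of extracted text
print(f"ES index with ~3x overhead: {index_tb * 3:.0f}TB")

# Peak traffic split at 1,000 requests/second
for op, share in {"search": 0.60, "read": 0.25,
                  "write": 0.10, "other": 0.05}.items():
    print(f"{op}: {1000 * share:.0f}/sec")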

Infrastructure Estimation

┌────────────────────────────────────────────────────────────────────────┐
│                    INFRASTRUCTURE ESTIMATE                             │
│                                                                        │
│  COMPUTE                                                               │
│  ├── API servers:              20 instances (m5.xlarge)                │
│  ├── Search cluster:           9 nodes (r5.2xlarge) - 3 per region     │
│  ├── Background workers:       10 instances (m5.large)                 │
│  └── Total:                    ~40 instances                           │
│                                                                        │
│  STORAGE                                                               │
│  ├── PostgreSQL:               Multi-AZ, 2TB per region                │
│  ├── Elasticsearch:            3TB per region                          │
│  ├── S3:                       180TB (with replication)                │
│  └── Redis:                    50GB cluster                            │
│                                                                        │
│  REGIONS                                                               │
│  ├── US (us-east-1):           Primary for US customers                │
│  ├── EU (eu-central-1):        Primary for EU customers                │
│  └── APAC (ap-southeast-1):    Primary for APAC customers              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Phase 3: High-Level Design (10 minutes)

You: "Now let me sketch out the high-level architecture. Given our multi-region, multi-tenant requirements, I'll design for regional data isolation with a global control plane."

System Architecture

┌───────────────────────────────────────────────────────────────────────┐
│                    HIGH-LEVEL ARCHITECTURE                            │
│                                                                       │
│                        ┌─────────────────────┐                        │
│                        │   Global Control    │                        │
│                        │   Plane             │                        │
│                        │   ├── Tenant Config │                        │
│                        │   ├── Routing Rules │                        │
│                        │   └── Feature Flags │                        │
│                        └──────────┬──────────┘                        │
│                                   │                                   │
│         ┌─────────────────────────┼─────────────────────────┐         │
│         │                         │                         │         │
│         ▼                         ▼                         ▼         │
│  ┌─────────────┐          ┌─────────────┐          ┌─────────────┐    │
│  │  US Region  │          │  EU Region  │          │ APAC Region │    │
│  │             │          │             │          │             │    │
│  │ ┌─────────┐ │          │ ┌─────────┐ │          │ ┌─────────┐ │    │
│  │ │   CDN   │ │          │ │   CDN   │ │          │ │   CDN   │ │    │
│  │ └────┬────┘ │          │ └────┬────┘ │          │ └────┬────┘ │    │
│  │      │      │          │      │      │          │      │      │    │
│  │ ┌────▼────┐ │          │ ┌────▼────┐ │          │ ┌────▼────┐ │    │
│  │ │   WAF   │ │          │ │   WAF   │ │          │ │   WAF   │ │    │
│  │ └────┬────┘ │          │ └────┬────┘ │          │ └────┬────┘ │    │
│  │      │      │          │      │      │          │      │      │    │
│  │ ┌────▼────┐ │          │ ┌────▼────┐ │          │ ┌────▼────┐ │    │
│  │ │   ALB   │ │          │ │   ALB   │ │          │ │   ALB   │ │    │
│  │ └────┬────┘ │          │ └────┬────┘ │          │ └────┬────┘ │    │
│  │      │      │          │      │      │          │      │      │    │
│  │ ┌────▼────┐ │          │ ┌────▼────┐ │          │ ┌────▼────┐ │    │
│  │ │   API   │ │          │ │   API   │ │          │ │   API   │ │    │
│  │ │ Cluster │ │          │ │ Cluster │ │          │ │ Cluster │ │    │
│  │ └────┬────┘ │          │ └────┬────┘ │          │ └────┬────┘ │    │
│  │      │      │          │      │      │          │      │      │    │
│  │ ┌────┴────┐ │          │ ┌────┴────┐ │          │ ┌────┴────┐ │    │
│  │ │         │ │          │ │         │ │          │ │         │ │    │
│  │ ▼         ▼ │          │ ▼         ▼ │          │ ▼         ▼ │    │
│  │┌───┐   ┌───┐│          │┌───┐   ┌───┐│          │┌───┐   ┌───┐│    │
│  ││ PG│   │ ES││          ││ PG│   │ ES││          ││ PG│   │ ES││    │
│  │└───┘   └───┘│          │└───┘   └───┘│          │└───┘   └───┘│    │
│  │      │      │          │      │      │          │      │      │    │
│  │ ┌────┴────┐ │          │ ┌────┴────┐ │          │ ┌────┴────┐ │    │
│  │ │   S3    │ │          │ │   S3    │ │          │ │   S3    │ │    │
│  │ │ Bucket  │ │          │ │ Bucket  │ │          │ │ Bucket  │ │    │
│  │ └─────────┘ │          │ └─────────┘ │          │ └─────────┘ │    │
│  │             │          │             │          │             │    │
│  │  US DATA    │          │  EU DATA    │          │ APAC DATA   │    │
│  │  ONLY       │          │  ONLY       │          │ ONLY        │    │
│  └─────────────┘          └─────────────┘          └─────────────┘    │
│                                                                       │
│  NO CROSS-REGION DATA FLOW FOR TENANT DATA                            │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

Component Breakdown

You: "Let me walk through each major component and the key design decisions."

1. Global Control Plane

Purpose: Manage tenant configuration and routing without storing personal data.

Global Control Plane contains:
├── Tenant registry (which region each tenant is in)
├── Feature flags (which features enabled per tenant)
├── Plan/quota configuration
├── Routing rules
└── Global admin interface

Does NOT contain:
├── User data
├── Document content
├── Any personal information
└── Audit logs (those stay regional)

Technology: Single PostgreSQL with read replicas in each region for low-latency lookups.
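
Every request needs a tenant lookup on this path, so reads should hit the region-local replica and be cached in-process. A minimal sketch of what that lookup could look like (the class name, cache TTL, and tenants schema are illustrative assumptions; it uses the same fetchone-style db interface as the rest of this design):

# control_plane/registry.py - illustrative tenant lookup against a regional replica

import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TenantRecord:
    tenant_id: str
    region: str          # "us" | "eu" | "apac"
    plan: str


class TenantRegistry:
    """
    Reads tenant routing data from the read replica in THIS region,
    with a short in-process cache to keep lookups off the hot path.
    """

    def __init__(self, replica_db, ttl_seconds: int = 60):
        self.db = replica_db
        self.ttl = ttl_seconds
        self._cache = {}   # tenant_id -> (fetched_at, TenantRecord)

    async def get_tenant(self, tenant_id: str) -> Optional[TenantRecord]:
        hit = self._cache.get(tenant_id)
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]

        row = await self.db.fetchone(
            "SELECT tenant_id, region, plan FROM tenants WHERE tenant_id = $1",
            tenant_id
        )
        if row is None:
            return None

        record = TenantRecord(row["tenant_id"], row["region"], row["plan"])
        self._cache[tenant_id] = (time.monotonic(), record)
        return record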

2. Regional Data Plane

Purpose: Store and process all tenant data within the region.

Each region has independent:
├── API cluster (stateless, horizontally scaled)
├── PostgreSQL (users, document metadata, permissions)
├── Elasticsearch (search index)
├── S3 (document storage)
├── Redis (cache, sessions, rate limiting)
└── Kafka (event streaming, audit logs)

Key Decision: Complete data isolation. EU tenant data never leaves EU region.

3. Document Processing Pipeline

Document Upload Flow:

User uploads document
        │
        ▼
┌───────────────────┐
│   API Gateway     │ ← Validates, rate limits
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Upload Service   │ ← Generates presigned S3 URL
└─────────┬─────────┘
          │
          ├───────────────────┐
          │                   │
          ▼                   ▼
┌───────────────────┐  ┌───────────────────┐
│   S3 (encrypted)  │  │   PostgreSQL      │
│   Store document  │  │   Store metadata  │
└─────────┬─────────┘  └───────────────────┘
          │
          ▼
┌───────────────────┐
│  Processing Queue │ ← Async processing
└─────────┬─────────┘
          │
    ┌─────┴─────┐
    │           │
    ▼           ▼
┌────────┐  ┌──────────┐
│ Text   │  │Thumbnail │
│Extract │  │Generate  │
└───┬────┘  └──────────┘
    │
    ▼
┌───────────────────┐
│  Elasticsearch    │ ← Index for search
│  Update index     │
└───────────────────┘
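
The "Generates presigned S3 URL" step is worth making concrete: the server, not the client, chooses the object key, so the URL can only ever write inside the tenant's prefix. A sketch using boto3 (bucket name and expiry are illustrative values):

# upload/presign.py - illustrative presigned upload URL, scoped to the tenant prefix

import boto3

s3 = boto3.client("s3")

def presigned_upload_url(
    tenant_id: str,
    document_id: str,
    bucket: str = "documents-us-east-1",   # resolved from the tenant's region
    expires_seconds: int = 900
) -> str:
    """
    The server chooses the key, forcing it under the tenant's prefix,
    so the resulting URL cannot write into another tenant's space.
    """
    key = f"tenants/{tenant_id}/documents/{document_id}"
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds
    )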

4. Search Architecture

Search Flow:

User searches "contract 2024"
        │
        ▼
┌───────────────────┐
│   API Gateway     │
└─────────┬─────────┘
          │
          ▼
┌───────────────────────────────────────┐
│  Search Service                       │
│  ├── Add tenant_id filter (CRITICAL!) │
│  ├── Add permission filter            │
│  └── Build ES query                   │
└─────────┬─────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────┐
│  Elasticsearch                      │
│  ├── Tenant-specific index, OR      │
│  └── Filtered query with tenant_id  │
└─────────┬───────────────────────────┘
          │
          ▼
┌───────────────────┐
│  Results with     │
│  permission check │
└───────────────────┘
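
Concretely, "Add tenant_id filter (CRITICAL!)" means the Search Service attaches the tenant filter server-side from the request context; it is never accepted from the client. One way the query body could be built (field names are assumptions for illustration):

# search/query_builder.py - illustrative tenant-filtered Elasticsearch query

from typing import Optional

def build_search_query(tenant_id: str, text: str,
                       folder_id: Optional[str] = None) -> dict:
    """
    tenant_id comes from the request context (never the client) and is
    applied in a filter clause, which is cacheable and cannot be omitted.
    """
    filters = [{"term": {"tenant_id": tenant_id}}]
    if folder_id:
        filters.append({"term": {"folder_id": folder_id}})

    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                "filter": filters
            }
        },
        "size": 20
    }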

Phase 4: Deep Dives (20 minutes)

Interviewer: "Great high-level design. Let's dive deeper into some areas. First, tell me about tenant isolation. How do you ensure one tenant can never see another's data?"


Deep Dive 1: Tenant Isolation (Week 9, Day 1)

You: "Tenant isolation is the most critical aspect of this system. I'll implement defense in depth with multiple isolation layers."

The Problem

WITHOUT PROPER ISOLATION:

Tenant A searches for "confidential"
        │
        ▼
SELECT * FROM documents
WHERE content LIKE '%confidential%'
        │
        ▼
Returns documents from ALL tenants!
Including Tenant B's confidential contracts.

This is a catastrophic data breach.

Multi-Layer Isolation Strategy

┌────────────────────────────────────────────────────────────────────────┐
│                    TENANT ISOLATION LAYERS                             │
│                                                                        │
│  LAYER 1: REQUEST ROUTING                                              │
│  ├── Tenant determined from subdomain (acme.docplatform.com)           │
│  ├── Or from JWT token claims                                          │
│  ├── Set in request context at API gateway                             │
│  └── CANNOT be overridden by request parameters                        │
│                                                                        │
│  LAYER 2: APPLICATION ENFORCEMENT                                      │
│  ├── TenantContext set for every request                               │
│  ├── Repository layer auto-filters by tenant_id                        │
│  ├── Service layer validates tenant ownership                          │
│  └── Audit log records tenant context                                  │
│                                                                        │
│  LAYER 3: DATABASE ENFORCEMENT (RLS)                                   │
│  ├── Row Level Security policies on all tables                         │
│  ├── Database session has current_tenant set                           │
│  ├── Even raw SQL queries are filtered                                 │
│  └── DBA cannot accidentally see cross-tenant                          │
│                                                                        │
│  LAYER 4: SEARCH INDEX ISOLATION                                       │
│  ├── Option A: Index per tenant (strong isolation)                     │
│  ├── Option B: Shared index with tenant_id filter                      │
│  ├── We use Option A for enterprise, B for standard                    │
│  └── Search service enforces tenant filter                             │
│                                                                        │
│  LAYER 5: STORAGE ISOLATION                                            │
│  ├── S3 path includes tenant_id                                        │
│  ├── IAM policies restrict cross-tenant access                         │
│  ├── Presigned URLs scoped to tenant prefix                            │
│  └── Encryption keys per tenant (enterprise)                           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Implementation

# tenant_isolation.py - Core tenant context management

from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional
from enum import Enum

# Assumed to exist elsewhere in the codebase:
#   decode_jwt(token) -> dict of claims (signature-verified)
#   JSONResponse (e.g. from starlette.responses)


class TenantContextError(RuntimeError):
    """Raised when code touches tenant data without an established context."""


# Thread-safe (and asyncio-safe) tenant context
_tenant_context: ContextVar[Optional['TenantContext']] = ContextVar(
    'tenant_context', default=None
)


class IsolationLevel(Enum):
    """Tenant isolation levels by plan."""
    SHARED = "shared"           # Shared tables, RLS
    SCHEMA = "schema"           # Schema per tenant
    DATABASE = "database"       # Database per tenant
    DEDICATED = "dedicated"     # Dedicated infrastructure


@dataclass(frozen=True)
class TenantContext:
    """Immutable tenant context for request processing."""
    tenant_id: str
    tenant_name: str
    region: str
    plan: str
    isolation_level: IsolationLevel
    encryption_key_id: str
    features: frozenset
    

def get_current_tenant() -> TenantContext:
    """Get current tenant context. Raises if not set."""
    ctx = _tenant_context.get()
    if ctx is None:
        raise TenantContextError("No tenant context set")
    return ctx


class TenantMiddleware:
    """
    Middleware that establishes tenant context for every request.
    
    Tenant is determined from:
    1. Subdomain (acme.docplatform.com → acme)
    2. JWT token claims
    3. API key lookup
    
    NEVER from request body or query parameters.
    """
    
    async def __call__(self, request, call_next):
        # Extract tenant from subdomain
        host = request.headers.get("host", "")
        subdomain = host.split(".")[0]
        
        # Or from JWT
        if not subdomain or subdomain in ["www", "api"]:
            bearer = request.headers.get("authorization", "").replace("Bearer ", "")
            claims = decode_jwt(bearer)
            subdomain = claims.get("tenant_id")
        
        if not subdomain:
            return JSONResponse({"error": "Tenant not identified"}, status_code=400)
        
        # Load tenant configuration
        tenant_config = await self.tenant_service.get_tenant(subdomain)
        
        if not tenant_config:
            return JSONResponse({"error": "Tenant not found"}, status_code=404)
        
        # Set immutable context
        context = TenantContext(
            tenant_id=tenant_config.id,
            tenant_name=tenant_config.name,
            region=tenant_config.region,
            plan=tenant_config.plan,
            isolation_level=IsolationLevel(tenant_config.isolation_level),
            encryption_key_id=tenant_config.encryption_key_id,
            features=frozenset(tenant_config.features)
        )
        
        # Set context for this request
        token = _tenant_context.set(context)
        
        try:
            response = await call_next(request)
            return response
        finally:
            _tenant_context.reset(token)


class TenantAwareRepository:
    """
    Repository base class that enforces tenant isolation.
    
    ALL database access goes through this class.
    """
    
    def __init__(self, db_pool):
        self.db = db_pool
    
    async def find_by_id(self, table: str, id: str):
        """Find record by ID within current tenant."""
        tenant = get_current_tenant()
        
        # ALWAYS filter by tenant_id
        return await self.db.fetchone(
            f"SELECT * FROM {table} WHERE id = $1 AND tenant_id = $2",
            id, tenant.tenant_id
        )
    
    async def find_all(self, table: str, filters: dict = None):
        """Find all records within current tenant."""
        tenant = get_current_tenant()
        
        query = f"SELECT * FROM {table} WHERE tenant_id = $1"
        params = [tenant.tenant_id]
        
        if filters:
            # NOTE: filter keys are column names supplied by trusted code,
            # never end-user input (they are interpolated into the SQL)
            for i, (key, value) in enumerate(filters.items(), start=2):
                query += f" AND {key} = ${i}"
                params.append(value)
        
        return await self.db.fetch(query, *params)
    
    async def create(self, table: str, data: dict):
        """Create record with automatic tenant_id."""
        tenant = get_current_tenant()
        
        # Force tenant_id - cannot be overridden
        data["tenant_id"] = tenant.tenant_id
        
        columns = ", ".join(data.keys())
        placeholders = ", ".join(f"${i+1}" for i in range(len(data)))
        
        return await self.db.fetchone(
            f"INSERT INTO {table} ({columns}) VALUES ({placeholders}) RETURNING *",
            *data.values()
        )

Database Row-Level Security

-- PostgreSQL Row-Level Security Setup

-- Enable RLS on all tenant tables
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
ALTER TABLE folders ENABLE ROW LEVEL SECURITY;
ALTER TABLE comments ENABLE ROW LEVEL SECURITY;
ALTER TABLE audit_logs ENABLE ROW LEVEL SECURITY;

-- Create policy that filters by current tenant
CREATE POLICY tenant_isolation_policy ON documents
    FOR ALL
    USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

CREATE POLICY tenant_isolation_policy ON users
    FOR ALL
    USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

-- Similar for all tables...

-- Application sets tenant before any query
-- SET app.current_tenant_id = 'tenant-uuid-here';

-- Now even this query is filtered:
-- SELECT * FROM documents; 
-- Only returns current tenant's documents!
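
For RLS to take effect, the application has to set app.current_tenant_id on the database session before any query runs. A minimal sketch with asyncpg (the helper name is mine; passing true as the third argument to set_config gives SET LOCAL semantics, so the value is scoped to one transaction and cannot leak across requests sharing a pooled connection):

# rls_session.py - illustrative: bind the tenant to the DB session (asyncpg)

import contextlib

@contextlib.asynccontextmanager
async def tenant_connection(pool, tenant_id: str):
    """
    Acquire a connection and set app.current_tenant_id for the duration
    of one transaction; the setting resets automatically afterwards.
    """
    async with pool.acquire() as conn:
        async with conn.transaction():
            await conn.execute(
                "SELECT set_config('app.current_tenant_id', $1, true)",
                tenant_id
            )
            yield conn

# Usage:
#   async with tenant_connection(pool, tenant.tenant_id) as conn:
#       rows = await conn.fetch("SELECT * FROM documents")  # RLS-filtered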

Interviewer: "What about the large enterprise tenant with 10M documents? Don't they need stronger isolation?"

You: "Absolutely. For enterprise customers, we offer dedicated isolation."

Enterprise Tier Isolation

# enterprise_isolation.py

class EnterpriseTenantManager:
    """
    Manages dedicated resources for enterprise tenants.
    """
    
    async def provision_enterprise_tenant(
        self,
        tenant_id: str,
        config: EnterpriseConfig
    ):
        """
        Provision dedicated resources for enterprise tenant.
        
        Options:
        - Dedicated database
        - Dedicated Elasticsearch index
        - Dedicated encryption keys
        - Dedicated compute (optional)
        """
        
        # 1. Create dedicated database
        if config.dedicated_database:
            await self._create_tenant_database(tenant_id)
        
        # 2. Create dedicated search index
        await self._create_tenant_search_index(tenant_id)
        
        # 3. Create tenant-specific KMS key
        key_id = await self._create_tenant_kms_key(tenant_id)
        
        # 4. Update tenant config
        await self.tenant_service.update_tenant(
            tenant_id,
            isolation_level=IsolationLevel.DATABASE,
            encryption_key_id=key_id,
            database_name=f"tenant_{tenant_id}",
            search_index=f"documents_{tenant_id}"
        )
    
    async def _create_tenant_database(self, tenant_id: str):
        """Create dedicated PostgreSQL database for tenant."""
        safe_name = tenant_id.replace("-", "_")
        
        await self.admin_db.execute(f"""
            CREATE DATABASE tenant_{safe_name};
        """)
        
        # Run migrations on new database
        await self._run_migrations(f"tenant_{safe_name}")
    
    async def _create_tenant_kms_key(self, tenant_id: str) -> str:
        """Create dedicated KMS key for tenant encryption."""
        response = await self.kms.create_key(
            Description=f"Encryption key for tenant {tenant_id}",
            KeyUsage="ENCRYPT_DECRYPT",
            Tags=[
                {"TagKey": "tenant_id", "TagValue": tenant_id},
                {"TagKey": "purpose", "TagValue": "document_encryption"}
            ]
        )
        return response["KeyMetadata"]["KeyId"]

Deep Dive 2: Noisy Neighbor Prevention (Week 9, Day 2)

Interviewer: "You mentioned the largest tenant has 10M documents. How do you prevent them from impacting smaller tenants?"

You: "This is the noisy neighbor problem. I'd implement quotas and fair scheduling at multiple levels."

Resource Quota System

┌────────────────────────────────────────────────────────────────────────┐
│                    QUOTA SYSTEM BY PLAN                                │
│                                                                        │
│  Resource              │ Standard   │ Professional │ Enterprise        │
│  ──────────────────────┼────────────┼──────────────┼─────────────────  │
│  API calls/minute      │ 1,000      │ 10,000       │ 100,000           │
│  Search calls/minute   │ 100        │ 1,000        │ 10,000            │
│  Storage (GB)          │ 100        │ 1,000        │ 10,000            │
│  Documents             │ 100,000    │ 1,000,000    │ Unlimited*        │
│  Users                 │ 100        │ 1,000        │ Unlimited         │
│  Concurrent uploads    │ 5          │ 20           │ 100               │
│  Max file size (MB)    │ 50         │ 200          │ 500               │
│  Search query timeout  │ 10s        │ 30s          │ 60s               │
│  Export rate (docs/hr) │ 10,000     │ 100,000      │ 1,000,000         │
│                                                                        │
│  * "Unlimited" = 100M with fair use policy                             │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Implementation

# noisy_neighbor/rate_limiter.py

class TenantResourceManager:
    """
    Manages and enforces resource quotas per tenant.
    """
    
    def __init__(self, redis, quota_config):
        self.redis = redis
        self.quotas = quota_config
    
    async def check_rate_limit(
        self,
        tenant_id: str,
        resource: str,
        cost: int = 1
    ) -> RateLimitResult:
        """
        Check if request is within rate limits.
        
        Uses a fixed-window counter (60-second window) with per-tenant keys.
        """
        tenant = await self.get_tenant_config(tenant_id)
        limit = self.quotas[tenant.plan][resource]
        
        key = f"ratelimit:{tenant_id}:{resource}"
        
        # Lua script for atomic check-and-decrement
        script = """
        local key = KEYS[1]
        local limit = tonumber(ARGV[1])
        local cost = tonumber(ARGV[2])
        local window = tonumber(ARGV[3])
        
        -- Get current count
        local current = tonumber(redis.call('GET', key) or '0')
        
        if current + cost <= limit then
            redis.call('INCRBY', key, cost)
            redis.call('EXPIRE', key, window)
            return {1, limit - current - cost}
        else
            local ttl = redis.call('TTL', key)
            return {0, ttl}
        end
        """
        
        result = await self.redis.eval(
            script,
            keys=[key],
            args=[limit, cost, 60]
        )
        
        allowed = result[0] == 1
        
        if not allowed:
            await self._record_throttle(tenant_id, resource)
        
        return RateLimitResult(
            allowed=allowed,
            remaining=result[1] if allowed else 0,
            retry_after=result[1] if not allowed else None
        )
    
    async def check_concurrent_limit(
        self,
        tenant_id: str,
        operation: str
    ) -> bool:
        """
        Check concurrent operation limit.
        
        Prevents one tenant from using all worker capacity.
        """
        tenant = await self.get_tenant_config(tenant_id)
        limit = self.quotas[tenant.plan][f"concurrent_{operation}"]
        
        key = f"concurrent:{tenant_id}:{operation}"
        current = await self.redis.get(key) or 0
        
        return int(current) < limit
    
    async def acquire_concurrent_slot(
        self,
        tenant_id: str,
        operation: str,
        operation_id: str,
        ttl: int = 300
    ) -> bool:
        """Acquire a concurrent operation slot."""
        if not await self.check_concurrent_limit(tenant_id, operation):
            return False
        
        key = f"concurrent:{tenant_id}:{operation}"
        await self.redis.sadd(key, operation_id)
        await self.redis.expire(key, ttl)
        return True
    
    async def release_concurrent_slot(
        self,
        tenant_id: str,
        operation: str,
        operation_id: str
    ):
        """Release a concurrent operation slot."""
        key = f"concurrent:{tenant_id}:{operation}"
        await self.redis.srem(key, operation_id)


class SearchQueryGuard:
    """
    Guards search queries against noisy neighbors.
    """
    
    async def guard_search(
        self,
        tenant_id: str,
        query: SearchQuery
    ) -> tuple[bool, Optional[int]]:
        """
        Check if search query should be allowed.
        
        Returns (allowed, timeout_seconds).
        """
        tenant = await self.tenant_service.get_tenant(tenant_id)
        
        # Get tenant's search timeout
        timeout = self.quotas[tenant.plan]["search_timeout"]
        
        # Estimate query cost
        cost = self._estimate_query_cost(query)
        
        if cost > 100:  # High cost query
            # Check if tenant can run expensive queries
            if tenant.plan == "standard":
                return False, None
            
            # Use longer timeout for expensive queries
            timeout = min(timeout * 2, 120)
        
        return True, timeout
    
    def _estimate_query_cost(self, query: SearchQuery) -> int:
        """Estimate query cost based on complexity."""
        cost = 1
        
        # Wildcard queries are expensive
        if "*" in query.text or "?" in query.text:
            cost += 10
        
        # Regex queries are very expensive
        if query.use_regex:
            cost += 50
        
        # Large result sets
        if query.limit > 100:
            cost += query.limit // 100
        
        return cost

Fair Scheduling for Background Jobs

# noisy_neighbor/fair_scheduler.py

class TenantFairQueue:
    """
    Fair queue that prevents one tenant from monopolizing workers.
    
    Uses weighted fair queuing based on tenant plan.
    """
    
    PLAN_WEIGHTS = {
        "standard": 1,
        "professional": 3,
        "enterprise": 10
    }
    
    async def enqueue(
        self,
        tenant_id: str,
        job: Job
    ):
        """Add job to tenant's queue."""
        tenant = await self.tenant_service.get_tenant(tenant_id)
        
        # Check queue depth limit
        current_depth = await self.get_queue_depth(tenant_id)
        max_depth = self.quotas[tenant.plan]["max_queue_depth"]
        
        if current_depth >= max_depth:
            raise QueueFullError(
                f"Queue full for tenant {tenant_id}. "
                f"Current: {current_depth}, Max: {max_depth}"
            )
        
        # Add to tenant's queue
        await self.redis.lpush(
            f"queue:{tenant_id}",
            job.serialize()
        )
    
    async def dequeue(self) -> Optional[tuple[str, Job]]:
        """
        Dequeue next job using weighted fair scheduling.
        
        Higher-weight tenants get proportionally more slots.
        """
        # Get all tenant queues with pending jobs
        tenant_queues = await self._get_active_queues()
        
        if not tenant_queues:
            return None
        
        # Calculate weighted selection
        weighted_tenants = []
        for tenant_id, depth in tenant_queues.items():
            tenant = await self.tenant_service.get_tenant(tenant_id)
            weight = self.PLAN_WEIGHTS.get(tenant.plan, 1)
            # Weight decreases as queue depth increases (fairness)
            adjusted_weight = weight / (1 + depth * 0.1)
            weighted_tenants.append((tenant_id, adjusted_weight))
        
        # Select tenant based on weights
        selected_tenant = self._weighted_random_choice(weighted_tenants)
        
        # Pop job from selected tenant's queue
        job_data = await self.redis.rpop(f"queue:{selected_tenant}")
        
        if job_data:
            return selected_tenant, Job.deserialize(job_data)
        
        return None
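
The _weighted_random_choice helper referenced above isn't shown; one straightforward implementation for TenantFairQueue could be:

# fair_scheduler helper - one possible implementation of the missing method

import random

def _weighted_random_choice(self, weighted: list) -> str:
    """Pick one tenant id with probability proportional to its adjusted weight."""
    tenant_ids = [tenant_id for tenant_id, _ in weighted]
    weights = [weight for _, weight in weighted]
    return random.choices(tenant_ids, weights=weights, k=1)[0]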

Deep Dive 3: Data Residency and GDPR (Week 9, Day 3)

Interviewer: "You mentioned EU customers need data in EU. How exactly does that work, and what about GDPR compliance?"

You: "Data residency and GDPR are handled through regional isolation and comprehensive data management. Let me walk through both."

Regional Data Routing

# data_residency/router.py

class RegionalDataRouter:
    """
    Routes data operations to the correct region based on tenant.
    
    EU tenant data NEVER touches US infrastructure.
    """
    
    REGION_CONFIGS = {
        "us": {
            "database": "postgres://db-us.internal:5432/app",
            "elasticsearch": "https://es-us.internal:9200",
            "s3_bucket": "documents-us-east-1",
            "redis": "redis://cache-us.internal:6379"
        },
        "eu": {
            "database": "postgres://db-eu.internal:5432/app",
            "elasticsearch": "https://es-eu.internal:9200",
            "s3_bucket": "documents-eu-central-1",
            "redis": "redis://cache-eu.internal:6379"
        },
        "apac": {
            "database": "postgres://db-apac.internal:5432/app",
            "elasticsearch": "https://es-apac.internal:9200",
            "s3_bucket": "documents-ap-southeast-1",
            "redis": "redis://cache-apac.internal:6379"
        }
    }
    
    async def get_database_connection(self, tenant_id: str):
        """Get database connection for tenant's region."""
        tenant = await self.tenant_service.get_tenant(tenant_id)
        
        # One pool per region, created at startup from REGION_CONFIGS
        return await self.connection_pools[tenant.region].acquire()
    
    async def get_storage_bucket(self, tenant_id: str) -> str:
        """Get S3 bucket for tenant's region."""
        tenant = await self.tenant_service.get_tenant(tenant_id)
        return self.REGION_CONFIGS[tenant.region]["s3_bucket"]
    
    async def upload_document(
        self,
        tenant_id: str,
        document_id: str,
        content: bytes
    ) -> str:
        """Upload document to tenant's regional storage."""
        tenant = await self.tenant_service.get_tenant(tenant_id)
        bucket = self.REGION_CONFIGS[tenant.region]["s3_bucket"]
        
        # Path includes tenant for isolation
        key = f"tenants/{tenant_id}/documents/{document_id}"
        
        # Encrypt with tenant's key before upload
        encrypted = await self.encrypt_for_tenant(tenant_id, content)
        
        await self.s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=encrypted,
            ServerSideEncryption="aws:kms",
            SSEKMSKeyId=tenant.encryption_key_id
        )
        
        return f"s3://{bucket}/{key}"

Consent Management

# gdpr/consent.py

class GDPRConsentService:
    """
    Manages user consent for GDPR compliance.
    """
    
    CONSENT_PURPOSES = [
        "service_delivery",      # Required for service
        "analytics",             # Product analytics
        "marketing_email",       # Marketing communications
        "third_party_sharing",   # Sharing with partners
    ]
    
    async def record_consent(
        self,
        user_id: str,
        tenant_id: str,
        purpose: str,
        granted: bool,
        ip_address: str,
        consent_text: str
    ) -> ConsentRecord:
        """
        Record a consent decision.
        
        Creates immutable audit record.
        """
        record = ConsentRecord(
            id=str(uuid.uuid4()),
            user_id=user_id,
            tenant_id=tenant_id,
            purpose=purpose,
            status="granted" if granted else "denied",
            granted_at=datetime.utcnow() if granted else None,
            ip_address=ip_address,
            consent_text=consent_text,
            consent_version="2024-01"
        )
        
        # Store in regional database
        await self.db.execute(
            """
            INSERT INTO consent_records 
            (id, user_id, tenant_id, purpose, status, granted_at,
             ip_address, consent_text, consent_version, created_at)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
            """,
            record.id, record.user_id, record.tenant_id,
            record.purpose, record.status, record.granted_at,
            record.ip_address, record.consent_text,
            record.consent_version, datetime.utcnow()
        )
        
        # Publish event for downstream systems
        await self.events.publish("consent", {
            "type": "consent.recorded",
            "user_id": user_id,
            "purpose": purpose,
            "granted": granted
        })
        
        return record
    
    async def check_consent(
        self,
        user_id: str,
        tenant_id: str,
        purpose: str
    ) -> bool:
        """Check if user has consented to a purpose."""
        result = await self.db.fetchone(
            """
            SELECT status FROM consent_records
            WHERE user_id = $1 AND tenant_id = $2 AND purpose = $3
            ORDER BY created_at DESC
            LIMIT 1
            """,
            user_id, tenant_id, purpose
        )
        
        return result is not None and result["status"] == "granted"

Data Export (Right to Portability)

# gdpr/export.py

class GDPRDataExporter:
    """
    Exports user data for GDPR portability requests.
    """
    
    async def export_user_data(
        self,
        user_id: str,
        tenant_id: str
    ) -> DataExportResult:
        """
        Export all user's personal data.
        
        GDPR Article 20: Right to data portability
        """
        export_id = str(uuid.uuid4())
        
        # Collect data from all sources
        data = {
            "export_metadata": {
                "export_id": export_id,
                "exported_at": datetime.utcnow().isoformat(),
                "user_id": user_id,
                "tenant_id": tenant_id
            },
            "profile": await self._export_profile(user_id, tenant_id),
            "documents": await self._export_documents(user_id, tenant_id),
            "comments": await self._export_comments(user_id, tenant_id),
            "activity_history": await self._export_activity(user_id, tenant_id),
            "consent_records": await self._export_consent(user_id, tenant_id),
        }
        
        # Package everything as a ZIP: JSON manifest plus the actual files
        zip_buffer = await self._create_export_zip(user_id, tenant_id, data)
        
        # Upload to tenant's regional storage
        bucket = await self.router.get_storage_bucket(tenant_id)
        export_key = f"exports/{tenant_id}/{user_id}/{export_id}.zip"
        
        await self.s3.put_object(
            Bucket=bucket,
            Key=export_key,
            Body=zip_buffer.getvalue()
        )
        
        # Generate download link (expires in 7 days)
        download_url = await self.s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": export_key},
            ExpiresIn=604800  # 7 days
        )
        
        return DataExportResult(
            export_id=export_id,
            download_url=download_url,
            expires_at=datetime.utcnow() + timedelta(days=7),
            size_bytes=len(zip_buffer.getvalue())
        )
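
The _create_export_zip helper can be a straightforward in-memory ZIP build. A sketch (the _download_document helper is hypothetical; it stands in for fetching document bytes from the tenant's regional bucket):

# gdpr/export_zip.py - illustrative in-memory ZIP packaging

import io
import json
import zipfile

async def _create_export_zip(self, user_id: str, tenant_id: str,
                             data: dict) -> io.BytesIO:
    """Package the JSON export plus the raw files the user owns."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.json", json.dumps(data, indent=2, default=str))
        for doc in data["documents"]:
            # Hypothetical helper: fetch document bytes from regional storage
            body = await self._download_document(tenant_id, doc["id"])
            archive.writestr(f"documents/{doc['id']}", body)
    buffer.seek(0)
    return buffer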

Deep Dive 4: Right to Deletion (Week 9, Day 4)

Interviewer: "When a user requests deletion, how do you ensure all their data is removed from all those systems?"

You: "Deletion is one of the hardest compliance requirements. I'd implement a coordinated deletion workflow with verification."

Deletion Orchestration

# gdpr/deletion.py

class UserDeletionService:
    """
    Orchestrates user data deletion across all systems.
    """
    
    # Systems in deletion order (dependencies first)
    DELETION_TARGETS = [
        ("cache", "redis", 1),           # Clear cache first
        ("search", "elasticsearch", 2),   # Remove from search
        ("storage", "s3", 3),             # Delete files
        ("analytics", "bigquery", 4),     # Remove from analytics
        ("database", "postgresql", 10),   # Primary DB last
    ]
    
    async def process_deletion_request(
        self,
        user_id: str,
        tenant_id: str,
        requested_by: str
    ) -> DeletionRequest:
        """
        Process a GDPR deletion request.
        
        Must complete within 30 days per GDPR.
        """
        request = DeletionRequest(
            id=str(uuid.uuid4()),
            user_id=user_id,
            tenant_id=tenant_id,
            requested_at=datetime.utcnow(),
            requested_by=requested_by,
            deadline=datetime.utcnow() + timedelta(days=30),
            status="pending"
        )
        
        # Store request
        await self._save_request(request)
        
        # Execute deletion workflow
        try:
            await self._execute_deletion(request)
            
            # Verify deletion
            verification = await self._verify_deletion(request)
            
            if verification.all_verified:
                request.status = "completed"
                request.completed_at = datetime.utcnow()
            else:
                request.status = "partial"
                request.issues = verification.issues
                
        except Exception as e:
            request.status = "failed"
            request.error = str(e)
            raise
        
        finally:
            await self._save_request(request)
            await self._notify_user(request)
        
        return request
    
    async def _execute_deletion(self, request: DeletionRequest):
        """Execute deletion across all systems."""
        
        for system_type, system_name, priority in sorted(
            self.DELETION_TARGETS, key=lambda x: x[2]
        ):
            target = DeletionTarget(
                system_name=system_name,
                system_type=system_type,
                status="pending"
            )
            
            try:
                executor = self.executors[system_name]
                result = await executor.delete_user_data(
                    request.user_id,
                    request.tenant_id
                )
                
                target.status = "completed"
                target.records_deleted = result.get("records_deleted", 0)
                
                await self.audit.log(
                    action="deletion_executed",
                    system=system_name,
                    user_id=request.user_id,
                    records_deleted=target.records_deleted
                )
                
            except Exception as e:
                target.status = "failed"
                target.error = str(e)
                raise
            
            request.targets.append(target)
    
    async def _verify_deletion(
        self,
        request: DeletionRequest
    ) -> VerificationResult:
        """Verify that deletion was successful."""
        issues = []
        
        for target in request.targets:
            executor = self.executors[target.system_name]
            
            still_exists = await executor.check_user_exists(
                request.user_id,
                request.tenant_id
            )
            
            if still_exists:
                issues.append(f"Data still exists in {target.system_name}")
        
        return VerificationResult(
            all_verified=len(issues) == 0,
            issues=issues
        )


class PostgreSQLDeletionExecutor:
    """
    Deletes user data from PostgreSQL.
    """
    
    async def delete_user_data(
        self,
        user_id: str,
        tenant_id: str
    ) -> dict:
        """Delete user and related data."""
        records_deleted = 0
        
        async with self.db.transaction():
            # Delete from leaf tables first
            
            # Comments (anonymize, keep content)
            result = await self.db.execute(
                """
                UPDATE comments 
                SET user_id = NULL, author_name = 'Deleted User'
                WHERE user_id = $1 AND tenant_id = $2
                """,
                user_id, tenant_id
            )
            records_deleted += int(result.split()[-1])
            
            # Activity logs (anonymize)
            result = await self.db.execute(
                """
                UPDATE activity_logs
                SET user_id = 'DELETED', ip_address = 'DELETED'
                WHERE user_id = $1 AND tenant_id = $2
                """,
                user_id, tenant_id
            )
            records_deleted += int(result.split()[-1])
            
            # Documents (reassign ownership to admin or delete)
            result = await self.db.execute(
                """
                UPDATE documents
                SET owner_id = (
                    SELECT id FROM users 
                    WHERE tenant_id = $2 AND 'admin' = ANY(roles)
                    LIMIT 1
                )
                WHERE owner_id = $1 AND tenant_id = $2
                """,
                user_id, tenant_id
            )
            records_deleted += int(result.split()[-1])
            
            # Consent records (keep anonymized for audit)
            result = await self.db.execute(
                """
                UPDATE consent_records
                SET user_id = 'DELETED', ip_address = 'DELETED'
                WHERE user_id = $1 AND tenant_id = $2
                """,
                user_id, tenant_id
            )
            records_deleted += int(result.split()[-1])
            
            # Finally, delete user
            result = await self.db.execute(
                "DELETE FROM users WHERE id = $1 AND tenant_id = $2",
                user_id, tenant_id
            )
            records_deleted += int(result.split()[-1])
        
        return {"records_deleted": records_deleted}

Deep Dive 5: Security Architecture (Week 9, Day 5)

Interviewer: "Let's talk security. How do you protect this system, especially with multiple tenants?"

You: "Security is defense in depth with zero trust principles. Let me walk through the layers."

Security Layers

┌────────────────────────────────────────────────────────────────────────┐
│                    SECURITY ARCHITECTURE                               │
│                                                                        │
│  LAYER 1: EDGE SECURITY                                                │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │  CloudFront (CDN) → WAF → Rate Limiting → DDoS Protection        │ │
│  │  ├── OWASP Top 10 rules                                          │ │
│  │  ├── Bot detection                                               │ │
│  │  ├── IP reputation                                               │ │
│  │  └── Geo-blocking (optional per tenant)                          │ │
│  └──────────────────────────────────────────────────────────────────┘ │
│                                                                        │
│  LAYER 2: NETWORK SECURITY                                             │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │  VPC isolation:                                                  │ │
│  │  ├── Public subnet: ALB only                                     │ │
│  │  ├── Private subnet: App servers                                 │ │
│  │  ├── Isolated subnet: Databases                                  │ │
│  │  └── Security groups: Explicit allow only                        │ │
│  └──────────────────────────────────────────────────────────────────┘ │
│                                                                        │
│  LAYER 3: APPLICATION SECURITY                                         │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │  ├── Authentication (JWT + MFA)                                  │ │
│  │  ├── Authorization (RBAC + tenant isolation)                     │ │
│  │  ├── Input validation (Pydantic schemas)                         │ │
│  │  ├── Output encoding                                             │ │
│  │  └── CSRF protection                                             │ │
│  └──────────────────────────────────────────────────────────────────┘ │
│                                                                        │
│  LAYER 4: DATA SECURITY                                                │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │  ├── Encryption in transit (TLS 1.3)                             │ │
│  │  ├── Encryption at rest (AES-256)                                │ │
│  │  ├── Per-tenant encryption keys (enterprise)                     │ │
│  │  ├── Secrets in Vault                                            │ │
│  │  └── Data classification and handling                            │ │
│  └──────────────────────────────────────────────────────────────────┘ │
│                                                                        │
│  LAYER 5: MONITORING & DETECTION                                       │
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  β”œβ”€β”€ Audit logging (all access)                                 β”‚   β”‚
β”‚  β”‚  β”œβ”€β”€ Anomaly detection                                          β”‚   β”‚
β”‚  β”‚  β”œβ”€β”€ SIEM integration                                           β”‚   β”‚
β”‚  β”‚  └── Incident response automation                               β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
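
Layer 4's "per-tenant encryption keys" can be made concrete with envelope encryption: a master key wraps one data key per tenant, so rotating the master key only re-wraps data keys instead of re-encrypting every document. A minimal sketch using the cryptography package; in production the master key would live in KMS or Vault rather than application memory, and the class name here is illustrative:

# security/encryption.py

from cryptography.fernet import Fernet

class TenantEncryption:
    """Envelope encryption: a master key wraps one data key per tenant."""

    def __init__(self, master_key: bytes):
        self.master = Fernet(master_key)

    def create_tenant_key(self) -> bytes:
        # Generate a fresh data key and return it wrapped (encrypted)
        # by the master key; only the wrapped form is ever stored.
        return self.master.encrypt(Fernet.generate_key())

    def encrypt_document(self, wrapped_key: bytes, plaintext: bytes) -> bytes:
        data_key = self.master.decrypt(wrapped_key)
        return Fernet(data_key).encrypt(plaintext)

    def decrypt_document(self, wrapped_key: bytes, ciphertext: bytes) -> bytes:
        data_key = self.master.decrypt(wrapped_key)
        return Fernet(data_key).decrypt(ciphertext)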

Authentication and Authorization

# security/auth.py

import secrets
from datetime import datetime, timedelta

import bcrypt
import jwt

class AuthenticationService:
    """
    Multi-tenant authentication service.
    """
    
    async def authenticate(
        self,
        email: str,
        password: str,
        tenant_id: str,
        ip_address: str
    ) -> AuthResult:
        """
        Authenticate user within their tenant.
        """
        # Rate limiting per IP + tenant
        if await self._is_rate_limited(ip_address, tenant_id):
            raise AuthError("Too many attempts")
        
        # Find user in tenant
        user = await self.db.fetchone(
            """
            SELECT id, email, password_hash, roles, mfa_enabled, status
            FROM users
            WHERE email = $1 AND tenant_id = $2
            """,
            email.lower(), tenant_id
        )
        
        if not user:
            await self._record_failed_attempt(ip_address, tenant_id)
            raise AuthError("Invalid credentials")
        
        if user["status"] != "active":
            raise AuthError("Account disabled")
        
        # Verify password
        if not bcrypt.checkpw(password.encode(), user["password_hash"].encode()):
            await self._record_failed_attempt(ip_address, tenant_id)
            raise AuthError("Invalid credentials")
        
        # Create session
        session_id = secrets.token_urlsafe(32)
        
        # If MFA is enabled, the caller must verify the second factor
        # before create_jwt is called; this result alone grants nothing.
        result = AuthResult(
            user_id=user["id"],
            tenant_id=tenant_id,
            roles=user["roles"],
            mfa_required=user["mfa_enabled"],
            session_id=session_id
        )
        
        # Audit log
        await self.audit.log(
            action="login_success",
            user_id=user["id"],
            tenant_id=tenant_id,
            ip_address=ip_address
        )
        
        return result
    
    async def create_jwt(self, auth_result: AuthResult) -> str:
        """Create JWT with tenant claims."""
        signing_key = await self.secrets.get_secret("jwt/signing_key")
        
        payload = {
            "sub": auth_result.user_id,
            "tenant_id": auth_result.tenant_id,
            "roles": auth_result.roles,
            "session_id": auth_result.session_id,
            "iat": datetime.utcnow(),
            "exp": datetime.utcnow() + timedelta(minutes=15)
        }
        
        return jwt.encode(payload, signing_key.value, algorithm="RS256")
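
The verification counterpart is symmetric. A sketch, assuming "jwt/public_key" holds the RSA public half of the signing key:

    async def verify_jwt(self, token: str) -> dict:
        """Verify signature and expiry; PyJWT raises on any failure."""
        public_key = await self.secrets.get_secret("jwt/public_key")
        # decode() checks the exp claim automatically.
        return jwt.decode(token, public_key.value, algorithms=["RS256"])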


class AuthorizationService:
    """
    Multi-tenant authorization with RBAC.
    """
    
    async def check_document_access(
        self,
        user_id: str,
        tenant_id: str,
        document_id: str,
        required_permission: str
    ) -> bool:
        """
        Check if user can access document.
        
        Enforces:
        1. Tenant isolation (user's tenant == document's tenant)
        2. Role-based permission
        3. Document-level permission
        """
        # Get document
        document = await self.db.fetchone(
            """
            SELECT tenant_id, owner_id, permissions
            FROM documents
            WHERE id = $1
            """,
            document_id
        )
        
        if not document:
            return False
        
        # CRITICAL: Tenant isolation check
        if document["tenant_id"] != tenant_id:
            await self.audit.log(
                action="access_denied",
                reason="tenant_mismatch",
                user_id=user_id,
                document_id=document_id
            )
            return False
        
        # Check ownership
        if document["owner_id"] == user_id:
            return True
        
        # Check document permissions
        permissions = document["permissions"] or {}
        user_permission = permissions.get(user_id)
        
        if user_permission:
            return self._has_permission(user_permission, required_permission)
        
        # Check folder permissions (inherited)
        # ... folder permission check logic ...
        
        return False
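
AuthenticationService above calls self._is_rate_limited and self._record_failed_attempt without showing them. A minimal fixed-window sketch, assuming the service holds a redis.asyncio client as self.redis; the threshold and key format are illustrative:

    # Continuing AuthenticationService

    MAX_FAILED_ATTEMPTS = 5    # per IP + tenant, per window
    WINDOW_SECONDS = 900       # 15-minute window

    async def _is_rate_limited(self, ip_address: str, tenant_id: str) -> bool:
        count = await self.redis.get(f"login_fail:{tenant_id}:{ip_address}")
        return count is not None and int(count) >= self.MAX_FAILED_ATTEMPTS

    async def _record_failed_attempt(self, ip_address: str, tenant_id: str):
        key = f"login_fail:{tenant_id}:{ip_address}"
        count = await self.redis.incr(key)
        if count == 1:
            # First failure opens the window; the key expires with it.
            await self.redis.expire(key, self.WINDOW_SECONDS)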

Comprehensive Audit Logging

# security/audit.py

import json
import uuid
from datetime import datetime

class AuditService:
    """
    Comprehensive audit logging for compliance.
    """
    
    async def log(
        self,
        action: str,
        **context
    ):
        """
        Log an audit event.
        
        All access, modifications, and security events are logged.
        """
        tenant = get_current_tenant()
        
        event = AuditEvent(
            id=str(uuid.uuid4()),
            timestamp=datetime.utcnow(),
            tenant_id=tenant.tenant_id if tenant else context.get("tenant_id"),
            action=action,
            actor_id=context.get("user_id"),
            actor_type=context.get("actor_type", "user"),
            resource_type=context.get("resource_type"),
            resource_id=context.get("resource_id"),
            ip_address=context.get("ip_address"),
            user_agent=context.get("user_agent"),
            details=context
        )
        
        # Write to regional audit log (immutable)
        await self.db.execute(
            """
            INSERT INTO audit_logs
            (id, timestamp, tenant_id, action, actor_id, actor_type,
             resource_type, resource_id, ip_address, user_agent, details)
            VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
            """,
            event.id, event.timestamp, event.tenant_id, event.action,
            event.actor_id, event.actor_type, event.resource_type,
            event.resource_id, event.ip_address, event.user_agent,
            json.dumps(event.details)
        )
        
        # Also stream to Kafka for real-time monitoring
        await self.kafka.produce(
            topic="audit_events",
            key=event.tenant_id,
            value=event.to_dict()
        )
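
The Kafka stream is what makes the "cross-tenant access attempts" panel in the monitoring section below actionable in real time. A consumer sketch using aiokafka; the pager client and the exact event shape are assumptions based on the AuditEvent above:

# security/detection.py

import json
from aiokafka import AIOKafkaConsumer

async def watch_cross_tenant_access(pager):
    """Page immediately on any tenant-isolation violation attempt."""
    consumer = AIOKafkaConsumer(
        "audit_events",
        bootstrap_servers="kafka:9092",   # illustrative address
        group_id="security-detection",
    )
    await consumer.start()
    try:
        async for msg in consumer:
            event = json.loads(msg.value)
            # AuthorizationService logs action="access_denied" with
            # reason="tenant_mismatch" when a request crosses tenants.
            if (event.get("action") == "access_denied"
                    and event.get("details", {}).get("reason") == "tenant_mismatch"):
                await pager.trigger(      # hypothetical alerting client
                    severity="critical",
                    summary=f"Cross-tenant access attempt: {event}",
                )
    finally:
        await consumer.stop()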

Phase 5: Scaling and Edge Cases (5 minutes)

Interviewer: "How would this system scale if we went from 500 to 5,000 tenants?"

Scaling Strategy

You: "The architecture is designed to scale horizontally. Here's how each component scales:"

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    SCALING STRATEGY                                    β”‚
β”‚                                                                        β”‚
β”‚  Component       β”‚ Current        β”‚ 10x Scale     β”‚ How                β”‚
β”‚  ────────────────┼────────────────┼───────────────┼─────────────────── β”‚
β”‚  API Servers     β”‚ 20 instances   β”‚ 200 instances β”‚ Auto-scaling group β”‚
β”‚  PostgreSQL      β”‚ 2TB per region β”‚ Sharded       β”‚ Shard by tenant_id β”‚
β”‚  Elasticsearch   β”‚ 3TB per region β”‚ 30TB cluster  β”‚ Add nodes          β”‚
β”‚  S3              β”‚ 180TB          β”‚ 1.8PB         β”‚ Automatic          β”‚
β”‚  Redis           β”‚ 50GB cluster   β”‚ 500GB cluster β”‚ Add shards         β”‚
β”‚  Workers         β”‚ 10 instances   β”‚ 100 instances β”‚ Queue-based scale  β”‚
β”‚                                                                        β”‚
β”‚  KEY SCALING DECISIONS:                                                β”‚
β”‚                                                                        β”‚
β”‚  1. Database sharding by tenant_id                                     β”‚
β”‚     β”œβ”€β”€ Keeps tenant data together                                     β”‚
β”‚     β”œβ”€β”€ Enables tenant-level backup/restore                            β”‚
β”‚     └── Large tenants can get dedicated shards                         β”‚
β”‚                                                                        β”‚
β”‚  2. Search index per tenant (for large tenants)                        β”‚
β”‚     β”œβ”€β”€ Avoids hot spots                                               β”‚
β”‚     β”œβ”€β”€ Enables tenant-specific tuning                                 β”‚
β”‚     └── Easier to delete/migrate                                       β”‚
β”‚                                                                        β”‚
β”‚  3. Add more regions as needed                                         β”‚
β”‚     β”œβ”€β”€ Japan region for Japanese customers                            β”‚
β”‚     β”œβ”€β”€ Australia region for AU/NZ                                     β”‚
β”‚     └── Each region is independent                                     β”‚
β”‚                                                                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
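
The first decision, sharding by tenant_id, can be routed with a hash plus an override table so large tenants get dedicated shards. A sketch (shard count and names illustrative):

# sharding/router.py

import hashlib

NUM_SHARDS = 16  # illustrative

def shard_for_tenant(tenant_id: str,
                     dedicated: dict[str, int] | None = None) -> int:
    """Map a tenant to a shard; pinned tenants bypass the hash."""
    if dedicated and tenant_id in dedicated:
        return dedicated[tenant_id]   # large tenant on its own shard
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

The trade-off: modulo hashing makes adding shards a data-migration event, while a directory-style lookup table avoids rebalancing at the cost of an extra hop; the override dict is a small step toward the directory approach.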

Edge Cases

Interviewer: "What edge cases should we handle?"

You: "Several important edge cases:"

EDGE CASES

1. TENANT OFFBOARDING
   └── Customer cancels subscription
       β”œβ”€β”€ 30-day grace period (can reactivate)
       β”œβ”€β”€ Export data to customer
       β”œβ”€β”€ Then complete deletion
       └── Retain anonymized audit logs

2. LARGE FILE UPLOADS
   └── Customer uploads 500MB presentation
       β”œβ”€β”€ Direct-to-S3 upload (presigned URL)
       β”œβ”€β”€ Chunked upload for resume
       β”œβ”€β”€ Async processing in background
       └── Progress tracking

3. SEARCH INDEX CORRUPTION
   └── Elasticsearch index gets corrupted
       β”œβ”€β”€ Detection: Scheduled consistency checks
       β”œβ”€β”€ Recovery: Rebuild from PostgreSQL
       β”œβ”€β”€ Tenant isolated: Only one tenant affected
       └── Automated healing with alerting

4. CROSS-TENANT SHARING (External links)
   └── User shares document externally (sketch after this list)
       β”œβ”€β”€ Generate unique, expiring token
       β”œβ”€β”€ Token tied to document, not tenant context
       β”œβ”€β”€ Audit log records external access
       └── Owner can revoke anytime

5. SSO PROVIDER OUTAGE
   └── Customer's SSO is down
       β”œβ”€β”€ Fallback to email/password
       β”œβ”€β”€ Requires pre-configured backup auth
       β”œβ”€β”€ Audit log notes SSO bypass
       └── Notify tenant admin

6. REGULATORY HOLD
   └── Legal hold prevents deletion
       β”œβ”€β”€ Mark documents as "held"
       β”œβ”€β”€ Deletion requests queued
       β”œβ”€β”€ User notified of delay
       └── Release when hold lifted
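
For edge case 4, the expiring, revocable external link can be as simple as an opaque token stored server-side. A sketch; the external_links table and its columns are assumptions:

# sharing/external_links.py

import secrets
from datetime import datetime, timedelta

class ExternalLinkService:
    def __init__(self, db):
        self.db = db

    async def create_link(self, document_id: str, created_by: str,
                          ttl_hours: int = 72) -> str:
        token = secrets.token_urlsafe(32)   # unguessable, opaque
        await self.db.execute(
            """
            INSERT INTO external_links
            (token, document_id, created_by, expires_at, revoked)
            VALUES ($1, $2, $3, $4, FALSE)
            """,
            token, document_id, created_by,
            datetime.utcnow() + timedelta(hours=ttl_hours),
        )
        return token

    async def resolve(self, token: str) -> str | None:
        """Return the document_id if the link is live, else None."""
        row = await self.db.fetchone(
            """
            SELECT document_id FROM external_links
            WHERE token = $1 AND revoked = FALSE AND expires_at > $2
            """,
            token, datetime.utcnow(),
        )
        return row["document_id"] if row else None

    async def revoke(self, token: str):
        await self.db.execute(
            "UPDATE external_links SET revoked = TRUE WHERE token = $1",
            token,
        )

Note that resolving a token deliberately bypasses tenant context, matching the design above; the audit log entry for external access would hook into resolve().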

Phase 6: Monitoring and Operations (5 minutes)

Interviewer: "How would you monitor this system in production?"

Key Metrics

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      MONITORING DASHBOARD                              β”‚
β”‚                                                                        β”‚
β”‚  BUSINESS METRICS (per tenant)                                         β”‚
β”‚  β”œβ”€β”€ Active users                     [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘] 8,234/10,000        β”‚
β”‚  β”œβ”€β”€ Documents stored                 [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘] 4.2M / 5M           β”‚
β”‚  β”œβ”€β”€ Storage used                     [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘] 720GB / 1TB         β”‚
β”‚  └── API calls today                  [β–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘] 42K / 100K          β”‚
β”‚                                                                        β”‚
β”‚  SYSTEM HEALTH (per region)                                            β”‚
β”‚  β”œβ”€β”€ API latency p99                  [β–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] 180ms (< 200ms)     β”‚
β”‚  β”œβ”€β”€ Search latency p99               [β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘] 320ms (< 500ms)     β”‚
β”‚  β”œβ”€β”€ Error rate                       [β–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] 0.02% (< 0.1%)      β”‚
β”‚  └── Throughput                       [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘] 800 req/s           β”‚
β”‚                                                                        β”‚
β”‚  SECURITY METRICS                                                      β”‚
β”‚  β”œβ”€β”€ Failed login attempts/hour       [β–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] 234                 β”‚
β”‚  β”œβ”€β”€ Cross-tenant access attempts     [β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] 0 (should be 0!)    β”‚
β”‚  β”œβ”€β”€ WAF blocked requests             [β–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘] 1,234/hour          β”‚
β”‚  └── MFA adoption                     [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘] 62%                 β”‚
β”‚                                                                        β”‚
β”‚  COMPLIANCE METRICS                                                    β”‚
β”‚  β”œβ”€β”€ Pending deletion requests        [β–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] 3                   β”‚
β”‚  β”œβ”€β”€ Avg deletion completion time     [β–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] 4.2 days            β”‚
β”‚  β”œβ”€β”€ Data export requests/week        [β–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] 12                  β”‚
β”‚  └── Audit log retention              [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 7 years             β”‚
β”‚                                                                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
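
The per-tenant panels above come from labeling every metric with tenant_id. A prometheus_client sketch (metric names illustrative); at 500 tenants the label cardinality is manageable, but it is worth watching as tenant count grows:

# observability/metrics.py

from prometheus_client import Counter, Histogram

API_LATENCY = Histogram(
    "api_request_duration_seconds",
    "API request latency by tenant and route",
    labelnames=["tenant_id", "route"],
)
API_ERRORS = Counter(
    "api_errors_total",
    "API errors by tenant",
    labelnames=["tenant_id"],
)

# Usage inside request handling:
#   with API_LATENCY.labels(tenant_id=tenant.tenant_id, route=route).time():
#       response = await handler(request)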

Alerting Strategy

CRITICAL (PagerDuty - immediate response):
β”œβ”€β”€ Cross-tenant data access detected
β”œβ”€β”€ Database replication lag > 1 minute
β”œβ”€β”€ Error rate > 1%
β”œβ”€β”€ Any region unreachable
└── Security incident detected

WARNING (Slack - business hours):
β”œβ”€β”€ Tenant approaching quota limits
β”œβ”€β”€ Search latency > 400ms p99
β”œβ”€β”€ Deletion request approaching deadline
β”œβ”€β”€ Failed login spike for tenant
└── Certificate expiring < 30 days

INFO (Dashboard only):
β”œβ”€β”€ New tenant onboarded
β”œβ”€β”€ Large file upload completed
β”œβ”€β”€ Scheduled maintenance
└── Feature flag changed

Interview Conclusion

Interviewer: "This is a comprehensive design. You've covered multi-tenancy, compliance, and security thoroughly. Any final thoughts?"

You: "A few things I'd prioritize for implementation:

  1. Start with tenant isolation - Get this right first; it's the foundation
  2. Build compliance into the architecture - Retrofitting GDPR is painful
  3. Invest in observability early - Per-tenant metrics from day one
  4. Plan for enterprise features - Dedicated resources, custom encryption
  5. Security as code - Infrastructure as code, security policies as code

The key insight is that multi-tenant SaaS is harder than single-tenant because every feature needs to consider isolation, fairness, and compliance from the start."

Interviewer: "Excellent. Thanks for walking through this with me."


Summary: Week 9 Concepts Applied

Concepts by Day

Day    Topic              Application in Design
Day 1  Tenant Isolation   Multi-layer isolation (app, DB RLS, storage paths), enterprise dedicated resources
Day 2  Noisy Neighbor     Per-tenant quotas, rate limiting, fair scheduling, query guards
Day 3  Data Residency     Regional data planes, no cross-region data flow, consent management
Day 4  Right to Deletion  Deletion orchestration, verification, anonymization vs delete
Day 5  Security           Defense in depth, zero trust, encryption layers, audit logging

Code Patterns Demonstrated

1. TENANT CONTEXT MANAGEMENT
   β”œβ”€β”€ Immutable TenantContext dataclass
   β”œβ”€β”€ ContextVar for thread-safe propagation
   β”œβ”€β”€ Middleware sets context from JWT/subdomain
   └── All services use get_current_tenant()

2. REPOSITORY PATTERN WITH ISOLATION
   β”œβ”€β”€ TenantAwareRepository base class
   β”œβ”€β”€ Auto-adds tenant_id to all queries
   β”œβ”€β”€ RLS as database-level backup
   └── No raw SQL without tenant filter

3. REGIONAL DATA ROUTING
   β”œβ”€β”€ RegionalDataRouter for all data ops
   β”œβ”€β”€ Tenant config determines region
   β”œβ”€β”€ Each region has full stack
   └── Global control plane for metadata only

4. DELETION ORCHESTRATION
   β”œβ”€β”€ DeletionService coordinates
   β”œβ”€β”€ System-specific executors
   β”œβ”€β”€ Verification confirms deletion
   └── Audit trail survives deletion

5. DEFENSE IN DEPTH SECURITY
   β”œβ”€β”€ Edge β†’ Network β†’ App β†’ Data layers
   β”œβ”€β”€ Each layer assumes others might fail
   β”œβ”€β”€ Zero trust between services
   └── Comprehensive audit logging

Self-Assessment Checklist

After studying this capstone, you should be able to:

  • Design multi-tenant systems with proper data isolation
  • Implement Row-Level Security in PostgreSQL
  • Build per-tenant rate limiting and quota systems
  • Architect for data residency requirements
  • Handle GDPR consent, export, and deletion
  • Design defense-in-depth security architecture
  • Implement comprehensive audit logging
  • Scale multi-tenant systems horizontally
  • Handle edge cases like tenant offboarding
  • Monitor multi-tenant systems with per-tenant metrics

This capstone integrates all concepts from Week 9: Multi-Tenancy, Security, and Compliance. Use this as a template for approaching enterprise SaaS system design interviews.