Himanshu Kukreja

Week 6 — Day 1: Notification Platform — Problem Understanding & High-Level Design

System Design Mastery Series — Practical Application Week


Introduction

Welcome to Week 6 — our first practical application week. For the next five days, we'll design a complete Multi-Channel Notification Platform from scratch.

This isn't just another system design exercise. By the end of this week, you'll:

  • Know how to design a notification system that handles 500M notifications/day
  • Handle every edge case interviewers throw at you
  • Understand how companies like Uber, Slack, and WhatsApp built their systems
  • Write production-ready code for critical components

Today's Theme: "Before you solve it, understand it deeply"

┌────────────────────────────────────────────────────────────────────────┐
│                    DAY 1 ROADMAP                                       │
│                                                                        │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐                │
│  │  Interview   │──▶│   Domain     │──▶│  Estimation  │                │
│  │  Approach    │   │  Deep Dive   │   │              │                │
│  └──────────────┘   └──────────────┘   └──────────────┘                │
│         │                                      │                       │
│         ▼                                      ▼                       │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐                │
│  │  High-Level  │◀──│  Technology  │◀──│    Schema    │                │
│  │  Architecture│   │   Choices    │   │    Design    │                │
│  └──────────────┘   └──────────────┘   └──────────────┘                │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Part I: The Interview Approach

Chapter 1: How to Start a System Design Interview

The first 5-10 minutes of a system design interview are crucial. Most candidates make the mistake of jumping straight into drawing boxes. Let's learn the right approach.

1.1 The Opening

You walk into the interview room. The interviewer writes on the whiteboard:

╔══════════════════════════════════════════════════════════════════════════╗
║                                                                          ║
║              Design a Notification Platform                              ║
║                                                                          ║
║   Users should receive notifications via multiple channels               ║
║   (push, email, SMS, etc.)                                               ║
║                                                                          ║
╚══════════════════════════════════════════════════════════════════════════╝

Interviewer: "Take your time, think about it, and walk me through your approach."

1.2 Don't Start Drawing Yet!

❌ WRONG APPROACH:

*Immediately starts drawing boxes*
"So we'll have an API gateway here, and a queue here, and a database..."

Problems:
- You don't know the requirements
- You're making assumptions
- You'll have to backtrack
- Shows lack of structured thinking


✅ RIGHT APPROACH:

*Takes a breath, grabs marker*
"Before I start designing, I'd like to understand the problem better.
Let me ask a few clarifying questions to make sure I'm solving the
right problem..."

Shows:
- Structured thinking
- Real-world experience (requirements always need clarification)
- Communication skills
- Respect for time (no effort wasted on wrong assumptions)

1.3 The Clarifying Questions Framework

Use the FERNS framework for clarifying questions:

F - Functional requirements (What should it do?)
E - Edge cases (What are the tricky scenarios?)
R - Resources/constraints (What are the limits?)
N - Non-functional requirements (How well should it do it?)
S - Scale (How big is it?)

Chapter 2: Requirements Clarification (The Full Dialogue)

Let's walk through exactly how this conversation should go:

2.1 Functional Requirements

You: "Let me start with understanding what the system needs to do. What types of notifications are we sending?"

Interviewer: "We're a fintech company — think of something like Revolut or Cash App. We send transaction notifications, security alerts, marketing campaigns, reminders, and social notifications."

You: "Got it. And which channels do we need to support?"

Interviewer: "Push notifications, email, SMS, in-app notifications, and we'd like WhatsApp in the future."

You: "For push notifications, are we targeting both iOS and Android?"

Interviewer: "Yes, both platforms. We also have a web app that should show in-app notifications."

You: "Can a single notification go to multiple channels? Like, if a suspicious login happens, do we send push AND SMS?"

Interviewer: "Yes, exactly. Critical security alerts should go to multiple channels. Users should also be able to set their preferences for which channels they want."

You: "Speaking of preferences, what level of control do users have?"

Interviewer: "Users can opt-in/out per category, choose preferred channels, set quiet hours, and unsubscribe from marketing. We need to respect these preferences and also comply with regulations."

You: "Do we need to support scheduling notifications for later? Or notification digests?"

Interviewer: "Yes, both. We want to send daily transaction summaries, and some notifications should be scheduled — like payment reminders."

2.2 Scale Questions

You: "Let me understand the scale. How many users do we have?"

Interviewer: "50 million registered users, about 10 million daily active."

You: "And roughly how many notifications per day?"

Interviewer: "On a normal day, about 500 million. But when marketing runs campaigns, we might need to send an extra 10 million notifications within an hour, on top of the normal traffic."

You: "That's a significant spike. Are campaigns pre-scheduled or ad-hoc?"

Interviewer: "Both. Marketing schedules campaigns in advance, but we also do real-time triggered campaigns based on user behavior."

You: "What's the geographic distribution? Are users global or concentrated?"

Interviewer: "Global — US, Europe, Asia. We need to handle time zones properly."

2.3 Non-Functional Requirements

You: "What are the latency requirements? How fast should a notification be delivered after it's triggered?"

Interviewer: "For transaction notifications, within 1-2 seconds. Security alerts should be near-instant. Marketing can be slightly delayed — within a minute is fine."

You: "What about delivery reliability? What's an acceptable failure rate?"

Interviewer: "For critical notifications like security alerts, we need 99.9% delivery. For marketing, we can accept some loss. Overall, we want to track and optimize delivery rates."

You: "If delivery fails on one channel, should we try another?"

Interviewer: "Yes! That's important. If push fails, fall back to SMS for critical notifications."

You: "Any compliance requirements I should know about?"

Interviewer: "GDPR in Europe, CAN-SPAM for emails, and we need an audit trail for all notifications — especially financial ones."

2.4 Edge Cases and Priorities

You: "A few edge cases: What if a user has push disabled on their phone? What if a phone number becomes invalid?"

Interviewer: "Good questions. We need to detect these and either use fallback channels or mark the contact info as invalid."

You: "What's the priority if we have limited resources — should we prioritize transaction notifications over marketing?"

Interviewer: "Absolutely. Transaction and security notifications should never be delayed by marketing campaigns."

You: "One more — do we need to support templates? Or is each notification crafted individually?"

Interviewer: "Templates for sure. We have dozens of notification types, each with templates in multiple languages."


Chapter 3: Summarizing Requirements

You: "Let me summarize what I've understood to make sure we're aligned."

┌────────────────────────────────────────────────────────────────────────┐
│                    REQUIREMENTS SUMMARY                                │
│                                                                        │
│  FUNCTIONAL REQUIREMENTS                                               │
│                                                                        │
│  1. NOTIFICATION TYPES                                                 │
│     ├── Transaction (payment sent/received, refunds)                   │
│     ├── Security (login alerts, password changes, suspicious activity) │
│     ├── Marketing (promotions, new features, campaigns)                │
│     ├── Reminders (bill due, low balance, scheduled payments)          │
│     └── Social (friend requests, payment requests, splits)             │
│                                                                        │
│  2. CHANNELS                                                           │
│     ├── Push (iOS APNs + Android FCM)                                  │
│     ├── Email (transactional + marketing)                              │
│     ├── SMS (critical alerts, OTP)                                     │
│     ├── In-App (notification center, real-time)                        │
│     └── WhatsApp (future, via Business API)                            │
│                                                                        │
│  3. USER PREFERENCES                                                   │
│     ├── Per-category opt-in/out                                        │
│     ├── Channel preferences                                            │
│     ├── Quiet hours (do not disturb)                                   │
│     ├── Frequency settings                                             │
│     └── Unsubscribe (with legal compliance)                            │
│                                                                        │
│  4. ADVANCED FEATURES                                                  │
│     ├── Multi-channel delivery (push + SMS for critical)               │
│     ├── Fallback channels (push fails → try SMS)                       │
│     ├── Scheduled notifications                                        │
│     ├── Digests (daily summaries)                                      │
│     ├── Templates with localization                                    │
│     └── Audit trail                                                    │
│                                                                        │
│  NON-FUNCTIONAL REQUIREMENTS                                           │
│                                                                        │
│  1. SCALE                                                              │
│     ├── 50M users, 10M DAU                                             │
│     ├── 500M notifications/day (normal)                                │
│     ├── 10M notifications/hour (campaign spikes)                       │
│     └── Global distribution (multi-region)                             │
│                                                                        │
│  2. LATENCY                                                            │
│     ├── Transaction notifications: <2 seconds                          │
│     ├── Security alerts: <1 second                                     │
│     └── Marketing: <1 minute                                           │
│                                                                        │
│  3. RELIABILITY                                                        │
│     ├── Critical notifications: 99.9% delivery                         │
│     ├── Retry with fallback channels                                   │
│     └── No duplicate notifications                                     │
│                                                                        │
│  4. COMPLIANCE                                                         │
│     ├── GDPR (EU data protection)                                      │
│     ├── CAN-SPAM (email opt-out)                                       │
│     └── Full audit trail                                               │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Interviewer: "That's a great summary. Let's proceed with the design."


Part II: Domain Deep Dive

Before designing, let's understand the notification domain deeply. This knowledge will inform our design decisions.

Chapter 4: Notification Types and Characteristics

4.1 Notification Classification

NOTIFICATION CLASSIFICATION MATRIX

┌─────────────────┬───────────┬──────────┬───────────┬────────────┐
│ Type            │ Priority  │ Latency  │ Channels  │ Compliance │
├─────────────────┼───────────┼──────────┼───────────┼────────────┤
│ Transaction     │ HIGH      │ <2s      │ Push,InApp│ Audit      │
│ Security        │ CRITICAL  │ <1s      │ Push+SMS  │ Audit,Log  │
│ Marketing       │ LOW       │ <1m      │ Email,Push│ CAN-SPAM   │
│ Reminders       │ MEDIUM    │ Scheduled│ Push,Email│ Opt-out    │
│ Social          │ MEDIUM    │ <5s      │ Push,InApp│ Opt-out    │
└─────────────────┴───────────┴──────────┴───────────┴────────────┘

PRIORITY IMPLICATIONS:

CRITICAL:
├── Never queued behind lower priority
├── Multiple channels simultaneously
├── Maximum retry attempts
├── Alert on delivery failure
└── Examples: Fraud alert, suspicious login

HIGH:
├── Processed before MEDIUM/LOW
├── Single channel with fallback
├── Standard retries
└── Examples: Payment received, transfer complete

MEDIUM:
├── Standard processing
├── Single channel
├── Limited retries
└── Examples: Friend request, reminder

LOW:
├── Can be delayed during high load
├── Can be batched
├── Minimal retries
└── Examples: Marketing, newsletters
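
The priority rules above could be captured as data plus a priority-ordered queue. This is a minimal in-memory sketch: the names `DeliveryPolicy`, `POLICIES`, and `NotificationQueue` and the retry counts are illustrative assumptions, not values from the requirements. A production system would more likely use separate queues or topics per priority.

```python
import heapq
import itertools
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Lower value is served first by the heap.
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass(frozen=True)
class DeliveryPolicy:
    max_retries: int      # assumed numbers, tune per provider
    multi_channel: bool   # send on several channels at once
    allow_fallback: bool  # try another channel on failure
    can_batch: bool       # may be delayed or batched under load


POLICIES = {
    Priority.CRITICAL: DeliveryPolicy(5, True, True, False),
    Priority.HIGH: DeliveryPolicy(3, False, True, False),
    Priority.MEDIUM: DeliveryPolicy(2, False, False, False),
    Priority.LOW: DeliveryPolicy(1, False, False, True),
}


class NotificationQueue:
    """Higher priority always dequeues first; FIFO within a priority."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, priority: Priority, notification_id: str) -> None:
        heapq.heappush(self._heap, (int(priority), next(self._seq), notification_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With this ordering, a fraud alert enqueued after a marketing message is still dequeued before it.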

4.2 Channel Characteristics

Each channel has different characteristics that affect our design:

CHANNEL COMPARISON

┌────────────────────────────────────────────────────────────────────────┐
│                    PUSH NOTIFICATIONS                                  │
│                                                                        │
│  Providers:        FCM (Android), APNs (iOS)                           │
│  Delivery:         Best-effort (phone might be off)                    │
│  Confirmation:     Delivery receipt, NOT read receipt                  │
│  Cost:             Free (infrastructure cost only)                     │
│  Rate Limits:      FCM: 1000/sec per project, APNs: no hard limit      │
│  User Control:     Can disable at OS level                             │
│  Payload:          ~4KB (FCM), ~4KB (APNs)                             │
│  Rich Content:     Images, actions, deep links                         │
│                                                                        │
│  CHALLENGES:                                                           │
│  - Token rotation (need to handle InvalidToken)                        │
│  - User can disable without us knowing                                 │
│  - Delivery not guaranteed (phone off, no internet)                    │
│  - Different APIs for iOS/Android                                      │
└────────────────────────────────────────────────────────────────────────┘
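
Because push payloads are capped at roughly 4KB, it can help to check the size before handing a message to a provider. The sketch below assumes a simplified payload shape (`title`, `body`, `data`) that is hypothetical and not the actual FCM or APNs message format; `fit_push_payload` is an illustrative helper name.

```python
import json

MAX_PUSH_PAYLOAD_BYTES = 4096  # the ~4KB limit quoted above


def payload_size(payload: dict) -> int:
    """Size in bytes of the compact UTF-8 JSON encoding."""
    encoded = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def fit_push_payload(title: str, body: str, data: dict) -> dict:
    """Shorten the body until the payload fits; raise if it can never fit."""
    payload = {"title": title, "body": body, "data": data}
    while payload_size(payload) > MAX_PUSH_PAYLOAD_BYTES and payload["body"]:
        # Drop about 10% of the body on each pass.
        keep = int(len(payload["body"]) * 0.9)
        payload["body"] = payload["body"][:keep]
    if payload_size(payload) > MAX_PUSH_PAYLOAD_BYTES:
        raise ValueError("title and data alone exceed the push payload limit")
    return payload
```

Truncating the body is only one option; another is to send a short push and let the app fetch the full content after opening.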

┌────────────────────────────────────────────────────────────────────────┐
│                    EMAIL                                               │
│                                                                        │
│  Providers:        SendGrid, Amazon SES, Mailgun, Postmark             │
│  Delivery:         Reliable but can go to spam                         │
│  Confirmation:     Delivery, bounce, open, click tracking              │
│  Cost:             $0.10 - $1.00 per 1000 emails                       │
│  Rate Limits:      Varies (SES: 14/sec default, up to 500/sec)         │
│  User Control:     Unsubscribe link required (CAN-SPAM)                │
│  Payload:          Large (HTML, attachments)                           │
│  Rich Content:     Full HTML, images, attachments                      │
│                                                                        │
│  CHALLENGES:                                                           │
│  - Spam filters (reputation management critical)                       │
│  - Bounces (hard vs soft)                                              │
│  - Open/click tracking unreliable (image blocking)                     │
│  - Slow delivery (can take minutes)                                    │
│  - Regulatory compliance                                               │
└────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────┐
│                    SMS                                                 │
│                                                                        │
│  Providers:        Twilio, AWS SNS, Vonage, MessageBird                │
│  Delivery:         High reliability                                    │
│  Confirmation:     Delivery receipt available                          │
│  Cost:             $0.01 - $0.10 per message (expensive!)              │
│  Rate Limits:      Varies by provider and number type                  │
│  User Control:     Opt-out required (reply STOP)                       │
│  Payload:          160 characters (or 70 for Unicode)                  │
│  Rich Content:     None (text only, limited length)                    │
│                                                                        │
│  CHALLENGES:                                                           │
│  - EXPENSIVE (10-1000x the per-message cost of email)                  │
│  - Character limits                                                    │
│  - Phone number validation                                             │
│  - International delivery (country-specific rules)                     │
│  - Carrier filtering (spam detection)                                  │
│  - Number registration requirements (US: A2P 10DLC)                    │
└────────────────────────────────────────────────────────────────────────┘
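
The character limits above determine how many billable segments a message uses. A rough estimator is sketched below. It makes a simplifying assumption: ASCII-only text is treated as GSM-7 and everything else as UCS-2, whereas the real GSM-7 alphabet differs slightly from ASCII. Long messages are split into concatenated segments of 153 (GSM-7) or 67 (UCS-2) characters because each part carries a header.

```python
import math


def sms_segments(text: str) -> int:
    """Estimate the number of SMS segments needed for `text`."""
    if not text:
        return 0
    if text.isascii():
        # Approximation of GSM-7: 160 chars alone, 153 per concatenated part.
        single, multi = 160, 153
        units = len(text)
    else:
        # UCS-2 counts UTF-16 code units, so most emoji take two.
        single, multi = 70, 67
        units = len(text.encode("utf-16-be")) // 2
    if units <= single:
        return 1
    return math.ceil(units / multi)


def sms_cost(text: str, price_per_segment: float) -> float:
    """Cost of one message given a per-segment price (price is an input, not a quote)."""
    return sms_segments(text) * price_per_segment
```

A single non-ASCII character in a template can more than double the cost of a long message, which is one reason to validate SMS templates separately.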

┌────────────────────────────────────────────────────────────────────────┐
│                    IN-APP                                              │
│                                                                        │
│  Delivery:         WebSocket or polling                                │
│  Confirmation:     Immediate (user online) or queued                   │
│  Cost:             Infrastructure only                                 │
│  Rate Limits:      Self-imposed                                        │
│  User Control:     Per-category settings                               │
│  Payload:          Unlimited                                           │
│  Rich Content:     Full UI components                                  │
│                                                                        │
│  CHALLENGES:                                                           │
│  - User must be in app                                                 │
│  - Need notification center for offline users                          │
│  - Real-time sync across devices                                       │
│  - Read/unread state management                                        │
└────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────┐
│                    WHATSAPP (Business API)                             │
│                                                                        │
│  Providers:        WhatsApp Business API (via partners)                │
│  Delivery:         Very high (user's primary messaging app)            │
│  Confirmation:     Delivered, read receipts                            │
│  Cost:             $0.005 - $0.10 per message (template-based)         │
│  Rate Limits:      Tiered based on quality rating                      │
│  User Control:     Must opt-in, can block                              │
│  Payload:          1024 characters, media supported                    │
│  Rich Content:     Images, documents, buttons                          │
│                                                                        │
│  CHALLENGES:                                                           │
│  - Template approval required (24-48 hours)                            │
│  - Strict opt-in requirements                                          │
│  - Quality rating affects rate limits                                  │
│  - 24-hour messaging window for non-template                           │
└────────────────────────────────────────────────────────────────────────┘

4.3 Channel Selection Logic

# Channel Selection Decision Tree

class ChannelSelector:
    """
    Determines which channels to use for a notification.
    
    Considers:
    - Notification priority and type
    - User preferences
    - Channel availability
    - Time of day (quiet hours)
    - Cost optimization
    """
    
    def select_channels(
        self,
        notification: Notification,
        user: User,
        preferences: UserPreferences
    ) -> list[Channel]:
        """
        Select channels for notification delivery.
        
        Returns ordered list of channels to try.
        """
        
        # Step 1: Determine required channels based on notification type
        required = self._get_required_channels(notification)
        
        # Step 2: Filter by user preferences
        allowed = self._filter_by_preferences(required, preferences)
        
        # Step 3: Check quiet hours
        if self._is_quiet_hours(user, preferences):
            allowed = self._filter_quiet_hours_channels(allowed, notification)
        
        # Step 4: Check channel availability (device tokens, etc.)
        available = self._filter_by_availability(allowed, user)
        
        # Step 5: Order by priority (primary channel first, fallbacks after)
        ordered = self._order_channels(available, notification)
        
        # Step 6: Add fallback channels for priorities that allow them
        # (CRITICAL and HIGH, per the priority implications in 4.1)
        if notification.priority in (Priority.CRITICAL, Priority.HIGH):
            ordered = self._add_fallbacks(ordered, user)
        
        return ordered
    
    def _get_required_channels(self, notification: Notification) -> list[Channel]:
        """Get channels based on notification type."""
        
        CHANNEL_MATRIX = {
            NotificationType.SECURITY_ALERT: {
                'primary': [Channel.PUSH, Channel.SMS],  # Both simultaneously
                'fallback': [Channel.EMAIL]
            },
            NotificationType.TRANSACTION: {
                'primary': [Channel.PUSH],
                'fallback': [Channel.SMS, Channel.EMAIL]
            },
            NotificationType.MARKETING: {
                'primary': [Channel.EMAIL],
                'fallback': [Channel.PUSH]  # Only if user opted in
            },
            NotificationType.REMINDER: {
                'primary': [Channel.PUSH],
                'fallback': [Channel.EMAIL]
            },
            NotificationType.SOCIAL: {
                'primary': [Channel.PUSH, Channel.IN_APP],
                'fallback': []
            }
        }
        
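        # Return only the primary channels as a list; the 'fallback'
        # entries are meant to be appended later for eligible priorities.
        return CHANNEL_MATRIX.get(notification.type, {}).get('primary', [])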
    
    def _is_quiet_hours(self, user: User, prefs: UserPreferences) -> bool:
        """Check if current time is in user's quiet hours."""
        if not prefs.quiet_hours_enabled:
            return False
        
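        now = self._get_user_local_time(user).time()
        start, end = prefs.quiet_hours_start, prefs.quiet_hours_end
        if start <= end:
            # Window within a single day, e.g. 13:00-15:00
            return start <= now <= end
        # Window wraps past midnight, e.g. 22:00-07:00
        return now >= start or now <= end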
    
    def _filter_quiet_hours_channels(
        self,
        channels: list[Channel],
        notification: Notification
    ) -> list[Channel]:
        """Filter channels during quiet hours."""
        
        if notification.priority == Priority.CRITICAL:
            # Critical notifications bypass quiet hours
            return channels
        
        # During quiet hours, only use non-intrusive channels
        NON_INTRUSIVE = {Channel.EMAIL, Channel.IN_APP}
        return [c for c in channels if c in NON_INTRUSIVE]
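
Once the selector has produced an ordered channel list, a delivery loop can walk it until one channel succeeds. The following is a self-contained sketch under assumptions: the `Sender` callable signature and the `deliver_with_fallback` helper are hypothetical, and the `Channel` enum is redefined here only so the example runs on its own.

```python
from enum import Enum
from typing import Callable, Optional


class Channel(Enum):
    PUSH = "push"
    SMS = "sms"
    EMAIL = "email"
    IN_APP = "in_app"


# Hypothetical sender signature: (user_id, message) -> True if accepted.
Sender = Callable[[str, str], bool]


def deliver_with_fallback(
    user_id: str,
    message: str,
    ordered_channels: list[Channel],
    senders: dict[Channel, Sender],
) -> Optional[Channel]:
    """Try each channel in order; return the first that succeeds, else None."""
    for channel in ordered_channels:
        send = senders.get(channel)
        if send is None:
            continue  # no provider configured for this channel
        try:
            if send(user_id, message):
                return channel
        except Exception:
            # A provider error counts as a failed attempt; try the next channel.
            continue
    return None
```

This covers sequential fallback only. Channels that must fire simultaneously (push and SMS for security alerts) would be dispatched in parallel instead.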

Chapter 5: Regulatory and Compliance Requirements

5.1 Email Regulations

EMAIL COMPLIANCE

CAN-SPAM (United States):
├── Physical address required in email
├── Clear unsubscribe link (must work for 30 days)
├── Process opt-outs within 10 business days
├── No misleading headers or subject lines
├── Identify message as advertisement
└── Penalty: Up to $46,517 per email violation

GDPR (European Union):
├── Explicit consent required for marketing
├── Easy withdrawal of consent
├── Right to data access and deletion
├── Data processing records
├── Cross-border data transfer restrictions
└── Penalty: Up to €20M or 4% of global revenue

CASL (Canada):
├── Express or implied consent required
├── Clear identification of sender
├── Unsubscribe mechanism
├── Record consent and opt-outs
└── Penalty: Up to $10M per violation


IMPLEMENTATION REQUIREMENTS:

1. Consent Management
   ├── Track consent timestamp and source
   ├── Double opt-in for marketing
   ├── Easy one-click unsubscribe
   └── Consent audit trail

2. Email Content
   ├── Physical address in footer
   ├── Unsubscribe link
   ├── Clear sender identification
   └── Accurate subject lines

3. Data Handling
   ├── Process opt-outs immediately
   ├── Don't share data without consent
   ├── Support data deletion requests
   └── Keep consent records for 3+ years
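
The consent-management requirements above suggest a record that carries its own audit fields. A minimal sketch follows; the `ConsentRecord` class and its field names are assumptions made for illustration, not a schema from this design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    user_id: str
    channel: str                              # e.g. "email", "sms"
    category: str                             # e.g. "marketing"
    granted_at: datetime                      # consent timestamp
    source: str                               # where consent was collected
    confirmed_at: Optional[datetime] = None   # double opt-in confirmation
    revoked_at: Optional[datetime] = None     # set on unsubscribe

    def confirm(self, now: Optional[datetime] = None) -> None:
        self.confirmed_at = now or datetime.now(timezone.utc)

    def revoke(self, now: Optional[datetime] = None) -> None:
        # Opt-outs take effect immediately; the record is kept for audit.
        self.revoked_at = now or datetime.now(timezone.utc)

    def is_active(self, require_double_opt_in: bool = True) -> bool:
        if self.revoked_at is not None:
            return False
        if require_double_opt_in and self.confirmed_at is None:
            return False
        return True
```

Keeping revoked records instead of deleting them preserves the consent audit trail, subject to whatever deletion obligations apply.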

5.2 SMS Regulations

SMS COMPLIANCE

TCPA (United States):
├── Express written consent for marketing SMS
├── Clear opt-in disclosure
├── Easy opt-out (STOP keyword)
├── No SMS to numbers on Do Not Call registry
├── Time restrictions (8am-9pm local time)
└── Penalty: $500-$1,500 per message

A2P 10DLC (US Application-to-Person):
├── Brand registration required
├── Campaign registration required
├── Use case verification
├── Throughput based on trust score
└── Violations can result in filtering


IMPLEMENTATION REQUIREMENTS:

1. Consent
   ├── Separate SMS consent from email
   ├── Document consent method
   ├── Honor STOP requests immediately
   └── Confirm opt-out with final message

2. Content
   ├── Include opt-out instructions
   ├── Identify sender
   ├── No prohibited content
   └── Respect character limits

3. Timing
   ├── Check recipient timezone
   ├── No messages before 8am or after 9pm
   └── Respect frequency limits
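
The timing rule above (no messages before 8am or after 9pm in the recipient's local time) can be sketched as a small scheduling helper. The function names are illustrative assumptions. The timezone is passed in as a `tzinfo`; in practice this would typically be a `zoneinfo.ZoneInfo` built from the user's stored timezone name.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

SEND_WINDOW_START = time(8, 0)   # 8am local
SEND_WINDOW_END = time(21, 0)    # 9pm local


def in_sms_window(now_utc: datetime, recipient_tz: tzinfo) -> bool:
    """True if the recipient's local time is inside the allowed window."""
    local_time = now_utc.astimezone(recipient_tz).time()
    return SEND_WINDOW_START <= local_time < SEND_WINDOW_END


def next_allowed_send(now_utc: datetime, recipient_tz: tzinfo) -> datetime:
    """Return now if sending is allowed, else the next 8am local, in UTC."""
    local = now_utc.astimezone(recipient_tz)
    if SEND_WINDOW_START <= local.time() < SEND_WINDOW_END:
        return now_utc
    send_day = local.date()
    if local.time() >= SEND_WINDOW_END:
        send_day += timedelta(days=1)  # too late today, wait for tomorrow
    local_send = datetime.combine(send_day, SEND_WINDOW_START, tzinfo=recipient_tz)
    return local_send.astimezone(timezone.utc)
```

Whether transactional or security SMS should also be held to this window is a policy decision; the sketch applies it uniformly.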

5.3 Push Notification Best Practices

PUSH NOTIFICATION GUIDELINES (Platform-Enforced)

iOS (Apple):
├── Request permission (shown once!)
├── No silent push abuse
├── Respect notification settings
├── Provisional authorization available
└── Violation: App Store rejection

Android (Google):
├── FCM terms of service
├── No crypto mining in background
├── Battery usage guidelines
├── Data message vs notification message
└── Violation: Play Store suspension


USER EXPERIENCE BEST PRACTICES:

1. Asking for Permission
   ├── Don't ask on first app open
   ├── Explain value before asking
   ├── Use soft-ask first (in-app prompt)
   └── Respect "not now" choice

2. Content
   ├── Relevant and timely
   ├── Clear and concise
   ├── Actionable
   └── Personalized

3. Frequency
   ├── Don't over-notify
   ├── Group related notifications
   ├── Smart timing based on engagement
   └── Track and optimize

Part III: Back of the Envelope Estimation

Chapter 6: Traffic and Storage Calculations

You: "Let me estimate the traffic and storage requirements."

6.1 Traffic Estimation

NOTIFICATION TRAFFIC

Daily volume: 500M notifications
Hourly (average): 500M / 24 = ~21M notifications/hour
Per second (average): 21M / 3600 = ~5,800 notifications/sec

Campaign spike (on top of baseline): 10M notifications/hour
Per second (campaign): 10M / 3600 = ~2,800 notifications/sec

But notifications are bursty:
- Most campaigns target specific times (10am, 2pm, 7pm)
- Transaction notifications spike during business hours
- Some regions have higher activity at certain times

Peak burst: baseline + campaign = 5,800 + 2,800 = ~8,600 notifications/sec (about 1.5x average)
Design for: 10,000 notifications/sec (headroom)
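
The traffic arithmetic can be reproduced in a few lines. This sketch assumes the campaign volume is added on top of the average baseline, which is one reading of the requirements; the variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600

daily_volume = 500_000_000          # notifications per day
campaign_per_hour = 10_000_000      # campaign volume in one hour

avg_per_sec = daily_volume / 24 / SECONDS_PER_HOUR        # about 5,787
campaign_per_sec = campaign_per_hour / SECONDS_PER_HOUR   # about 2,778
combined_per_sec = avg_per_sec + campaign_per_sec         # about 8,565

design_target = 10_000              # notifications per second
headroom = design_target / combined_per_sec - 1           # about 17%

print(f"average:  {avg_per_sec:,.0f}/sec")
print(f"campaign: {campaign_per_sec:,.0f}/sec")
print(f"combined: {combined_per_sec:,.0f}/sec")
print(f"headroom at {design_target:,}/sec: {headroom:.0%}")
```

Under these assumptions the 10,000/sec target leaves a modest margin above the combined load, so sharper regional bursts would need to be absorbed by queueing.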


BREAKDOWN BY TYPE (estimated):

Transaction:     40% = 200M/day = ~2,300/sec avg
Security:         5% = 25M/day  = ~290/sec avg
Marketing:       30% = 150M/day = bursty, up to 5,000/sec
Reminders:       10% = 50M/day  = ~580/sec avg
Social:          15% = 75M/day  = ~870/sec avg


BREAKDOWN BY CHANNEL (estimated):

Push:           60% = 300M/day
Email:          25% = 125M/day
In-App:         10% = 50M/day
SMS:             5% = 25M/day (expensive, used sparingly)
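
The two breakdowns can be derived from the daily total and the percentage shares. A small sketch, using integer percentages to avoid rounding drift; the helper names are illustrative.

```python
SECONDS_PER_DAY = 86_400
DAILY_NOTIFICATIONS = 500_000_000

# Percent shares copied from the estimates above.
TYPE_SHARE_PCT = {
    "transaction": 40, "security": 5, "marketing": 30,
    "reminders": 10, "social": 15,
}
CHANNEL_SHARE_PCT = {"push": 60, "email": 25, "in_app": 10, "sms": 5}


def daily_volumes(total_per_day: int, share_pct: dict[str, int]) -> dict[str, int]:
    """Split a daily total by percentage share."""
    if sum(share_pct.values()) != 100:
        raise ValueError("shares must sum to 100%")
    return {name: total_per_day * pct // 100 for name, pct in share_pct.items()}


def avg_per_second(per_day: int) -> int:
    """Average rate, assuming traffic is spread evenly over the day."""
    return round(per_day / SECONDS_PER_DAY)


by_type = daily_volumes(DAILY_NOTIFICATIONS, TYPE_SHARE_PCT)
by_channel = daily_volumes(DAILY_NOTIFICATIONS, CHANNEL_SHARE_PCT)
```

The per-second averages understate marketing load, since campaign traffic arrives in bursts and is not spread evenly.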


API TRAFFIC:

Notification send requests: ~10,000/sec (peak)
Preference reads: ~5,000/sec (cached)
Preference updates: ~100/sec
Status queries: ~1,000/sec
Template fetches: ~500/sec (cached)

6.2 Storage Estimation

STORAGE REQUIREMENTS

1. NOTIFICATION RECORDS

Per notification:
├── notification_id:     16 bytes (UUID)
├── user_id:            16 bytes (UUID)
├── type:                4 bytes (enum)
├── priority:            4 bytes (enum)
├── channel:             4 bytes (enum)
├── template_id:        16 bytes (UUID)
├── variables:         200 bytes (JSON, avg)
├── status:              4 bytes (enum)
├── created_at:          8 bytes (timestamp)
├── sent_at:             8 bytes (timestamp)
├── delivered_at:        8 bytes (timestamp)
├── metadata:          100 bytes (JSON)
└── Total:            ~400 bytes

Daily: 500M × 400 bytes = 200GB/day
Monthly: 6TB
Yearly: 72TB

Retention: 90 days for queryable, 7 years archived
Active storage: 90 × 200GB = 18TB


2. USER PREFERENCES

Per user:
├── user_id:            16 bytes
├── global_enabled:      1 byte
├── channel_prefs:      50 bytes (JSON)
├── category_prefs:    100 bytes (JSON)
├── quiet_hours:        20 bytes
├── timezone:           32 bytes
├── updated_at:          8 bytes
└── Total:            ~230 bytes

Total: 50M users × 230 bytes = 11.5GB
With indexes: ~20GB


3. DEVICE TOKENS

Per device:
├── token_id:           16 bytes
├── user_id:            16 bytes
├── platform:            4 bytes
├── token:             256 bytes (FCM/APNs token)
├── app_version:        16 bytes
├── created_at:          8 bytes
├── last_used:           8 bytes
└── Total:            ~324 bytes

Avg 2 devices per user: 50M × 2 × 324 = 32GB


4. TEMPLATES

Per template:
├── template_id:        16 bytes
├── name:               64 bytes
├── type:                4 bytes
├── channel:             4 bytes
├── content:          2000 bytes (with localization)
├── variables_schema:  500 bytes
├── version:             4 bytes
├── created_at:          8 bytes
└── Total:           ~2.6KB

~500 templates × 2.6KB = 1.3MB (negligible)


5. AUDIT LOGS

Per log entry:
├── log_id:             16 bytes
├── notification_id:    16 bytes
├── action:             32 bytes
├── details:           200 bytes
├── timestamp:           8 bytes
└── Total:            ~270 bytes

500M notifications × 3 events avg = 1.5B entries/day
Daily: 1.5B × 270 bytes = 400GB/day
Yearly: 146TB (need archival strategy)


STORAGE SUMMARY:

┌───────────────────────────────────────────────────────────────────────┐
│                    STORAGE REQUIREMENTS                               │
│                                                                       │
│  Data Type              │ Size          │ Storage Type                │
│  ───────────────────────┼───────────────┼──────────────────────────── │
│  Notifications (90 day) │ 18 TB         │ PostgreSQL (partitioned)    │
│  User Preferences       │ 20 GB         │ PostgreSQL                  │
│  Device Tokens          │ 32 GB         │ PostgreSQL                  │
│  Templates              │ 1.3 MB        │ PostgreSQL + Redis cache    │
│  Audit Logs (30 day)    │ 12 TB         │ ClickHouse / S3 + Athena    │
│  Audit Archive (7 year) │ 1 PB          │ S3 Glacier                  │
│                                                                       │
│  Total Active: ~30 TB                                                 │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
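
The figures in this summary can be re-derived with a few lines of arithmetic. The script below is a rough check, not a sizing tool: it reuses the per-record sizes and volumes quoted above, assumes decimal units (1 GB = 10^9 bytes), and ignores compression and replication. The function and key names are made up for illustration.

```python
# Back-of-the-envelope storage check using the figures quoted above.
# Assumptions: decimal units (1 GB = 1e9 bytes), no compression, no replication.

GB = 10**9
TB = 10**12

USERS = 50_000_000
NOTIFICATIONS_PER_DAY = 500_000_000


def estimate_storage() -> dict:
    prefs_bytes = USERS * 230                             # ~230 bytes per user
    tokens_bytes = USERS * 2 * 324                        # avg 2 devices per user
    templates_bytes = 500 * 2_600                         # ~500 templates at ~2.6 KB
    audit_daily_bytes = NOTIFICATIONS_PER_DAY * 3 * 270   # 3 audit events each

    return {
        "preferences_gb": prefs_bytes / GB,
        "device_tokens_gb": tokens_bytes / GB,
        "templates_mb": templates_bytes / 10**6,
        "audit_daily_gb": audit_daily_bytes / GB,
        "audit_30_day_tb": audit_daily_bytes * 30 / TB,
        "audit_yearly_tb": audit_daily_bytes * 365 / TB,
        "audit_7_year_tb": audit_daily_bytes * 365 * 7 / TB,
    }


if __name__ == "__main__":
    for name, value in estimate_storage().items():
        print(f"{name:>18}: {value:,.1f}")
```

The script yields about 405 GB/day of audit data. The text rounds that to 400 GB before multiplying, which is why its yearly figure (146 TB) is slightly lower than the unrounded result.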

6.3 Provider Cost Estimation

MONTHLY PROVIDER COSTS

Push (FCM + APNs):
├── Cost: Free
├── Infrastructure cost only
└── ~$0 (included in compute)

Email (SendGrid/SES):
├── 125M emails/day × 30 = 3.75B/month
├── At $0.10/1000 = $375,000/month
└── With volume discount: ~$150,000/month

SMS (Twilio):
├── 25M SMS/day × 30 = 750M/month
├── At $0.01/SMS = $7,500,000/month (!)
├── With negotiated rates: ~$3,000,000/month
└── This is why SMS is used sparingly

In-App:
├── Cost: Infrastructure only
└── WebSocket servers: ~$10,000/month
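
The list-price numbers above follow directly from volume × unit price. A small sketch of that calculation, using only the prices quoted in this section (real pricing varies by provider, region, and contract; the helper name is hypothetical):

```python
# Monthly provider cost check at list price.
# Unit prices are the ones quoted above, not current provider pricing.

DAYS_PER_MONTH = 30


def monthly_cost(messages_per_day: int, price_per_message: float) -> float:
    """Return the list-price monthly cost in dollars."""
    return messages_per_day * DAYS_PER_MONTH * price_per_message


email_cost = monthly_cost(125_000_000, 0.10 / 1000)  # $0.10 per 1,000 emails
sms_cost = monthly_cost(25_000_000, 0.01)            # $0.01 per SMS

if __name__ == "__main__":
    print(f"Email: ${email_cost:,.0f}/month")
    print(f"SMS:   ${sms_cost:,.0f}/month")
```

At these prices SMS costs 20 times as much as email per month while carrying one fifth of the volume, which is the motivation for the strategies that follow.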


COST OPTIMIZATION STRATEGIES:

1. Minimize SMS
   ├── Use push as primary
   ├── SMS only for critical + push failure
   └── Potential savings: $2M+/month

2. Email batching
   ├── Digest emails instead of individual
   └── Potential savings: 30-50%

3. Multi-provider strategy
   ├── Route based on cost + reliability
   └── Negotiate volume discounts

4. Smart channel selection
   ├── Push first (free)
   ├── Email second (cheap)
   └── SMS last resort (expensive)
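
Smart channel selection can be expressed as a small ordering function. This is a minimal sketch under two assumptions that are not stated above: that SMS is reserved for critical notifications, and that per-message costs are the illustrative list prices from this section. `CHANNEL_COST` and `plan_channels` are hypothetical names.

```python
from __future__ import annotations

# Cost-aware channel ordering: cheapest channel first, SMS only as a last
# resort for critical notifications. Costs per message are illustrative.
CHANNEL_COST = {"push": 0.0, "email": 0.0001, "sms": 0.01}


def plan_channels(enabled: set[str], critical: bool) -> list[str]:
    """Return the channels to try, in order, for one notification."""
    candidates = [c for c in enabled if c in CHANNEL_COST]
    if not critical:
        # Assumption: non-critical traffic never falls back to SMS.
        candidates = [c for c in candidates if c != "sms"]
    return sorted(candidates, key=CHANNEL_COST.__getitem__)


if __name__ == "__main__":
    print(plan_channels({"sms", "push", "email"}, critical=True))
    print(plan_channels({"sms", "push", "email"}, critical=False))
```

A real router would also weigh provider reliability and user preferences; cost is only one of the inputs.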

Part IV: High-Level Architecture

Chapter 7: System Architecture Design

You: "Now let me design the high-level architecture."

7.1 Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────────┐
│                         NOTIFICATION PLATFORM ARCHITECTURE                      │
│                                                                                 │
│                                                                                 │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                      │
│   │  Payment     │    │  Security    │    │  Marketing   │                      │
│   │  Service     │    │  Service     │    │  Service     │                      │
│   └──────┬───────┘    └──────┬───────┘    └──────┬───────┘                      │
│          │                   │                   │                              │
│          │    INTERNAL EVENTS (Kafka)            │                              │
│          └───────────────────┼───────────────────┘                              │
│                              │                                                  │
│                              ▼                                                  │
│   ┌────────────────────────────────────────────────────────────────────────┐    │
│   │                      NOTIFICATION SERVICE                              │    │
│   │  ┌───────────────────────────────────────────────────────────────────┐ │    │
│   │  │                         API LAYER                                 │ │    │
│   │  │  ┌──────────┐  ┌───────────┐  ┌──────────┐  ┌──────────┐          │ │    │
│   │  │  │  Send    │  │Preferences│  │  Status  │  │ Template │          │ │    │
│   │  │  │  API     │  │   API     │  │   API    │  │   API    │          │ │    │
│   │  │  └──────────┘  └───────────┘  └──────────┘  └──────────┘          │ │    │
│   │  └───────────────────────────────────────────────────────────────────┘ │    │
│   │                              │                                         │    │
│   │                              ▼                                         │    │
│   │  ┌───────────────────────────────────────────────────────────────────┐ │    │
│   │  │                    PROCESSING LAYER                               │ │    │
│   │  │                                                                   │ │    │
│   │  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │ │    │
│   │  │  │  Validator   │  │   Router     │  │   Enricher   │             │ │    │
│   │  │  │              │  │              │  │              │             │ │    │
│   │  │  └──────────────┘  └──────────────┘  └──────────────┘             │ │    │
│   │  │                                                                   │ │    │
│   │  └───────────────────────────────────────────────────────────────────┘ │    │
│   └────────────────────────────────────────────────────────────────────────┘    │
│                              │                                                  │
│                              ▼                                                  │
│          ┌─────────────────────────────────────────────────────┐                │
│          │              KAFKA (Notification Topics)            │                │
│          │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐    │                │
│          │  │CRITICAL │ │  HIGH   │ │ MEDIUM  │ │   LOW   │    │                │
│          │  │ Topic   │ │  Topic  │ │  Topic  │ │  Topic  │    │                │
│          │  └─────────┘ └─────────┘ └─────────┘ └─────────┘    │                │
│          └─────────────────────────────────────────────────────┘                │
│                              │                                                  │
│              ┌───────────────┼────────────────┬──────────────────┐              │
│              │               │                │                  │              │
│              ▼               ▼                ▼                  ▼              │
│   ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐           │
│   │    Push      │ │    Email     │ │     SMS      │ │   In-App     │           │
│   │   Workers    │ │   Workers    │ │   Workers    │ │   Workers    │           │
│   └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘           │
│          │                │                │                │                   │
│          ▼                ▼                ▼                ▼                   │
│   ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐           │
│   │  FCM / APNs  │ │   SendGrid   │ │    Twilio    │ │  WebSocket   │           │
│   │              │ │   / SES      │ │              │ │   Gateway    │           │
│   └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘           │
│                                                                                 │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │                           DATA LAYER                                  │     │
│   │                                                                       │     │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐   │     │
│   │  │ PostgreSQL  │  │    Redis    │  │ ClickHouse  │  │     S3      │   │     │
│   │  │(Preferences,│  │  (Cache,    │  │  (Audit,    │  │ (Archives,  │   │     │
│   │  │ Tokens)     │  │   Queues)   │  │  Analytics) │  │  Templates) │   │     │
│   │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘   │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘

7.2 Component Responsibilities

COMPONENT BREAKDOWN

┌────────────────────────────────────────────────────────────────────────┐
│                    API LAYER                                           │
│                                                                        │
│  Send API:                                                             │
│  ├── Accept notification requests from internal services               │
│  ├── Validate request payload                                          │
│  ├── Generate notification ID (idempotency key)                        │
│  ├── Publish to transactional outbox                                   │
│  └── Return notification ID immediately                                │
│                                                                        │
│  Preferences API:                                                      │
│  ├── CRUD for user notification preferences                            │
│  ├── Opt-in/opt-out management                                         │
│  ├── Quiet hours configuration                                         │
│  └── Channel preference management                                     │
│                                                                        │
│  Status API:                                                           │
│  ├── Query notification delivery status                                │
│  ├── List notifications for a user                                     │
│  └── Retry failed notifications                                        │
│                                                                        │
│  Template API:                                                         │
│  ├── CRUD for notification templates                                   │
│  ├── Template versioning                                               │
│  ├── Preview and validation                                            │
│  └── Localization management                                           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
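
The Send API's "publish to transactional outbox" step is worth seeing in code, because the guarantee comes from writing the notification row and the outbox row in one database transaction. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL and a plain callback as a stand-in for the Kafka producer. The table layout is simplified and all function names are hypothetical.

```python
import json
import sqlite3
import uuid
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE notifications (
            notification_id TEXT PRIMARY KEY,
            user_id TEXT NOT NULL,
            status TEXT NOT NULL
        );
        CREATE TABLE outbox (
            outbox_id INTEGER PRIMARY KEY AUTOINCREMENT,
            notification_id TEXT NOT NULL,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def accept_notification(conn: sqlite3.Connection, user_id: str,
                        topic: str, body: dict) -> str:
    """Persist the notification and its outbox entry atomically."""
    notification_id = str(uuid.uuid4())
    payload = json.dumps(
        {"notification_id": notification_id, "user_id": user_id, **body}
    )
    with conn:  # one transaction: both rows commit or neither does
        conn.execute(
            "INSERT INTO notifications VALUES (?, ?, 'pending')",
            (notification_id, user_id),
        )
        conn.execute(
            "INSERT INTO outbox (notification_id, topic, payload) VALUES (?, ?, ?)",
            (notification_id, topic, payload),
        )
    return notification_id  # returned to the caller immediately


def drain_outbox(conn: sqlite3.Connection,
                 publish: Callable[[str, str], None],
                 batch_size: int = 100) -> int:
    """Publish unpublished rows in order; return how many were handled."""
    rows = conn.execute(
        "SELECT outbox_id, topic, payload FROM outbox "
        "WHERE published = 0 ORDER BY outbox_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for outbox_id, topic, payload in rows:
        publish(topic, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE outbox_id = ?",
                (outbox_id,),
            )
    return len(rows)
```

If the process crashes after `publish` but before the row is marked, the message is published again on the next run. The outbox therefore gives at-least-once delivery, and downstream consumers need to deduplicate on `notification_id`.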

┌────────────────────────────────────────────────────────────────────────┐
│                    PROCESSING LAYER                                    │
│                                                                        │
│  Validator:                                                            │
│  ├── Validate notification payload                                     │
│  ├── Check user exists                                                 │
│  ├── Verify template exists                                            │
│  └── Apply rate limiting                                               │
│                                                                        │
│  Router:                                                               │
│  ├── Fetch user preferences                                            │
│  ├── Determine target channels                                         │
│  ├── Apply quiet hours rules                                           │
│  ├── Check channel availability                                        │
│  └── Create per-channel notification tasks                             │
│                                                                        │
│  Enricher:                                                             │
│  ├── Fetch template                                                    │
│  ├── Apply template variables                                          │
│  ├── Localize content                                                  │
│  ├── Personalize message                                               │
│  └── Fetch device tokens (for push)                                    │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
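
The Router's "determine target channels" and "apply quiet hours rules" steps can be sketched as pure functions over a preferences dictionary. The dictionary shape mirrors the channel and category preference JSON used in this design. The quiet-hours policy (only email and in-app are allowed for non-critical notifications) is an assumption made for this example, as are the function names.

```python
from __future__ import annotations

from datetime import time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if now is in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # e.g. 22:00 -> 07:00


def route(prefs: dict, category: str, priority: str, local_time: time) -> list[str]:
    """Return the channels a notification may use, in category order."""
    if not prefs.get("notifications_enabled", True):
        return []
    cat = prefs["category_preferences"].get(category, {})
    if not cat.get("enabled", False):
        return []
    channels = [
        ch for ch in cat.get("channels", [])
        if prefs["channel_preferences"].get(ch, {}).get("enabled", False)
    ]
    quiet = prefs.get("quiet_hours_enabled", False) and in_quiet_hours(
        local_time, prefs["quiet_hours_start"], prefs["quiet_hours_end"]
    )
    if quiet and priority != "critical":
        # Assumption: only non-intrusive channels are used during quiet hours.
        channels = [ch for ch in channels if ch in ("email", "in_app")]
    return channels
```

`local_time` must already be converted to the user's timezone. Comparing a UTC clock against the user's quiet-hours window is a common source of notifications sent in the middle of the night.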

┌────────────────────────────────────────────────────────────────────────┐
│                    DELIVERY LAYER (Workers)                            │
│                                                                        │
│  Push Workers:                                                         │
│  ├── Consume from push topic                                           │
│  ├── Format for FCM/APNs                                               │
│  ├── Send to provider                                                  │
│  ├── Handle responses (success, invalid token, rate limit)             │
│  ├── Update delivery status                                            │
│  └── Publish to DLQ on persistent failure                              │
│                                                                        │
│  Email Workers:                                                        │
│  ├── Consume from email topic                                          │
│  ├── Render HTML template                                              │
│  ├── Send via provider (SendGrid/SES)                                  │
│  ├── Handle bounces and complaints                                     │
│  └── Update delivery status                                            │
│                                                                        │
│  SMS Workers:                                                          │
│  ├── Consume from SMS topic                                            │
│  ├── Format message (character limits)                                 │
│  ├── Check phone number validity                                       │
│  ├── Send via provider (Twilio)                                        │
│  └── Handle delivery receipts                                          │
│                                                                        │
│  In-App Workers:                                                       │
│  ├── Consume from in-app topic                                         │
│  ├── Store in notification center                                      │
│  ├── Push via WebSocket (if user online)                               │
│  └── Update read/unread state                                          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
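
The Push Worker's "handle responses" step is essentially a decision table: which outcomes are final, which are retryable, and when to give up. A minimal sketch follows. The outcome strings are hypothetical normalized values, not actual FCM or APNs error codes, and the retry budget and backoff parameters are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass

MAX_ATTEMPTS = 5  # hypothetical retry budget


@dataclass
class Action:
    kind: str                  # "record_success" | "invalidate_token" | "retry" | "dlq"
    delay_seconds: float = 0.0


def backoff(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^(attempt - 1), capped at `cap` seconds."""
    return min(cap, base * 2 ** (attempt - 1))


def handle_push_result(result: str, attempt: int) -> Action:
    """Map a normalized provider outcome to the worker's next step."""
    if result == "success":
        return Action("record_success")
    if result == "invalid_token":
        # Retrying a dead token never helps; mark it invalid instead.
        return Action("invalidate_token")
    if result in ("rate_limited", "unavailable"):
        if attempt >= MAX_ATTEMPTS:
            return Action("dlq")
        return Action("retry", backoff(attempt))
    return Action("dlq")  # unknown outcomes go to the DLQ for inspection
```

Production workers usually add random jitter to the delay so that many workers throttled at the same moment do not retry in lockstep.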

7.3 Data Flow

NOTIFICATION FLOW (Detailed)

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  1. INGESTION                                                          │
│     Payment Service ──▶ Send API                                       │
│     {                                                                  │
│       "user_id": "user-123",                                           │
│       "type": "TRANSACTION",                                           │
│       "template": "payment_received",                                  │
│       "variables": {"amount": "$50.00", "sender": "John"}              │
│     }                                                                  │
│                                                                        │
│  2. VALIDATION & PERSISTENCE                                           │
│     Send API:                                                          │
│     ├── Generate notification_id                                       │
│     ├── Validate payload                                               │
│     ├── Write to outbox table (same transaction as notification)       │
│     └── Return notification_id                                         │
│                                                                        │
│  3. RELIABLE PUBLISHING                                                │
│     Outbox Processor (runs every 100ms):                               │
│     ├── Read unpublished from outbox                                   │
│     ├── Publish to Kafka (HIGH priority topic)                         │
│     └── Mark as published                                              │
│                                                                        │
│  4. ROUTING                                                            │
│     Router Consumer:                                                   │
│     ├── Consume from Kafka                                             │
│     ├── Fetch user preferences (Redis cache → PostgreSQL)              │
│     ├── User wants: Push (primary), Email (fallback)                   │
│     ├── Check: Push enabled? Yes. Quiet hours? No.                     │
│     ├── Create push task → publish to push topic                       │
│     └── Store fallback config for potential retry                      │
│                                                                        │
│  5. ENRICHMENT                                                         │
│     Enricher (part of channel worker):                                 │
│     ├── Fetch template "payment_received" (Redis cache)                │
│     ├── Apply variables: "You received $50.00 from John"               │
│     ├── Fetch device tokens for user-123                               │
│     └── Create final push payload                                      │
│                                                                        │
│  6. DELIVERY                                                           │
│     Push Worker:                                                       │
│     ├── Send to FCM                                                    │
│     ├── FCM returns: success (message_id: xyz)                         │
│     ├── Update status: SENT (accepted by FCM)                          │
│     └── Write audit log                                                │
│                                                                        │
│  7. CALLBACK (if applicable)                                           │
│     FCM Callback (async):                                              │
│     ├── Delivery receipt received                                      │
│     └── Update status: RECEIVED_BY_DEVICE                              │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
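
The enrichment step (step 5) turns a template name plus variables into the final message. Here is a small sketch using Python's standard `string.Template`. The `${name}` placeholder syntax, the in-memory `TEMPLATES` store, and the fallback to `en-US` are assumptions made for this example; the real templates would come from the Redis-cached template store.

```python
from string import Template

# Hypothetical in-memory store standing in for the cached templates.
TEMPLATES = {
    ("payment_received", "en-US"): "You received ${amount} from ${sender}",
}


def render(template_name: str, locale: str, variables: dict) -> str:
    """Look up a template (falling back to en-US) and substitute variables."""
    source = TEMPLATES.get((template_name, locale))
    if source is None:
        source = TEMPLATES[(template_name, "en-US")]
    try:
        return Template(source).substitute(variables)
    except KeyError as missing:
        # Fail loudly rather than send "You received ${amount} from John".
        raise ValueError(f"missing template variable: {missing}") from None


if __name__ == "__main__":
    print(render("payment_received", "en-US",
                 {"amount": "$50.00", "sender": "John"}))
    # prints: You received $50.00 from John
```

Rejecting a render with missing variables is deliberate: a half-filled template reaching a user is usually worse than a failed notification that lands in the DLQ.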


FAILURE FLOW (Push fails, fallback to Email)

┌────────────────────────────────────────────────────────────────────────┐
│                                                                        │
│  6. DELIVERY (failure)                                                 │
│     Push Worker:                                                       │
│     ├── Send to FCM                                                    │
│     ├── FCM returns: InvalidToken                                      │
│     ├── Mark device token as invalid                                   │
│     ├── Check: Has fallback? Yes (Email)                               │
│     ├── Publish to email topic                                         │
│     └── Update status: PUSH_FAILED_FALLBACK_EMAIL                      │
│                                                                        │
│  7. FALLBACK DELIVERY                                                  │
│     Email Worker:                                                      │
│     ├── Consume from email topic                                       │
│     ├── Render email template                                          │
│     ├── Send via SendGrid                                              │
│     ├── SendGrid returns: queued                                       │
│     └── Update status: EMAIL_QUEUED                                    │
│                                                                        │
│  8. EMAIL CALLBACK (async, hours later)                                │
│     SendGrid Webhook:                                                  │
│     ├── Event: delivered                                               │
│     └── Update status: DELIVERED_VIA_EMAIL                             │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
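
The failure flow boils down to trying channels in preference order and recording a status at each step. The sketch below does this synchronously with fake sender callbacks. In the design above the hand-off happens asynchronously through the email topic, so treat this as an illustration of the control flow only. `Sender` and `deliver_with_fallback` are hypothetical names.

```python
from __future__ import annotations

from typing import Callable

# (user_id, message) -> True if the provider accepted the message
Sender = Callable[[str, str], bool]


def deliver_with_fallback(
    user_id: str,
    message: str,
    channels: list[str],
    senders: dict[str, Sender],
) -> tuple[str | None, list[str]]:
    """Try channels in order; return (accepting channel, status history)."""
    history: list[str] = []
    for i, channel in enumerate(channels):
        if senders[channel](user_id, message):
            history.append(f"{channel.upper()}_QUEUED")
            return channel, history
        nxt = channels[i + 1] if i + 1 < len(channels) else None
        if nxt:
            history.append(f"{channel.upper()}_FAILED_FALLBACK_{nxt.upper()}")
        else:
            history.append(f"{channel.upper()}_FAILED")
    return None, history


if __name__ == "__main__":
    senders = {
        "push": lambda user, msg: False,   # simulate an invalid token
        "email": lambda user, msg: True,   # provider accepts the email
    }
    print(deliver_with_fallback("user-123", "You received $50.00 from John",
                                ["push", "email"], senders))
```

The status history is what the Status API would expose. The final confirmed-delivery status still arrives later through the provider webhook.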

Chapter 8: Technology Choices

8.1 Technology Decision Matrix

TECHNOLOGY CHOICES WITH JUSTIFICATION

┌─────────────────────────────────────────────────────────────────────────┐
│  Component          │ Technology      │ Justification                   │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Primary Database   │ PostgreSQL      │ ACID for preferences, mature,   │
│                     │                 │ excellent JSON support          │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Message Queue      │ Apache Kafka    │ High throughput, partitioning,  │
│                     │                 │ replay, per-partition ordering  │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Cache              │ Redis Cluster   │ Sub-ms latency, pub/sub for     │
│                     │                 │ real-time, rate limiting        │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Analytics/Audit    │ ClickHouse      │ Columnar, excellent for time    │
│                     │                 │ series, fast aggregations       │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Archive Storage    │ S3 + Glacier    │ Cost-effective, durable,        │
│                     │                 │ compliance-ready                │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Real-time          │ Custom WebSocket│ Full control, lower latency     │
│                     │ Gateway         │ than third-party solutions      │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Push Providers     │ FCM + APNs      │ Official, required for          │
│                     │                 │ iOS/Android                     │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Email Provider     │ SendGrid        │ High deliverability, good API,  │
│                     │ (primary)       │ webhook support                 │
│                     │ SES (backup)    │ Cost-effective backup           │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  SMS Provider       │ Twilio          │ Global coverage, good API,      │
│                     │ (primary)       │ delivery receipts               │
│                     │ SNS (backup)    │ Cost-effective for US           │
├─────────────────────┼─────────────────┼─────────────────────────────────┤
│  Service Framework  │ FastAPI/Python  │ Async support, good for I/O     │
│                     │ or Go           │ bound work, team familiarity    │
└─────────────────────┴─────────────────┴─────────────────────────────────┘

8.2 Why Kafka?

WHY KAFKA FOR NOTIFICATION PIPELINE

Requirements:
├── High throughput (10K+ messages/sec)
├── Multiple consumers per message type
├── Replay capability for debugging
├── Ordering within user/notification
└── Multiple priority levels

Kafka fits because:
├── Partitioning: Partition by user_id for ordering
├── Consumer groups: Multiple workers per channel
├── Retention: Replay failed notifications
├── Topics: Separate by priority (critical, high, medium, low)
└── Throughput: Millions of messages/sec

Alternative considered: RabbitMQ
├── Better for low-latency, complex routing
├── But: No replay with classic queues, harder to scale horizontally
└── Verdict: Kafka better for our volume


KAFKA TOPIC DESIGN:

notifications.critical    (partition by user_id, 32 partitions)
├── Security alerts
└── Fraud notifications

notifications.high        (partition by user_id, 64 partitions)
├── Transaction notifications
└── Payment confirmations

notifications.medium      (partition by user_id, 32 partitions)
├── Social notifications
└── Reminders

notifications.low         (partition by user_id, 16 partitions)
├── Marketing
└── Newsletters

channel.push             (partition by device_token hash, 128 partitions)
channel.email            (partition by user_id, 64 partitions)
channel.sms              (partition by phone_number, 32 partitions)
channel.inapp            (partition by user_id, 32 partitions)

dlq.push                 (dead letter queue for failed push)
dlq.email                (dead letter queue for failed email)
dlq.sms                  (dead letter queue for failed SMS)
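
To make the topic design concrete, here is a sketch of how a producer might pick the priority topic from the notification type and derive a stable partition from `user_id`. The type-to-topic mapping follows the lists above. The hash is a stand-in: Kafka's default partitioner hashes the key bytes with murmur2, whereas this example uses MD5 from the standard library only to show that equal keys always land on the same partition. In practice you would set the message key to `user_id` and let the client library choose the partition.

```python
from __future__ import annotations

import hashlib

PRIORITY_TOPIC = {
    "security": "notifications.critical",
    "transaction": "notifications.high",
    "social": "notifications.medium",
    "reminder": "notifications.medium",
    "marketing": "notifications.low",
}

PARTITIONS = {
    "notifications.critical": 32,
    "notifications.high": 64,
    "notifications.medium": 32,
    "notifications.low": 16,
}


def partition_for(key: str, num_partitions: int) -> int:
    """Stable key -> partition mapping (MD5 stand-in for Kafka's murmur2)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def route_to_topic(notification_type: str, user_id: str) -> tuple[str, int]:
    """Return (topic, partition) for one notification."""
    topic = PRIORITY_TOPIC[notification_type]
    return topic, partition_for(user_id, PARTITIONS[topic])


if __name__ == "__main__":
    print(route_to_topic("transaction", "user-123"))
```

Because the partition depends only on the key, all notifications for one user on a given topic are consumed in order. Ordering across different topics (for example, a `high` and a `low` notification for the same user) is not guaranteed.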

Part V: Schema Design

Chapter 9: Database Schema

9.1 Core Tables

-- =============================================================================
-- NOTIFICATION PLATFORM SCHEMA
-- =============================================================================

-- -----------------------------------------------------------------------------
-- Users & Preferences
-- -----------------------------------------------------------------------------

CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email VARCHAR(255),
    phone_number VARCHAR(20),
    timezone VARCHAR(50) DEFAULT 'UTC',
    locale VARCHAR(10) DEFAULT 'en-US',
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_phone ON users(phone_number);


CREATE TABLE user_preferences (
    user_id UUID PRIMARY KEY REFERENCES users(user_id),
    
    -- Global settings
    notifications_enabled BOOLEAN DEFAULT true,
    
    -- Channel preferences (JSON for flexibility)
    channel_preferences JSONB DEFAULT '{
        "push": {"enabled": true, "priority": 1},
        "email": {"enabled": true, "priority": 2},
        "sms": {"enabled": true, "priority": 3},
        "in_app": {"enabled": true, "priority": 1}
    }',
    
    -- Category preferences (JSON)
    category_preferences JSONB DEFAULT '{
        "transaction": {"enabled": true, "channels": ["push", "email"]},
        "security": {"enabled": true, "channels": ["push", "sms", "email"]},
        "marketing": {"enabled": true, "channels": ["email"]},
        "social": {"enabled": true, "channels": ["push", "in_app"]},
        "reminder": {"enabled": true, "channels": ["push", "email"]}
    }',
    
    -- Quiet hours
    quiet_hours_enabled BOOLEAN DEFAULT false,
    quiet_hours_start TIME,
    quiet_hours_end TIME,
    
    -- Frequency caps
    max_notifications_per_day INTEGER,
    max_marketing_per_week INTEGER DEFAULT 3,
    
    -- Version for optimistic locking (Week 5: Consistency)
    version INTEGER DEFAULT 1,
    
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);


-- -----------------------------------------------------------------------------
-- Device Tokens (for Push Notifications)
-- -----------------------------------------------------------------------------

CREATE TYPE device_platform AS ENUM ('ios', 'android', 'web');
CREATE TYPE token_status AS ENUM ('active', 'invalid', 'expired');

CREATE TABLE device_tokens (
    token_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(user_id),
    
    platform device_platform NOT NULL,
    token VARCHAR(512) NOT NULL,  -- FCM/APNs token
    
    -- Metadata
    device_id VARCHAR(255),
    app_version VARCHAR(20),
    os_version VARCHAR(20),
    device_model VARCHAR(100),
    
    -- Status tracking
    status token_status DEFAULT 'active',
    last_used_at TIMESTAMP,
    invalidated_at TIMESTAMP,
    invalidation_reason VARCHAR(255),
    
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    
    UNIQUE(user_id, token)
);

CREATE INDEX idx_device_tokens_user ON device_tokens(user_id);
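-- Partial index: look up a user's active tokens without scanning invalid/expired ones.
-- (Indexing the status column itself adds nothing when the predicate already fixes its value.)
CREATE INDEX idx_device_tokens_status ON device_tokens(user_id) WHERE status = 'active';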
CREATE INDEX idx_device_tokens_token ON device_tokens(token);


-- -----------------------------------------------------------------------------
-- Email Addresses (with validation status)
-- -----------------------------------------------------------------------------

CREATE TYPE email_status AS ENUM ('active', 'bounced', 'complained', 'unsubscribed');

CREATE TABLE user_emails (
    email_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(user_id),
    email VARCHAR(255) NOT NULL,
    
    -- Status
    status email_status DEFAULT 'active',
    verified BOOLEAN DEFAULT false,
    verified_at TIMESTAMP,
    
    -- Bounce tracking
    bounce_count INTEGER DEFAULT 0,
    last_bounce_at TIMESTAMP,
    bounce_type VARCHAR(50),  -- hard, soft
    
    -- Engagement tracking
    last_sent_at TIMESTAMP,
    last_opened_at TIMESTAMP,
    last_clicked_at TIMESTAMP,
    
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_user_emails_user ON user_emails(user_id);
CREATE INDEX idx_user_emails_email ON user_emails(email);
CREATE INDEX idx_user_emails_status ON user_emails(status);


-- -----------------------------------------------------------------------------
-- Phone Numbers (with validation status)
-- -----------------------------------------------------------------------------

CREATE TYPE phone_status AS ENUM ('active', 'invalid', 'opted_out');

CREATE TABLE user_phones (
    phone_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(user_id),
    phone_number VARCHAR(20) NOT NULL,  -- E.164 format
    country_code VARCHAR(3),
    
    -- Status
    status phone_status DEFAULT 'active',
    verified BOOLEAN DEFAULT false,
    verified_at TIMESTAMP,
    
    -- Carrier info (for routing)
    carrier VARCHAR(100),
    phone_type VARCHAR(20),  -- mobile, landline, voip
    
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_user_phones_user ON user_phones(user_id);
CREATE INDEX idx_user_phones_number ON user_phones(phone_number);


-- -----------------------------------------------------------------------------
-- Templates
-- -----------------------------------------------------------------------------

CREATE TYPE template_channel AS ENUM ('push', 'email', 'sms', 'in_app', 'whatsapp');

CREATE TABLE templates (
    template_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    
    -- Identification
    name VARCHAR(100) NOT NULL,  -- e.g., "payment_received"
    channel template_channel NOT NULL,
    version INTEGER DEFAULT 1,
    
    -- Content
    subject VARCHAR(255),  -- For email
    title VARCHAR(255),    -- For push/in-app
    body TEXT NOT NULL,
    
    -- For rich content
    html_body TEXT,        -- For email HTML
    image_url VARCHAR(500),
    action_url VARCHAR(500),
    action_buttons JSONB,  -- For push actions
    
    -- Variables schema (for validation)
    variables_schema JSONB,
    
    -- Localization
    locale VARCHAR(10) DEFAULT 'en-US',
    
    -- Metadata
    category VARCHAR(50),
    is_active BOOLEAN DEFAULT true,
    
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    
    UNIQUE(name, channel, locale, version)
);

CREATE INDEX idx_templates_name ON templates(name);
CREATE INDEX idx_templates_channel ON templates(channel);
-- Partial index: resolve the active template for a (name, channel, locale) lookup.
CREATE INDEX idx_templates_active ON templates(name, channel, locale) WHERE is_active = true;


-- -----------------------------------------------------------------------------
-- Notifications (Partitioned by created_at for performance)
-- -----------------------------------------------------------------------------

CREATE TYPE notification_type AS ENUM (
    'transaction', 'security', 'marketing', 'reminder', 'social'
);

CREATE TYPE notification_priority AS ENUM ('critical', 'high', 'medium', 'low');

CREATE TYPE notification_status AS ENUM (
    'pending',           -- Just created
    'queued',            -- In Kafka
    'processing',        -- Being processed
    'sent',              -- Sent to provider
    'delivered',         -- Confirmed delivered
    'failed',            -- Delivery failed
    'bounced',           -- Email bounced
    'clicked',           -- User clicked
    'read'               -- User read (in-app)
);

CREATE TABLE notifications (
    notification_id UUID NOT NULL,
    user_id UUID NOT NULL,
    
    -- Classification
    type notification_type NOT NULL,
    priority notification_priority NOT NULL,
    category VARCHAR(50),
    
    -- Content
    template_id UUID REFERENCES templates(template_id),
    variables JSONB,
    
    -- Rendered content (cached after first render)
    rendered_title VARCHAR(255),
    rendered_body TEXT,
    
    -- Delivery
    channel template_channel,
    status notification_status DEFAULT 'pending',
    
    -- Tracking
    idempotency_key VARCHAR(255),  -- For deduplication
    external_id VARCHAR(255),      -- Provider's ID
    
    -- Scheduling
    scheduled_at TIMESTAMP,
    
    -- Timing
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    queued_at TIMESTAMP,
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    
    -- Error tracking
    error_code VARCHAR(50),
    error_message TEXT,
    retry_count INTEGER DEFAULT 0,
    
    -- Metadata
    metadata JSONB,
    
    PRIMARY KEY (notification_id, created_at)
) PARTITION BY RANGE (created_at);

-- Create partitions (monthly)
CREATE TABLE notifications_2024_01 PARTITION OF notifications
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE notifications_2024_02 PARTITION OF notifications
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- ... continue for each month

-- Indexes on partitioned table
CREATE INDEX idx_notifications_user ON notifications(user_id, created_at DESC);
CREATE INDEX idx_notifications_status ON notifications(status, created_at);
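-- Not UNIQUE: a unique index on a partitioned table must include the partition
-- key (created_at), so this index only speeds up lookups. Deduplication has to
-- be enforced elsewhere (application logic or a separate unpartitioned table).
CREATE INDEX idx_notifications_idempotency ON notifications(idempotency_key);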


-- -----------------------------------------------------------------------------
-- Transactional Outbox (Week 3: Reliable Publishing)
-- -----------------------------------------------------------------------------

CREATE TABLE notification_outbox (
    outbox_id BIGSERIAL PRIMARY KEY,
    notification_id UUID NOT NULL,
    
    -- Message content
    topic VARCHAR(100) NOT NULL,
    payload JSONB NOT NULL,
    
    -- Processing state
    published BOOLEAN DEFAULT false,
    published_at TIMESTAMP,
    
    -- Ordering
    partition_key VARCHAR(255),  -- Usually user_id
    
    created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_outbox_unpublished ON notification_outbox(created_at) 
    WHERE published = false;


-- -----------------------------------------------------------------------------
-- Notification Delivery Attempts (for debugging)
-- -----------------------------------------------------------------------------

CREATE TABLE delivery_attempts (
    attempt_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    notification_id UUID NOT NULL,
    
    channel template_channel NOT NULL,
    provider VARCHAR(50) NOT NULL,  -- fcm, apns, sendgrid, twilio
    
    -- Attempt details
    attempt_number INTEGER NOT NULL,
    
    -- Request/Response
    request_payload JSONB,
    response_code INTEGER,
    response_body TEXT,
    
    -- Timing
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    duration_ms INTEGER,
    
    -- Result
    success BOOLEAN,
    error_code VARCHAR(50),
    error_message TEXT
);

CREATE INDEX idx_delivery_attempts_notification ON delivery_attempts(notification_id);
CREATE INDEX idx_delivery_attempts_time ON delivery_attempts(started_at);


-- -----------------------------------------------------------------------------
-- Audit Log (for compliance) - Consider using ClickHouse for production
-- -----------------------------------------------------------------------------

CREATE TABLE audit_log (
    log_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    
    -- What happened
    action VARCHAR(50) NOT NULL,  -- created, sent, delivered, failed, etc.
    
    -- Who/what
    notification_id UUID,
    user_id UUID,
    
    -- Details
    details JSONB,
    
    -- Timing
    occurred_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_audit_log_notification ON audit_log(notification_id);
CREATE INDEX idx_audit_log_user ON audit_log(user_id);
CREATE INDEX idx_audit_log_time ON audit_log(occurred_at);


-- -----------------------------------------------------------------------------
-- Rate Limiting State (Redis is the primary store; PostgreSQL holds a backup copy)
-- -----------------------------------------------------------------------------

CREATE TABLE rate_limit_state (
    key VARCHAR(255) PRIMARY KEY,  -- e.g., "user:123:daily" or "provider:sendgrid:minute"
    count INTEGER NOT NULL DEFAULT 0,
    window_start TIMESTAMP NOT NULL,
    window_duration_seconds INTEGER NOT NULL
);
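
The "-- ... continue for each month" step in the partition section is usually automated rather than written by hand. As a minimal sketch, assuming a hypothetical helper named monthly_partition_ddl (not part of the schema above) that a scheduled job would call ahead of time, the statements could be generated like this:

```python
from datetime import date


def next_month(d: date) -> date:
    """Return the first day of the month after d."""
    year = d.year + 1 if d.month == 12 else d.year
    month = 1 if d.month == 12 else d.month + 1
    return date(year, month, 1)


def monthly_partition_ddl(table: str, start: date, months: int) -> list[str]:
    """Build CREATE TABLE ... PARTITION OF statements for consecutive months.

    Bounds mirror the schema above: FROM is inclusive, TO is exclusive.
    The helper name and signature are illustrative assumptions.
    """
    statements = []
    lower = date(start.year, start.month, 1)  # snap to the start of the month
    for _ in range(months):
        upper = next_month(lower)
        name = f"{table}_{lower:%Y_%m}"  # e.g. notifications_2024_03
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {table}\n"
            f"    FOR VALUES FROM ('{lower.isoformat()}') TO ('{upper.isoformat()}');"
        )
        lower = upper
    return statements


if __name__ == "__main__":
    # Print DDL for three months, crossing a year boundary.
    for stmt in monthly_partition_ddl("notifications", date(2024, 11, 1), 3):
        print(stmt)
```

Creating partitions a few months ahead avoids insert failures when a new month starts and no matching partition exists yet.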

9.2 Redis Schema

REDIS DATA STRUCTURES

User Preferences Cache:
  Key: pref:{user_id}
  Type: Hash
  TTL: 5 minutes
  Fields: global_enabled, channel_prefs (JSON), category_prefs (JSON), quiet_hours
  
  Example:
  HGETALL pref:user-123
  {
    "global_enabled": "true",
    "channel_prefs": "{...}",
    "category_prefs": "{...}",
    "quiet_hours_enabled": "false"
  }


Device Tokens Cache:
  Key: tokens:{user_id}
  Type: Set
  TTL: 1 hour
  Values: JSON objects with token details
  
  Example:
  SMEMBERS tokens:user-123
  [
    '{"token_id":"...", "platform":"ios", "token":"..."}',
    '{"token_id":"...", "platform":"android", "token":"..."}'
  ]


Template Cache:
  Key: template:{name}:{channel}:{locale}
  Type: String (JSON)
  TTL: 10 minutes
  
  Example:
  GET template:payment_received:push:en-US
  '{"template_id":"...", "title":"...", "body":"..."}'


Rate Limiting (Sliding Window):
  Key: ratelimit:{scope}:{identifier}:{window}
  Type: Sorted Set
  Score: Timestamp
  TTL: Window duration + buffer
  
  Example (user daily limit):
  ZADD ratelimit:user:user-123:daily {timestamp} {notification_id}
  ZCOUNT ratelimit:user:user-123:daily {window_start} +inf


In-App Notification Center:
  Key: inbox:{user_id}
  Type: List (capped)
  TTL: None (persistent)
  
  Example:
  LPUSH inbox:user-123 '{notification JSON}'
  LTRIM inbox:user-123 0 99  # Keep only the 100 newest entries


Unread Count:
  Key: unread:{user_id}
  Type: String (counter)
  
  Example:
  INCR unread:user-123
  GET unread:user-123


Online Users (for WebSocket routing):
  Key: online:{user_id}
  Type: String
  Value: WebSocket server ID
  TTL: 30 seconds (refreshed on each heartbeat)
  
  Example:
  SETEX online:user-123 30 "ws-server-1"


Provider Health:
  Key: provider:health:{provider_name}
  Type: Hash
  
  Example:
  HGETALL provider:health:sendgrid
  {
    "status": "healthy",
    "success_rate": "99.5",
    "avg_latency_ms": "150",
    "last_check": "2024-01-15T10:00:00Z"
  }
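
The sliding-window rate limit is the least obvious of these structures, so here is a minimal in-memory sketch of the same logic. It assumes a hypothetical class named SlidingWindowLimiter and uses a sorted Python list in place of the Redis sorted set; the comments map each step to the Redis command it stands in for.

```python
import bisect
from collections import defaultdict


class SlidingWindowLimiter:
    """In-memory stand-in for the sorted-set pattern above (illustrative only).

    Each key holds a sorted list of timestamps, which play the role of the
    sorted set's scores.
    """

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._events: dict[str, list[float]] = defaultdict(list)

    def allow(self, key: str, now: float) -> bool:
        events = self._events[key]
        window_start = now - self.window_seconds
        # Like ZREMRANGEBYSCORE: evict entries older than the window start.
        del events[: bisect.bisect_left(events, window_start)]
        # Like ZCOUNT key window_start +inf: everything left is in the window.
        if len(events) >= self.limit:
            return False
        # Like ZADD key now member: record the accepted request.
        bisect.insort(events, now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=2, window_seconds=60)
    key = "ratelimit:user:user-123:minute"
    for t in (0, 10, 20, 61):
        print(t, limiter.allow(key, now=t))
```

Against a real Redis instance the evict, count, and add steps would need to run atomically (for example in a transaction or a server-side script), otherwise two concurrent requests could both pass the check.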

Part VI: API Design

Chapter 10: API Specifications

10.1 Send Notification API

# POST /v1/notifications
# Send a notification to a user

Request:
  headers:
    X-Request-ID: string        # Idempotency key
    X-Priority: string          # Override default priority (optional)
    
  body:
    user_id: string             # Required: Target user
    type: string                # Required: transaction, security, marketing, etc.
    template: string            # Required: Template name
    variables: object           # Template variables
    channels: string[]          # Optional: Override default channels
    scheduled_at: datetime      # Optional: Schedule for later
    metadata: object            # Optional: Custom metadata
    
Example Request:
  POST /v1/notifications
  X-Request-ID: txn-12345-notification
  
  {
    "user_id": "user-123",
    "type": "transaction",
    "template": "payment_received",
    "variables": {
      "amount": "$50.00",
      "sender_name": "John Doe",
      "transaction_id": "txn-12345"
    },
    "metadata": {
      "source": "payment-service",
      "transaction_id": "txn-12345"
    }
  }

Response (202 Accepted):
  {
    "notification_id": "notif-abc123",
    "status": "queued",
    "created_at": "2024-01-15T10:00:00Z"
  }

Error Responses:
  400 Bad Request:
    - Invalid user_id
    - Unknown template
    - Missing required variables
    
  409 Conflict:
    - Duplicate idempotency key (returns existing notification)
    
  429 Too Many Requests:
    - Rate limit exceeded
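
To make the idempotency behaviour concrete, the sketch below models the 202, 400, and 409 paths with an in-memory store. NotificationStore and its send method are hypothetical names used for illustration; a real service would persist the record and publish through the outbox instead of a dictionary.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

REQUIRED_FIELDS = ("user_id", "type", "template")


@dataclass
class NotificationStore:
    """Hypothetical in-memory store keyed by the X-Request-ID header."""

    by_request_id: dict[str, dict] = field(default_factory=dict)

    def send(self, request_id: str, body: dict) -> tuple[int, dict]:
        """Return (HTTP status, response body) for a send request."""
        missing = [name for name in REQUIRED_FIELDS if name not in body]
        if missing:
            return 400, {"error": "missing required fields: " + ", ".join(missing)}

        # Duplicate idempotency key: hand back the notification created earlier.
        existing = self.by_request_id.get(request_id)
        if existing is not None:
            return 409, existing

        record = {
            "notification_id": f"notif-{uuid.uuid4().hex[:12]}",
            "status": "queued",
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        self.by_request_id[request_id] = record
        return 202, record


if __name__ == "__main__":
    store = NotificationStore()
    body = {"user_id": "user-123", "type": "transaction", "template": "payment_received"}
    print(store.send("txn-12345-notification", body))  # first call: accepted
    print(store.send("txn-12345-notification", body))  # retry: same notification
```

Because the retry returns the original record, a caller that timed out on the first attempt can safely resend without creating a second notification.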

10.2 Preferences API

# GET /v1/users/{user_id}/preferences
# Get user notification preferences

Response (200 OK):
  {
    "user_id": "user-123",
    "global_enabled": true,
    "channels": {
      "push": {"enabled": true, "priority": 1},
      "email": {"enabled": true, "priority": 2},
      "sms": {"enabled": true, "priority": 3}
    },
    "categories": {
      "transaction": {"enabled": true, "channels": ["push", "email"]},
      "security": {"enabled": true, "channels": ["push", "sms"]},
      "marketing": {"enabled": false, "channels": []}
    },
    "quiet_hours": {
      "enabled": true,
      "start": "22:00",
      "end": "08:00",
      "timezone": "America/New_York"
    },
    "version": 5,
    "updated_at": "2024-01-15T10:00:00Z"
  }


# PATCH /v1/users/{user_id}/preferences
# Update user preferences (partial update)

Request:
  headers:
    If-Match: "5"              # Optimistic locking version
    
  body:
    # Only include fields to update
    categories:
      marketing:
        enabled: false
    quiet_hours:
      enabled: true
      start: "23:00"

Response (200 OK):
  {
    "user_id": "user-123",
    "version": 6,
    "updated_at": "2024-01-15T10:05:00Z"
  }

Error Responses:
  412 Precondition Failed:
    - Version mismatch: If-Match does not match the current version (concurrent update)
  428 Precondition Required:
    - If-Match header missing
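
The version check behind the PATCH endpoint can be sketched as follows. This is a minimal illustration under stated assumptions: apply_patch, deep_merge, and the two exception classes are hypothetical names, and the mapping from exceptions to HTTP status codes is left to the API layer.

```python
import copy
from typing import Optional


class VersionConflict(Exception):
    """The supplied version no longer matches the stored one."""


class MissingVersion(Exception):
    """The caller did not supply a version (no If-Match header)."""


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a copy of base with patch applied; nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_patch(stored: dict, patch: dict, if_match: Optional[str]) -> dict:
    """Apply a partial update only if the caller saw the current version."""
    if if_match is None:
        raise MissingVersion("If-Match header is required")
    # The header value arrives quoted, e.g. '"5"'.
    if if_match.strip('"') != str(stored["version"]):
        raise VersionConflict(
            f"stored version is {stored['version']}, request sent {if_match}"
        )
    updated = deep_merge(stored, patch)
    updated["version"] = stored["version"] + 1
    return updated


if __name__ == "__main__":
    prefs = {"version": 5, "quiet_hours": {"enabled": False, "start": "22:00", "end": "08:00"}}
    print(apply_patch(prefs, {"quiet_hours": {"enabled": True, "start": "23:00"}}, '"5"'))
```

In a database-backed service the compare and the increment should happen in one statement (for example an UPDATE whose WHERE clause includes the expected version), so that two concurrent writers cannot both pass the check.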

10.3 Status API

# GET /v1/notifications/{notification_id}
# Get notification status

Response (200 OK):
  {
    "notification_id": "notif-abc123",
    "user_id": "user-123",
    "type": "transaction",
    "priority": "high",
    "status": "delivered",
    "channel": "push",
    "timeline": [
      {"status": "pending", "at": "2024-01-15T10:00:00.000Z"},
      {"status": "queued", "at": "2024-01-15T10:00:00.050Z"},
      {"status": "sent", "at": "2024-01-15T10:00:00.200Z"},
      {"status": "delivered", "at": "2024-01-15T10:00:01.500Z"}
    ],
    "created_at": "2024-01-15T10:00:00Z",
    "delivered_at": "2024-01-15T10:00:01.500Z"
  }


# GET /v1/users/{user_id}/notifications
# List notifications for a user

Query Parameters:
  status: string[]      # Filter by status
  type: string[]        # Filter by type
  since: datetime       # Notifications after this time
  limit: integer        # Max results (default 50, max 100)
  cursor: string        # Pagination cursor

Response (200 OK):
  {
    "notifications": [...],
    "cursor": "eyJsYXN0X2lkIjogIm5vdGlmLXh5eiJ9",
    "has_more": true
  }
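
The cursor in the example response is base64-encoded JSON; it decodes to {"last_id": "notif-xyz"}. A minimal sketch of opaque cursor pagination along those lines is shown below. The function names are assumptions for illustration, and the in-memory list stands in for a query ordered by creation time.

```python
import base64
import json
from typing import Optional


def encode_cursor(last_id: str) -> str:
    """Pack the last returned ID into an opaque, URL-safe token."""
    raw = json.dumps({"last_id": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> str:
    """Recover the last returned ID from a cursor token."""
    raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return json.loads(raw)["last_id"]


def list_page(ids: list[str], limit: int = 50, cursor: Optional[str] = None) -> dict:
    """Return one page of IDs; ids stands in for an ordered query result."""
    limit = max(1, min(limit, 100))  # default 50, capped at 100
    # Resume just after the ID stored in the cursor (raises if it is unknown).
    start = ids.index(decode_cursor(cursor)) + 1 if cursor else 0
    chunk = ids[start:start + limit]
    has_more = start + limit < len(ids)
    return {
        "notifications": chunk,
        "cursor": encode_cursor(chunk[-1]) if has_more else None,
        "has_more": has_more,
    }


if __name__ == "__main__":
    all_ids = [f"notif-{n}" for n in range(5)]
    first = list_page(all_ids, limit=2)
    print(first)
    print(list_page(all_ids, limit=2, cursor=first["cursor"]))
```

Keeping the cursor opaque lets the server change what it stores inside (for example adding a timestamp) without breaking clients, which only ever pass the token back unchanged.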

Summary

What We Covered Today

┌────────────────────────────────────────────────────────────────────────┐
│                    DAY 1 SUMMARY                                       │
│                                                                        │
│  INTERVIEW APPROACH                                                    │
│  ├── FERNS framework for requirements                                  │
│  ├── Don't start drawing immediately                                   │
│  ├── Clarifying questions dialogue                                     │
│  └── Summarize before designing                                        │
│                                                                        │
│  DOMAIN KNOWLEDGE                                                      │
│  ├── Notification types and priorities                                 │
│  ├── Channel characteristics (push, email, SMS, in-app)                │
│  ├── Provider comparison and costs                                     │
│  └── Regulatory compliance (CAN-SPAM, GDPR, TCPA)                      │
│                                                                        │
│  ESTIMATION                                                            │
│  ├── 500M notifications/day, 10K/sec peak                              │
│  ├── 30TB active storage                                               │
│  ├── SMS is expensive ($3M+/month)                                     │
│  └── Optimize for push > email > SMS                                   │
│                                                                        │
│  ARCHITECTURE                                                          │
│  ├── API Layer (Send, Preferences, Status, Templates)                  │
│  ├── Processing Layer (Validator, Router, Enricher)                    │
│  ├── Delivery Layer (Channel-specific workers)                         │
│  ├── Kafka for message queue (priority topics)                         │
│  └── Provider abstraction for flexibility                              │
│                                                                        │
│  SCHEMA DESIGN                                                         │
│  ├── PostgreSQL for core data (partitioned notifications)              │
│  ├── Redis for caching and real-time                                   │
│  ├── Transactional outbox for reliable publishing                      │
│  └── Optimistic locking for preferences                                │
│                                                                        │
│  WEEK 1-5 CONCEPTS APPLIED                                             │
│  ├── Partitioning: Kafka topics, PostgreSQL partitions                 │
│  ├── Transactional outbox: Reliable event publishing                   │
│  ├── Optimistic locking: Preference updates                            │
│  └── Priority queues: Critical > High > Medium > Low                   │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

What's Coming Tomorrow

Day 2: Core Notification Flow
Theme: "The happy path must be bulletproof"

We'll implement:

  • Complete notification ingestion with validation
  • Routing logic with preference handling
  • Provider integration (FCM, APNs, SendGrid, Twilio)
  • Delivery tracking and status management
  • Full code implementation

Interview Tip of the Day

┌────────────────────────────────────────────────────────────────────────┐
│                    INTERVIEW TIP                                       │
│                                                                        │
│  "THE 5-MINUTE RULE"                                                   │
│                                                                        │
│  Spend the first 5 minutes ONLY asking questions.                      │
│  Don't draw anything. Don't mention technology.                        │
│                                                                        │
│  Why?                                                                  │
│  1. Shows you think before acting (senior behavior)                    │
│  2. Uncovers hidden requirements                                       │
│  3. Prevents wasted time on wrong assumptions                          │
│  4. Builds rapport with interviewer                                    │
│                                                                        │
│  What to ask:                                                          │
│  - Scale (users, requests, data)                                       │
│  - Latency requirements                                                │
│  - Consistency requirements                                            │
│  - Failure handling priorities                                         │
│  - What's already built vs greenfield                                  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

End of Week 6, Day 1

Tomorrow: Day 2 — Core Notification Flow