Himanshu Kukreja

Week 6 Preview: Designing a Notification Platform

🎯 One System, Five Days, Complete Mastery


Week 6 Philosophy

Unlike Weeks 1-5, where each day covered a new concept, Weeks 6-8 are immersive practical weeks. We take ONE complex real-world system and spend the entire week designing it end-to-end.

┌────────────────────────────────────────────────────────────────────────┐
│                    WEEK 6-8 APPROACH                                   │
│                                                                        │
│  WEEKS 1-5: Learn concepts                                             │
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                               │
│  │Day 1│ │Day 2│ │Day 3│ │Day 4│ │Day 5│                               │
│  │Topic│ │Topic│ │Topic│ │Topic│ │Topic│                               │
│  │  A  │ │  B  │ │  C  │ │  D  │ │  E  │                               │
│  └─────┘ └─────┘ └─────┘ └─────┘ └─────┘                               │
│                                                                        │
│  WEEKS 6-8: Apply concepts to ONE real system                          │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │                    NOTIFICATION PLATFORM                       │    │
│  │  Day 1      Day 2      Day 3      Day 4      Day 5             │    │
│  │  Problem    Core       Advanced   Scale &    Operations        │    │
│  │  & Design   Flows      Features   Edge Cases & Interview       │    │
│  └────────────────────────────────────────────────────────────────┘    │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Why Notification Platform?

A notification platform is the perfect teaching system because it touches EVERYTHING:

CONCEPTS FROM WEEKS 1-5 APPLIED:

Week 1 (Data at Scale):
├── Partitioning notification queues
├── Replication for delivery guarantees
├── Rate limiting per user/channel
├── Hot key handling (celebrity notifications)
└── Session management for push tokens

Week 2 (Failure-First Design):
├── Timeout management with external providers
├── Idempotency for duplicate prevention
├── Circuit breakers for failing providers
├── Retry strategies per channel
└── Dead letter handling

Week 3 (Messaging & Async):
├── Queue vs stream for notification pipeline
├── Transactional outbox for reliable publishing
├── Backpressure from external providers
├── Dead letter queues for failed notifications
└── Audit logging for compliance

Week 4 (Caching):
├── User preference caching
├── Template caching
├── Device token caching
├── Rate limit counter caching
└── Provider health caching

Week 5 (Consistency & Coordination):
├── Consistency for preference updates
├── Saga for multi-channel notifications
├── Workflow orchestration for complex flows
├── Conflict resolution for preference sync
└── Leader election for batch processors

The Problem Statement

╔══════════════════════════════════════════════════════════════════════════╗
║                                                                          ║
║              Design a Multi-Channel Notification Platform                ║
║                                                                          ║
║   You're building the notification infrastructure for a fintech          ║
║   super-app (like Revolut, Cash App, or PayTM). Users receive            ║
║   notifications about:                                                   ║
║                                                                          ║
║   • Transactions (payments, transfers, refunds)                          ║
║   • Security alerts (login, password change, suspicious activity)        ║
║   • Marketing campaigns (promotions, new features)                       ║
║   • Reminders (bills due, low balance, scheduled payments)               ║
║   • Social (friend requests, splits, payment requests)                   ║
║                                                                          ║
║   Channels: Push, Email, SMS, In-App, WhatsApp                           ║
║                                                                          ║
║   Scale:                                                                 ║
║   • 50M users                                                            ║
║   • 500M notifications/day                                               ║
║   • 10M notifications/hour during campaigns                              ║
║   • 99.9% delivery SLA for critical notifications                        ║
║                                                                          ║
╚══════════════════════════════════════════════════════════════════════════╝
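
To make the scale figures concrete, here is a rough back-of-envelope sketch that turns them into per-second rates. The function name `estimate_load` and the choice to add campaign traffic on top of the daily average are illustrative assumptions, not part of the problem statement; real traffic is also bursty, so true peaks will sit above these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60


def estimate_load(users: int, notifications_per_day: int, campaign_per_hour: int) -> dict:
    """Convert the headline scale numbers into average per-second rates."""
    avg_per_sec = notifications_per_day / SECONDS_PER_DAY
    campaign_per_sec = campaign_per_hour / SECONDS_PER_HOUR
    return {
        "per_user_per_day": notifications_per_day / users,
        "avg_per_sec": avg_per_sec,
        "campaign_per_sec": campaign_per_sec,
        # Assumption: a campaign runs on top of normal traffic.
        "peak_per_sec": avg_per_sec + campaign_per_sec,
    }


load = estimate_load(
    users=50_000_000,
    notifications_per_day=500_000_000,
    campaign_per_hour=10_000_000,
)
for name, value in load.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions the platform averages about 10 notifications per user per day and roughly 5,800 notifications per second, with a campaign adding around 2,800 per second on top.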

Week 6 Daily Breakdown

Day 1: Problem Understanding & High-Level Design

Theme: "Before you solve it, understand it deeply"

WHAT WE'LL COVER:

1. INTERVIEW APPROACH
   ├── How to clarify requirements (with sample dialogue)
   ├── Functional vs non-functional requirements
   ├── Asking the RIGHT questions
   └── Establishing constraints and priorities

2. DOMAIN DEEP DIVE
   ├── Notification types and their characteristics
   ├── Channel comparison (push vs email vs SMS vs...)
   ├── User preferences and consent management
   └── Regulatory requirements (GDPR, CAN-SPAM, etc.)

3. BACK OF ENVELOPE ESTIMATION
   ├── Traffic patterns (steady state vs campaigns)
   ├── Storage requirements
   ├── Provider costs and limits
   └── Infrastructure sizing

4. HIGH-LEVEL ARCHITECTURE
   ├── Component identification
   ├── Data flow design
   ├── Technology choices with justification
   └── API design (ingestion, preferences, status)

5. SCHEMA DESIGN
   ├── Notification schema
   ├── User preferences schema
   ├── Delivery status schema
   └── Audit log schema

WEEK 1-5 CONCEPTS APPLIED:
• Partitioning strategy (Week 1)
• Queue vs stream decision (Week 3)
• Consistency model for preferences (Week 5)

Day 2: Core Notification Flow

Theme: "The happy path must be bulletproof"

WHAT WE'LL COVER:

1. NOTIFICATION INGESTION
   ├── API design for sending notifications
   ├── Validation and enrichment
   ├── Priority classification
   └── Idempotency handling

2. ROUTING AND CHANNEL SELECTION
   ├── User preference lookup
   ├── Channel eligibility rules
   ├── Fallback channel logic
   └── Time-zone aware delivery

3. QUEUE ARCHITECTURE
   ├── Topic/partition design
   ├── Priority queues
   ├── Per-channel queues
   └── Ordering guarantees

4. PROVIDER INTEGRATION
   ├── Provider abstraction layer
   ├── Push: FCM, APNs integration
   ├── Email: SendGrid, SES integration
   ├── SMS: Twilio, SNS integration
   └── WhatsApp: Business API integration

5. DELIVERY TRACKING
   ├── Status state machine
   ├── Delivery receipts and callbacks
   ├── Bounce handling
   └── Read receipts (where available)

6. IMPLEMENTATION
   ├── Complete code for notification service
   ├── Provider integration code
   ├── Database operations
   └── Unit and integration tests

WEEK 1-5 CONCEPTS APPLIED:
• Transactional outbox (Week 3)
• Idempotency patterns (Week 2)
• Rate limiting (Week 1)
• Timeout management (Week 2)

Day 3: Advanced Features

Theme: "The features that separate good from great"

WHAT WE'LL COVER:

1. TEMPLATE SYSTEM
   ├── Template storage and versioning
   ├── Personalization and variables
   ├── Localization (i18n)
   ├── A/B testing support
   └── Template preview and validation

2. BATCHING AND DIGESTS
   ├── When to batch (trading notifications)
   ├── Digest generation (daily summary)
   ├── Smart batching algorithms
   └── User-configurable digest preferences

3. SCHEDULING AND DELAYED DELIVERY
   ├── Scheduled notifications
   ├── Time-zone aware scheduling
   ├── "Best time to send" algorithms
   ├── Reminder workflows
   └── Cancellation handling

4. MULTI-CHANNEL ORCHESTRATION
   ├── Saga pattern for multi-channel
   ├── Fallback chains (push → SMS → email)
   ├── Escalation workflows
   └── Channel coordination

5. USER PREFERENCE MANAGEMENT
   ├── Preference hierarchy (global → category → channel)
   ├── Quiet hours / Do Not Disturb
   ├── Frequency capping
   └── Unsubscribe handling

6. REAL-TIME FEATURES
   ├── In-app notification center
   ├── WebSocket delivery
   ├── Read/unread state sync
   └── Notification grouping

WEEK 1-5 CONCEPTS APPLIED:
• Saga pattern (Week 5)
• Workflow orchestration (Week 5)
• Caching strategies (Week 4)
• Conflict resolution for preferences (Week 5)
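
As a taste of the multi-channel orchestration topic, the fallback chain listed above (push, then SMS, then email) can be sketched in a few lines. This is a minimal illustration under stated assumptions: `deliver_with_fallback` and the `Sender` callable type are hypothetical names, and each sender is assumed to return `True` when its provider accepted the message.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical sender signature: returns True when the provider accepted the message.
Sender = Callable[[str], bool]


def deliver_with_fallback(
    message: str, chain: List[str], senders: Dict[str, Sender]
) -> Optional[str]:
    """Try each channel in order; return the first channel that succeeds, else None."""
    for channel in chain:
        sender = senders.get(channel)
        if sender is None:
            continue  # channel not configured for this user
        try:
            if sender(message):
                return channel
        except Exception:
            # Treat a provider error as a failed attempt and move to the next channel.
            continue
    return None


# Usage: push is rejected, so the SMS sender delivers the message.
used = deliver_with_fallback(
    "Your payment was received",
    chain=["push", "sms", "email"],
    senders={
        "push": lambda msg: False,
        "sms": lambda msg: True,
        "email": lambda msg: True,
    },
)
print(used)
```

A production version would also need to record which channels were attempted, so that a late push delivery does not produce a duplicate alongside the SMS.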

Day 4: Scale, Reliability & Edge Cases

Theme: "What breaks at scale? Everything."

WHAT WE'LL COVER:

1. SCALING CHALLENGES
   ├── Campaign mode (10M notifications in 1 hour)
   ├── Hot user problem (celebrity with 1M followers)
   ├── Provider rate limits
   └── Database bottlenecks

2. RELIABILITY PATTERNS
   ├── Circuit breakers per provider
   ├── Retry strategies with backoff
   ├── Dead letter queue processing
   ├── Provider failover
   └── Graceful degradation

3. EDGE CASES (THE HARD STUFF)
   ├── User uninstalls app mid-notification
   ├── Phone number/email changes
   ├── Device token rotation
   ├── Duplicate device registrations
   ├── Time zone edge cases (DST)
   ├── Provider outages
   ├── Partial delivery (multi-channel)
   └── Race conditions in preferences

4. FAILURE SCENARIOS
   ├── Database failure
   ├── Queue failure
   ├── Provider failure (all channels)
   ├── Network partition
   └── Cascading failures

5. DATA CONSISTENCY
   ├── Exactly-once delivery (is it possible?)
   ├── At-least-once with deduplication
   ├── Preference consistency across devices
   └── Status consistency

6. COST OPTIMIZATION
   ├── Provider cost comparison
   ├── Batching for cost reduction
   ├── Smart channel selection
   └── Reducing unnecessary notifications

WEEK 1-5 CONCEPTS APPLIED:
• Circuit breakers (Week 2)
• Backpressure handling (Week 3)
• Dead letter queues (Week 3)
• Hot key mitigation (Week 1)
• Leader election for processors (Week 5)
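
To preview the "circuit breakers per provider" item, here is a minimal sketch of one breaker, of which the platform would keep one instance per provider. The class name, thresholds, and injectable `clock` are illustrative assumptions. It is deliberately simplified: once the cool-down has elapsed it lets every request through until the next failure, rather than admitting a single trial request.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures, then allows traffic again after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: block until the cool-down has elapsed, then allow trial traffic.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self._opened_at = self._clock()


# One breaker per provider, keyed by a hypothetical provider name.
breakers = {name: CircuitBreaker() for name in ("fcm", "apns", "twilio")}
```

A worker would call `allow_request()` before each provider call and skip straight to failover or a delayed retry when it returns `False`.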

Day 5: Operations, Monitoring & Interview Mastery

Theme: "Ship it, run it, ace the interview"

WHAT WE'LL COVER:

1. OBSERVABILITY
   ├── Key metrics (delivery rate, latency, cost)
   ├── Distributed tracing setup
   ├── Log aggregation
   └── Custom dashboards

2. ALERTING STRATEGY
   ├── SLOs and SLIs definition
   ├── Alert hierarchy (critical/warning/info)
   ├── On-call runbooks
   └── Incident response

3. OPERATIONAL TOOLING
   ├── Admin dashboard features
   ├── Notification search and debugging
   ├── Manual retry interface
   ├── Provider health dashboard
   └── Cost monitoring

4. DEPLOYMENT & ROLLOUT
   ├── Canary deployment
   ├── Feature flags for new channels
   ├── Database migrations
   └── Rollback procedures

5. INTERVIEW WALKTHROUGH
   ├── Complete 45-minute interview simulation
   ├── Common interviewer questions
   ├── How to handle curveballs
   ├── Trade-off discussions
   └── What NOT to say

6. REAL-WORLD CASE STUDIES
   ├── How Uber built their notification platform
   ├── How Slack handles notifications
   ├── How WhatsApp scaled messaging
   └── Lessons from production incidents

7. COMPLETE SYSTEM SUMMARY
   ├── Architecture diagram (final)
   ├── Component interaction matrix
   ├── Decision log with rationale
   └── Future improvements roadmap

WEEK 1-5 CONCEPTS APPLIED:
• All concepts integrated
• Full system thinking
• Production readiness

What Makes This Week Different

1. Interview-Focused Approach

Each day includes dialogue showing HOW to present in an interview:

EXAMPLE FROM DAY 2:

**Interviewer**: "Walk me through how a notification gets sent."

**You**: "Let me trace a transaction notification end-to-end.

First, when a payment completes, the payment service publishes a 
PaymentCompleted event. I'd use the transactional outbox pattern 
here: we write the event to an outbox table in the same transaction 
as the payment, then a separate process publishes to Kafka.

The notification service consumes this event and..."
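
The outbox step described in this answer can be sketched with SQLite standing in for the payment database. The table layout and the names `complete_payment` and `relay_outbox` are assumptions made for illustration, and the `publish` callable stands in for a Kafka producer.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE payments (id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_type TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def complete_payment(conn: sqlite3.Connection, payment_id: str, amount_cents: int) -> None:
    # `with conn` commits both inserts together, or rolls both back on an exception.
    with conn:
        conn.execute(
            "INSERT INTO payments (id, amount_cents) VALUES (?, ?)",
            (payment_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            (
                "PaymentCompleted",
                json.dumps({"payment_id": payment_id, "amount_cents": amount_cents}),
            ),
        )


def relay_outbox(conn: sqlite3.Connection, publish: Callable[[str, dict], None]) -> int:
    """Publish unpublished events in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # stand-in for a Kafka producer
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the relay marks a row as published only after `publish` returns, a crash between those two steps causes the event to be sent again. That is at-least-once delivery, which is why the consuming notification service still needs idempotency.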

2. Production Reality

We don't just design; we discuss what ACTUALLY breaks:

EXAMPLE FROM DAY 4:

EDGE CASE: Device Token Rotation

Problem:
  iOS device tokens can change (for example, after an app reinstall or a device restore)
  Old token is still stored in our database
  Push notification fails with "InvalidToken"
  
If we just retry:
  Same failure, wasted resources
  User never gets notification
  
Production solution:
  1. Detect InvalidToken response
  2. Mark token as invalid in DB
  3. Route to fallback channel (email/SMS)
  4. When app opens next, register new token
  5. Run a background job to clean up stale tokens
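
The five steps above can be sketched as follows. Everything here is a simplified illustration: `TokenStore` is an in-memory stand-in for the device-token table, `handle_push` is a hypothetical worker function, and `"InvalidToken"` is the generic label from the example rather than a real provider error code.

```python
from typing import Callable, Dict, Optional, Tuple

INVALID_TOKEN = "InvalidToken"  # generic label used in the text, not a real provider code


class TokenStore:
    """In-memory stand-in for the device-token table (hypothetical)."""

    def __init__(self) -> None:
        self._tokens: Dict[str, Tuple[str, bool]] = {}  # user_id -> (token, is_valid)

    def register(self, user_id: str, token: str) -> None:
        # Step 4: the app registers a fresh token the next time it opens.
        self._tokens[user_id] = (token, True)

    def valid_token(self, user_id: str) -> Optional[str]:
        entry = self._tokens.get(user_id)
        return entry[0] if entry and entry[1] else None

    def mark_invalid(self, user_id: str) -> None:
        # Step 2: keep the row but flag it so no worker retries the dead token.
        token, _ = self._tokens[user_id]
        self._tokens[user_id] = (token, False)

    def purge_invalid(self) -> int:
        # Step 5: background cleanup of stale tokens; returns how many were removed.
        stale = [u for u, (_, ok) in self._tokens.items() if not ok]
        for user_id in stale:
            del self._tokens[user_id]
        return len(stale)


def handle_push(
    user_id: str,
    message: str,
    store: TokenStore,
    push: Callable[[str, str], str],
    fallback: Callable[[str, str], None],
) -> str:
    """Return 'delivered', 'fallback', or 'retry' for one push attempt."""
    token = store.valid_token(user_id)
    if token is None:
        fallback(user_id, message)  # no usable token: go straight to email/SMS
        return "fallback"
    result = push(token, message)
    if result == "ok":
        return "delivered"
    if result == INVALID_TOKEN:  # Step 1: detect the invalid-token response
        store.mark_invalid(user_id)
        fallback(user_id, message)  # Step 3: route to the fallback channel
        return "fallback"
    return "retry"  # other errors go through the normal retry path
```

Separating the invalid-token outcome from `"retry"` is the point of the example: retrying only makes sense for errors that might succeed later.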

3. Complete Implementation

Not pseudo-code, but production-ready patterns:

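# Example from Day 2: Provider abstraction

from __future__ import annotations  # defer evaluation of the type hints below

from abc import ABC, abstractmethod

# Notification, DeliveryResult, HealthStatus and RateLimit are the platform's
# own types; their definitions are not shown in this preview.


class NotificationProvider(ABC):
    """Base class for notification providers."""

    @abstractmethod
    async def send(self, notification: Notification) -> DeliveryResult:
        """Send notification through this provider."""

    @abstractmethod
    async def check_health(self) -> HealthStatus:
        """Check provider health."""

    @abstractmethod
    def get_rate_limit(self) -> RateLimit:
        """Get current rate limit info."""


class FCMProvider(NotificationProvider):
    """Firebase Cloud Messaging provider."""

    async def send(self, notification: Notification) -> DeliveryResult:
        # Full implementation with error handling,
        # retries, token validation, etc.
        ...

    # check_health() and get_rate_limit() must also be implemented
    # before FCMProvider can be instantiated.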

4. Edge Case Exhaustiveness

Every edge case you might face in production OR interviews:

EDGE CASES WE'LL COVER:

Notification Creation:
├── Duplicate notification requests
├── Invalid user ID
├── User doesn't exist
├── Missing required fields
├── Template not found
└── Invalid channel specified

Delivery:
├── Invalid device token
├── Expired device token
├── User unsubscribed
├── Provider timeout
├── Provider rate limited
├── Provider returns unknown error
├── Partial multi-channel delivery
└── Network failure mid-delivery

User State:
├── User deletes account mid-notification
├── User changes email during send
├── User in multiple time zones
├── User preferences change during send
├── User blocks sender
└── User marks as spam

System:
├── Database failover during write
├── Kafka partition rebalance
├── Worker crashes mid-processing
├── Clock skew between services
├── Memory pressure
└── Provider certificate expiry

Concepts Mapping

Here's how Week 1-5 concepts map to Week 6:

┌─────────────────────────────────────────────────────────────────────────┐
│                    CONCEPT APPLICATION MAP                              │
│                                                                         │
│  CONCEPT                      │ WHERE APPLIED IN NOTIFICATION PLATFORM  │
│  ─────────────────────────────┼─────────────────────────────────────────│
│                               │                                         │
│  WEEK 1: DATA AT SCALE        │                                         │
│  Partitioning                 │ Kafka topics by priority/channel        │
│  Replication                  │ PostgreSQL replicas for reads           │
│  Rate Limiting                │ Per-user, per-channel, per-provider     │
│  Hot Keys                     │ Celebrity notifications fan-out         │
│  Session Store                │ Device token management                 │
│                               │                                         │
│  WEEK 2: FAILURE-FIRST        │                                         │
│  Timeouts                     │ Provider API timeouts                   │
│  Idempotency                  │ Deduplication keys                      │
│  Circuit Breakers             │ Per-provider circuit breakers           │
│  Retries                      │ Exponential backoff for failures        │
│                               │                                         │
│  WEEK 3: MESSAGING            │                                         │
│  Queue vs Stream              │ Kafka for notifications pipeline        │
│  Transactional Outbox         │ Reliable event publishing               │
│  Backpressure                 │ Provider rate limit handling            │
│  Dead Letter Queue            │ Failed notification handling            │
│  Audit Log                    │ Notification audit trail                │
│                               │                                         │
│  WEEK 4: CACHING              │                                         │
│  Cache Patterns               │ User preferences cache                  │
│  Invalidation                 │ Preference change propagation           │
│  Thundering Herd              │ Template cache warming                  │
│  Multi-Tier                   │ Local + Redis + DB                      │
│                               │                                         │
│  WEEK 5: COORDINATION         │                                         │
│  Consistency                  │ Preference read-your-writes             │
│  Saga                         │ Multi-channel delivery                  │
│  Workflow                     │ Complex notification flows              │
│  Conflict Resolution          │ Preference sync across devices          │
│  Leader Election              │ Batch processor coordination            │
│                               │                                         │
└─────────────────────────────────────────────────────────────────────────┘

Expected Outcomes

By the end of Week 6, you will:

✓ Design a notification platform from scratch in 45 minutes
✓ Handle ANY edge case an interviewer throws at you
✓ Explain trade-offs with confidence
✓ Write production-quality code for key components
✓ Debug notification delivery issues
✓ Understand real-world notification systems (Uber, Slack, etc.)
✓ Size infrastructure correctly
✓ Design monitoring and alerting
✓ Handle multi-provider failover
✓ Implement preference management correctly

Ready to Start?

Day 1 begins with understanding the problem deeply and creating the high-level design.

We'll approach it exactly as you would in an interview: clarifying requirements, estimating scale, and making initial architecture decisions.

Let's build a world-class notification platform! 🚀


Week 6 of the System Design Mastery Series