Bonus Problem 1: India's UPI
The World's Largest Real-Time Payment System
🇮🇳 A Revolution That Changed a Nation
In 2016, India launched an experiment. Could a country of 1.4 billion people, many unbanked, leapfrog decades of payment infrastructure and go directly to real-time digital payments?
Nine years later, the answer is extraordinary.
THE NUMBERS THAT DEFINE UPI (2025)
├── Daily transactions: 640+ million
├── Monthly transactions: 20+ billion
├── Annual transactions: 250+ billion
├── Annual value: $3.4+ trillion (₹247 lakh crore)
├── Participating banks: 680+
├── Active users: 500+ million
├── Average latency: ~270 milliseconds
├── Success rate: 99.2%
├── Global share: ~50% of the world's real-time digital payment transactions
└── Countries accepting UPI: 8+ (Singapore, UAE, France, Nepal, Bhutan, Sri Lanka...)
For context:
• UPI processes MORE transactions than Visa and Mastercard COMBINED in India
• A tea vendor in rural India uses the same system as a Fortune 500 company
• Transactions as small as ₹1 (about 1 US cent) work with the same reliability as ₹1 crore
• The system operates 24/7/365 with no maintenance windows
This is the system we'll design today.
The Interview Begins
You're interviewing at a fintech company. The principal architect draws on the whiteboard:
Interviewer: "India's UPI is considered one of the greatest achievements in financial technology. Countries around the world are trying to replicate it. Today, I want you to design a system like UPI from scratch."
────────────────────────────────────────────────────────────────────────
  Design a Real-Time Inter-Bank Payment System

  Build the infrastructure that enables instant money transfers
  between ANY two bank accounts in a country using just a phone.

  Requirements:
  • Support 500+ banks with different legacy systems
  • Handle 600+ million transactions per day
  • Complete each transaction in < 2 seconds end-to-end
  • 99.9% availability (< 9 hours downtime/year)
  • Zero tolerance for money loss (atomic transactions)
  • Work on basic smartphones with 2G/3G connectivity
  • Support both P2P (person-to-person) and P2M (person-to-merchant)
────────────────────────────────────────────────────────────────────────
Interviewer: "This is arguably one of the hardest system design problems: you're building financial infrastructure for a nation. Take your time."
Phase 1: Requirements Clarification
You: "Before I start, let me understand the constraints better."
Your Questions
You: "First, what's the relationship between banks and this central system? Do banks connect directly, or through intermediaries?"
Interviewer: "Banks connect through the central switch (NPCI in India's case). They don't talk to each other directly. The central system orchestrates everything."
You: "What about the mobile apps? Can any app connect to the system?"
Interviewer: "Apps must be approved and must partner with a bank. We call these Payment Service Providers or PSPs. PhonePe partners with Yes Bank, Google Pay with multiple banks. The app itself doesn't hold money; it's just an interface."
You: "How do users identify each other? Bank account numbers are long and error-prone."
Interviewer: "Great observation. UPI solved this with Virtual Payment Addresses, which work like an email address for money: username@bankname. The system maps this to actual bank accounts."
You: "What happens if a transaction fails mid-way? Say money is debited but not credited?"
Interviewer: "This is critical. The system MUST be atomic. Either the full transaction succeeds, or it's completely rolled back. Users cannot lose money due to technical failures."
You: "What's the peak traffic pattern? Is it bursty?"
Interviewer: "Very bursty. Evening hours see 2-3x average traffic. Festival seasons like Diwali can see 5x normal load. The system must handle these gracefully."
Requirements Summary
Functional Requirements:
1. USER MANAGEMENT
   • Register users via mobile number + bank account
   • Create and manage Virtual Payment Addresses (VPAs)
   • Link multiple bank accounts to one app
   • Two-factor authentication (device binding + PIN)
2. PAYMENT OPERATIONS
   • Push payments (I send money to you)
   • Pull payments / Collect requests (I request money from you)
   • QR code payments (scan and pay)
   • Recurring payments (autopay/mandates)
3. TRANSACTION PROCESSING
   • Real-time debit from sender's bank
   • Real-time credit to receiver's bank
   • Transaction status tracking
   • Refund/reversal handling
4. BANK INTEGRATION
   • Standardized APIs for all banks
   • VPA to account resolution
   • Balance inquiry (with consent)
   • Account validation
5. SETTLEMENT
   • Net settlement between banks (periodic)
   • Reconciliation and dispute handling
   • Audit trail for compliance
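To make "net settlement between banks" concrete, here is a minimal sketch of how a settlement engine might collapse many individual transfers into one net position per bank. The tuple layout, the bank codes and the `net_positions` name are illustrative assumptions, not NPCI's actual settlement format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (payer_bank, payee_bank, amount_in_paise); the layout is an assumption
Transfer = Tuple[str, str, int]


def net_positions(transfers: Iterable[Transfer]) -> Dict[str, int]:
    """Return each bank's net position in paise.

    Positive means the bank is owed money at settlement time,
    negative means the bank must pay in.
    """
    net: Dict[str, int] = defaultdict(int)
    for payer_bank, payee_bank, amount in transfers:
        net[payer_bank] -= amount
        net[payee_bank] += amount
    return dict(net)


transfers = [
    ("HDFC", "AXIS", 50_000),  # Alice (HDFC) pays Bob (Axis) ₹500
    ("AXIS", "HDFC", 20_000),
    ("SBI", "HDFC", 10_000),
]
positions = net_positions(transfers)
print(positions)  # {'HDFC': -20000, 'AXIS': 30000, 'SBI': -10000}
```

Three transfers become three settlement entries here, but the same function reduces millions of transfers to one entry per bank, and the positions always sum to zero.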
Non-Functional Requirements:
SCALE
   • 600+ million transactions/day
   • 20+ billion transactions/month
   • 500+ million active users
   • 680+ participating banks
LATENCY
   • End-to-end: < 2 seconds (p99)
   • NPCI switch processing: < 300ms
   • Bank response time: < 1 second
AVAILABILITY
   • 99.9% uptime (about 8.8 hours downtime/year max)
   • 24/7/365 operation
   • No scheduled maintenance windows
CONSISTENCY
   • ACID transactions (money can't be lost)
   • Exactly-once semantics
   • Atomic debit-credit operations
SECURITY
   • End-to-end encryption
   • Device binding
   • Multi-factor authentication
   • Fraud detection in real-time
Phase 2: Back of the Envelope Estimation
You: "Let me work through the numbers..."
Traffic Calculations
TRANSACTIONS PER SECOND
Daily transactions: 640,000,000
Seconds per day: 86,400
Average TPS: ~7,400 TPS
Peak multiplier: 3-5x (evenings, festivals)
Peak TPS: ~22,000-37,000 TPS (roughly 25,000-35,000 as a planning range)
Per transaction, multiple operations:
├── VPA resolution: 1 lookup
├── Sender bank call: 1 API call
├── Receiver bank call: 1 API call
├── Audit logging: 1-2 writes
└── Notifications: 2 pushes
Effective operations/second: ~150,000+ at peak
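The arithmetic above is easy to re-check in a few lines. This is only a back-of-the-envelope sketch; the variable names are mine and the 7 operations per transaction take the upper end of the audit-logging range.

```python
DAILY_TXNS = 640_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

avg_tps = DAILY_TXNS / SECONDS_PER_DAY

# 3x (ordinary evening) to 5x (festival) multiplier on the average
peak_low_tps = 3 * avg_tps
peak_high_tps = 5 * avg_tps

# VPA lookup + debit call + credit call + 2 audit writes + 2 pushes
OPS_PER_TXN = 1 + 1 + 1 + 2 + 2

print(f"average TPS:  {avg_tps:,.0f}")
print(f"peak TPS:     {peak_low_tps:,.0f} to {peak_high_tps:,.0f}")
print(f"peak ops/sec: {peak_low_tps * OPS_PER_TXN:,.0f} "
      f"to {peak_high_tps * OPS_PER_TXN:,.0f}")
```

Even the low end of the peak range lands above 150,000 operations per second, which is why the downstream components (cache, audit log, notification fan-out) need to be sized for far more than the raw transaction rate.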
Data Volume
STORAGE REQUIREMENTS
Per transaction record:
├── Transaction ID: 36 bytes (UUID)
├── Sender VPA: 50 bytes
├── Receiver VPA: 50 bytes
├── Amount: 8 bytes
├── Timestamps: 16 bytes
├── Status: 4 bytes
├── Bank references: 100 bytes
├── Metadata: 200 bytes
└── Total: ~500 bytes (rounded up)
Daily storage:
├── Transactions: 640M × 500 B = 320 GB/day
├── Audit logs: ~500 GB/day
└── Total: ~800 GB/day
Annual storage: ~300 TB/year
7-year retention: ~2 PB
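The same estimate, written out so the units are explicit. This is a sketch using decimal units (1 GB = 10^9 bytes); the constant names are illustrative, and the audit-log volume is the rough figure quoted above, not a measured one.

```python
RECORD_BYTES = 500                   # rounded-up size of one transaction record
DAILY_TXNS = 640_000_000
AUDIT_BYTES_PER_DAY = 500 * 10**9    # ~500 GB/day, taken from the estimate above

txn_bytes_per_day = DAILY_TXNS * RECORD_BYTES
total_bytes_per_day = txn_bytes_per_day + AUDIT_BYTES_PER_DAY

per_year_tb = total_bytes_per_day * 365 / 10**12
seven_year_pb = per_year_tb * 7 / 1000

print(f"transactions: {txn_bytes_per_day / 10**9:.0f} GB/day")
print(f"total:        {total_bytes_per_day / 10**9:.0f} GB/day")
print(f"per year:     {per_year_tb:.0f} TB")
print(f"7 years:      {seven_year_pb:.1f} PB")
```

These are raw, unreplicated numbers. Replication, indexes and backups would multiply them, so treat ~2 PB as a floor rather than a budget.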
Infrastructure Estimates
COMPUTE REQUIREMENTS
At 25,000 TPS peak, assuming each server handles 1,000 TPS:
├── API servers: 25+ servers (with redundancy: 50+)
├── Database: Clustered, sharded
├── Cache: Distributed Redis cluster
└── Message queues: High-throughput Kafka cluster
NETWORK
├── Connections to 680+ banks
├── Each bank: dedicated secure link
├── Geographic distribution: Multiple data centers
└── Bandwidth: Several Gbps
Phase 3: High-Level Architecture
You: "Let me draw how UPI actually works. It's a beautiful layered architecture."
The Three-Layer Cake
UPI ARCHITECTURE: THE THREE-LAYER CAKE

LAYER 1: USER INTERFACE
  Apps: PhonePe | Google Pay | Paytm | BHIM
      │
      ▼
  PSP (Payment Service Providers)
  • Apps must partner with a bank (PSP bank) to access UPI
  • PhonePe → Yes Bank, Google Pay → multiple banks
  • The PSP handles: user onboarding, VPA creation, UI/UX
      │
      ▼
LAYER 2: NPCI SWITCH
  NPCI UPI PLATFORM
  Components: VPA Mapper | Transaction Router | Fraud Detection
              Settlement Engine | Audit Trail | Dispute Resolution
  • The brain of UPI: routes transactions between banks
  • NPCI NEVER holds money; it only routes information
      │
      ▼
LAYER 3: BANKING LAYER
  Banks: SBI | HDFC | ICICI | Axis | ... (680+ banks)
      │
      ▼
  IMPS (Immediate Payment Service)
  • The settlement rail that actually moves money between banks
  • UPI transactions settle via IMPS under the hood

KEY INSIGHT:
• Money flows: Bank → Bank (through IMPS)
• Information flows: App → PSP → NPCI → Banks
• NPCI is the orchestrator, not a money holder
Transaction Flow
You: "Let me trace a ₹500 payment from Alice to Bob..."
TRANSACTION FLOW: ALICE PAYS BOB ₹500

Participants: Alice's phone (PhonePe app) | NPCI switch | Banks

① INITIATE         Alice's phone → NPCI
                   "Pay ₹500 to bob@okaxis" + Alice's VPA + encrypted PIN
② RESOLVE VPA      NPCI → Axis Bank
                   "Who is bob@okaxis?"
③ VPA RESPONSE     Axis Bank → NPCI
                   "Bob's account at Axis Bank: XXXX1234"
④ DEBIT REQUEST    NPCI → HDFC (Alice's bank)
                   "Debit ₹500 from Alice at HDFC"
⑤ DEBIT RESPONSE   HDFC → NPCI
                   "Debited. Ref: ABC123"
⑥ CREDIT REQUEST   NPCI → Axis (Bob's bank)
                   "Credit ₹500 to Bob at Axis"
⑦ CREDIT RESPONSE  Axis → NPCI
                   "Credited. Ref: XYZ789"
⑧ SUCCESS          NPCI → Alice's phone
                   "Payment complete! Ref: TXN123456"

TOTAL TIME: < 2 seconds
PARALLEL ACTIONS:
• Audit log written at each step
• Fraud check runs during step ①
• Push notifications sent to both Alice and Bob
• Settlement record created for bank reconciliation
Phase 4: Deep Dives
Deep Dive 1: Virtual Payment Address (VPA) Resolution
Week 1 concepts: Partitioning, lookup optimization. Week 4 concepts: Caching.
You: "VPA resolution is called for EVERY transaction. With 640 million daily transactions, this lookup must be blazing fast."
The Challenge:
VPA RESOLUTION CHALLENGE
500+ million VPAs like:
├── alice@okhdfc
├── bob@okaxis
├── merchant@paytm
├── 9876543210@ybl
└── ...
Each VPA maps to:
├── Bank code
├── Account number (encrypted)
├── Account holder name
├── Status (active/blocked)
└── Metadata
Requirements:
├── Lookup latency: < 10ms
├── 100% accuracy (wrong mapping = money to wrong person!)
├── Real-time updates (user changes bank)
└── Handle 50,000+ lookups/second at peak
The Solution:
VPA MAPPER ARCHITECTURE

VPA RESOLUTION FLOW

  VPA: bob@okaxis
      │
      ▼
  PARSE HANDLE        Extract: handle="bob", suffix="okaxis"
      │
      ▼
  SUFFIX LOOKUP       "okaxis" → Axis bank code, via the BANK REGISTRY:
      │                 okhdfc → HDFC
      │                 okaxis → Axis
      │                 paytm  → Paytm
      │                 ybl    → Yes Bank
      ▼
  CACHE CHECK (Redis)
      │
      ├── Cache hit  → go straight to RETURN ACCOUNT DETAILS
      │
      └── Cache miss → QUERY BANK (real-time): Axis Bank VPA database
                       then update the cache
      │
      ▼
  RETURN ACCOUNT DETAILS
      Bank: Axis
      Account: XXXX1234
      Name: Bob Kumar

KEY DESIGN DECISIONS:
1. DISTRIBUTED OWNERSHIP
   Each bank owns its VPA namespace (suffix)
   NPCI doesn't store all VPAs; banks do
   This scales naturally as banks handle their own data
2. CACHING STRATEGY
   Hot VPAs (frequently used) cached at NPCI
   TTL: 15-30 minutes
   Invalidation: Banks push updates for VPA changes
3. HANDLE UNIQUENESS
   handle@suffix is globally unique
   Banks ensure uniqueness within their namespace
   Cross-bank uniqueness handled by suffix differentiation
# vpa/resolver.py
"""
VPA Resolution Service
Maps Virtual Payment Addresses to actual bank accounts.
This is the most critical lookup in the entire system.
"""
import re
import time
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


# In a real codebase these would be imported from the bank gateway module.
class BankTimeoutError(Exception):
    """The bank did not answer within the timeout."""


class BankUnavailableError(Exception):
    """The bank could not be reached."""


# Assumed handle alphabet: letters, digits, '.', '-', '_' (3-50 chars).
HANDLE_PATTERN = re.compile(r"[a-z0-9._-]{3,50}")


@dataclass
class AccountDetails:
    """Resolved account information."""
    bank_code: str
    account_number_masked: str  # Only last 4 digits visible
    account_holder_name: str
    ifsc_code: str
    is_active: bool
    verified_at: datetime


@dataclass
class VPAResolutionResult:
    """Result of VPA resolution."""
    success: bool
    account: Optional[AccountDetails] = None
    error_code: Optional[str] = None
    resolution_time_ms: float = 0
    cache_hit: bool = False


class VPAResolver:
    """
    Resolves VPAs to bank account details.
    Design principles:
    - Cache aggressively (VPAs don't change often)
    - Fail fast on invalid formats
    - Banks are the source of truth
    """

    def __init__(
        self,
        cache,          # Redis cluster
        bank_registry,  # VPA suffix -> bank code mapping
        bank_gateway,   # Gateway to call bank APIs
        metrics
    ):
        self.cache = cache
        self.registry = bank_registry
        self.gateway = bank_gateway
        self.metrics = metrics
        # Cache settings
        self.cache_ttl = timedelta(minutes=30)
        self.negative_cache_ttl = timedelta(minutes=5)

    async def resolve(self, vpa: str) -> VPAResolutionResult:
        """
        Resolve a VPA to account details.
        VPA format: handle@suffix
        Example: alice@okhdfc, 9876543210@ybl
        """
        start = time.perf_counter()

        # Step 1: Parse and validate VPA format
        parsed = self._parse_vpa(vpa)
        if not parsed:
            return VPAResolutionResult(
                success=False,
                error_code="INVALID_VPA_FORMAT"
            )
        handle, suffix = parsed

        # Step 2: Get bank code from suffix
        bank_code = self.registry.get_bank_for_suffix(suffix)
        if not bank_code:
            return VPAResolutionResult(
                success=False,
                error_code="UNKNOWN_VPA_SUFFIX"
            )

        # Step 3: Check cache. The key is built from the normalized parts so
        # that " Bob@OkAxis " and "bob@okaxis" share one entry.
        cache_key = f"vpa:{handle}@{suffix}"
        cached = await self.cache.get(cache_key)
        if cached is not None:
            if cached == "NOT_FOUND":
                return VPAResolutionResult(
                    success=False,
                    error_code="VPA_NOT_FOUND",
                    resolution_time_ms=self._elapsed_ms(start),
                    cache_hit=True
                )
            # Assumes the cache client returns the dict stored below, with
            # verified_at deserialized back into a datetime.
            account = AccountDetails(**cached)
            return VPAResolutionResult(
                success=True,
                account=account,
                resolution_time_ms=self._elapsed_ms(start),
                cache_hit=True
            )

        # Step 4: Query the bank (the source of truth)
        try:
            account = await self.gateway.resolve_vpa(
                bank_code=bank_code,
                handle=handle,
                suffix=suffix
            )
            if account:
                # Cache the result
                await self.cache.set(
                    cache_key,
                    asdict(account),
                    ttl=self.cache_ttl
                )
                return VPAResolutionResult(
                    success=True,
                    account=account,
                    resolution_time_ms=self._elapsed_ms(start),
                    cache_hit=False
                )
            # Cache negative result (VPA doesn't exist) with a shorter TTL
            await self.cache.set(
                cache_key,
                "NOT_FOUND",
                ttl=self.negative_cache_ttl
            )
            return VPAResolutionResult(
                success=False,
                error_code="VPA_NOT_FOUND",
                resolution_time_ms=self._elapsed_ms(start)
            )
        except BankTimeoutError:
            return VPAResolutionResult(
                success=False,
                error_code="BANK_TIMEOUT",
                resolution_time_ms=self._elapsed_ms(start)
            )
        except BankUnavailableError:
            return VPAResolutionResult(
                success=False,
                error_code="BANK_UNAVAILABLE",
                resolution_time_ms=self._elapsed_ms(start)
            )

    def _parse_vpa(self, vpa: str) -> Optional[Tuple[str, str]]:
        """Parse a VPA into a normalized (handle, suffix) pair."""
        if not vpa or '@' not in vpa:
            return None
        parts = vpa.lower().strip().split('@')
        if len(parts) != 2:
            return None
        handle, suffix = parts
        # Validate handle (allowed characters, 3-50 chars)
        if not HANDLE_PATTERN.fullmatch(handle):
            return None
        # Validate suffix length; the registry lookup checks it is registered
        if not suffix or len(suffix) < 2 or len(suffix) > 20:
            return None
        return handle, suffix

    def _elapsed_ms(self, start: float) -> float:
        """Milliseconds since `start` (a time.perf_counter() reading)."""
        return (time.perf_counter() - start) * 1000
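The caching strategy relies on two mechanisms: a TTL, and banks pushing invalidations when a VPA changes. Here is a minimal in-memory sketch of both, assuming a hypothetical `on_vpa_changed` callback for the bank's push message. A production system would use Redis rather than a Python dict, and the injected clock exists only so expiry can be demonstrated without sleeping.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny TTL cache standing in for the Redis cluster."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazily evict the stale entry
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


def on_vpa_changed(cache: TTLCache, vpa: str) -> None:
    """Hypothetical handler for a bank's 'VPA changed' push."""
    cache.invalidate(f"vpa:{vpa.strip().lower()}")


now = [0.0]  # fake clock, advanced by hand
cache = TTLCache(ttl_seconds=30 * 60, clock=lambda: now[0])

cache.set("vpa:bob@okaxis", {"bank_code": "AXIS"})
hit_before = cache.get("vpa:bob@okaxis")      # served from cache

on_vpa_changed(cache, "Bob@OkAxis")           # bank pushes an update
hit_after_push = cache.get("vpa:bob@okaxis")  # None: forces a fresh lookup

cache.set("vpa:bob@okaxis", {"bank_code": "AXIS"})
now[0] += 30 * 60 + 1                         # just past the TTL
hit_after_ttl = cache.get("vpa:bob@okaxis")   # None: entry expired
```

The TTL is the safety net: if a push is lost, a stale mapping survives for at most 30 minutes. Whether that window is acceptable for a payment address is a design decision worth raising with the interviewer.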
Deep Dive 2: Atomic Transactions, the Heart of Trust
Week 2 concepts: Idempotency, failure handling. Week 5 concepts: Distributed transactions, Saga pattern.
You: "The most critical requirement: money cannot disappear. If I debit Alice but fail to credit Bob, Alice must get her money back. Always."
The Challenge:
THE ATOMICITY CHALLENGE
Happy path:
  ① Debit Alice (HDFC): ₹500 ✓
  ② Credit Bob (Axis): ₹500 ✓
  → Success!
Failure scenarios:
SCENARIO A: Credit fails after debit
  ① Debit Alice: ₹500 ✓ (money left Alice's account)
  ② Credit Bob: TIMEOUT ✗ (did it go through or not?)
  → UNCERTAINTY! Alice lost ₹500?
SCENARIO B: Network partition
  ① Debit Alice: ✓
  ② Credit Bob: Request sent...
  ③ Network dies
  ④ We don't know the outcome!
  → UNCERTAINTY!
SCENARIO C: Duplicate request
  ① User clicks "Pay" twice quickly
  ② Two debit requests sent
  → DOUBLE DEBIT! Alice loses ₹1000?
Timeouts and duplicate requests will happen. What a payment system CANNOT allow is for any of them to end in lost or double-charged money.
The Solution:
UPI'S TRANSACTION STATE MACHINE

  CREATED
      │ validation passed
      ▼
  PENDING
      │ send to remitter bank
      ▼
  DEBIT INITIATED ────────▶ FAILED (no debit happened)
      │ debit succeeded
      ▼
  DEBITED (money left sender)
      │ send to beneficiary bank
      ▼
  CREDIT INITIATED ───────▶ COMPLETED (money received)
      │          └────────▶ REVERSED (money back to sender)
      │ credit timeout/failure
      ▼
  DEEMED SUCCESS (uncertain state)
      │ auto-reverse after T+2
      ▼
  REVERSED (money back to sender)

DEEMED SUCCESS: The bank didn't respond in time.
Settlement happens; if the credit actually failed, auto-reversal follows at T+2.
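One way to keep code honest about this diagram is to encode the legal transitions as data and reject everything else. The sketch below is an illustration, not NPCI's implementation: the `ALLOWED` table mirrors the arrows above, and the `DEEMED_SUCCESS → COMPLETED` edge is an added assumption for the case where reconciliation confirms the credit.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    CREATED = "created"
    PENDING = "pending"
    DEBIT_INITIATED = "debit_initiated"
    DEBITED = "debited"
    CREDIT_INITIATED = "credit_initiated"
    COMPLETED = "completed"
    FAILED = "failed"
    DEEMED_SUCCESS = "deemed_success"
    REVERSED = "reversed"


# Legal next states for each state; terminal states map to an empty set.
ALLOWED: Dict[State, Set[State]] = {
    State.CREATED: {State.PENDING},
    State.PENDING: {State.DEBIT_INITIATED},
    State.DEBIT_INITIATED: {State.DEBITED, State.FAILED},
    State.DEBITED: {State.CREDIT_INITIATED},
    State.CREDIT_INITIATED: {
        State.COMPLETED, State.REVERSED, State.DEEMED_SUCCESS,
    },
    # COMPLETED here is an assumption: reconciliation confirmed the credit.
    State.DEEMED_SUCCESS: {State.COMPLETED, State.REVERSED},
    State.COMPLETED: set(),
    State.FAILED: set(),
    State.REVERSED: set(),
}


class IllegalTransition(Exception):
    """Raised when code tries to move a transaction along a missing edge."""


def transition(current: State, new: State) -> State:
    """Return `new` if the move is legal, otherwise raise."""
    if new not in ALLOWED[current]:
        raise IllegalTransition(f"{current.name} -> {new.name}")
    return new


state = transition(State.DEBITED, State.CREDIT_INITIATED)  # legal
```

With a guard like this, a bug that tries to reverse an already COMPLETED payment fails loudly instead of silently moving money twice.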
# transaction/processor.py
"""
Transaction Processing with Atomic Guarantees.
This is the most critical code in the entire system.
Money cannot be lost under any circumstances.
"""
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


def _now() -> datetime:
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


class TransactionState(Enum):
    CREATED = "created"
    PENDING = "pending"
    DEBIT_INITIATED = "debit_initiated"
    DEBITED = "debited"
    CREDIT_INITIATED = "credit_initiated"
    COMPLETED = "completed"
    FAILED = "failed"
    DEEMED_SUCCESS = "deemed_success"
    REVERSED = "reversed"


@dataclass
class Transaction:
    """A UPI transaction record."""
    txn_id: str
    sender_vpa: str
    receiver_vpa: str
    amount: int  # In paise (smallest unit)
    state: TransactionState
    created_at: datetime
    updated_at: datetime
    # Bank references
    sender_bank_ref: Optional[str] = None
    receiver_bank_ref: Optional[str] = None
    # For idempotency
    idempotency_key: Optional[str] = None
    # Reversal tracking
    reversal_initiated: bool = False
    reversal_completed: bool = False


class TransactionProcessor:
    """
    Processes UPI transactions with atomic guarantees.
    Key principles:
    1. IDEMPOTENCY: Same request = same result (no double-debit)
    2. ATOMICITY: Either complete success or complete rollback
    3. DURABILITY: State persisted before any bank call
    4. RECOVERABILITY: Can resume from any failure point
    """

    def __init__(
        self,
        db,              # Transaction database
        bank_gateway,    # VPA-aware facade over the bank API gateway
        reversal_queue,  # Queue for failed transactions
        audit_log
    ):
        self.db = db
        self.gateway = bank_gateway
        self.reversal_queue = reversal_queue
        self.audit = audit_log
        # Hard ceilings per bank call; the latency targets are far lower
        self.debit_timeout = timedelta(seconds=30)
        self.credit_timeout = timedelta(seconds=30)

    async def _set_state(self, txn: Transaction, state: TransactionState):
        """Update the state and persist it before doing anything else."""
        txn.state = state
        txn.updated_at = _now()
        await self.db.save(txn)

    async def process(
        self,
        sender_vpa: str,
        receiver_vpa: str,
        amount: int,
        idempotency_key: str
    ) -> Transaction:
        """
        Process a payment transaction.
        CRITICAL: This method must be idempotent.
        Same idempotency_key = same result, always.
        """
        # STEP 0: Check idempotency.
        # NOTE: check-then-insert is only safe if the database also enforces
        # a unique constraint on idempotency_key; otherwise two concurrent
        # requests can both pass this check.
        existing = await self.db.get_by_idempotency_key(idempotency_key)
        if existing:
            # Return existing result (no reprocessing)
            await self.audit.log("IDEMPOTENT_RETURN", existing.txn_id)
            return existing

        # STEP 1: Create transaction record FIRST
        txn = Transaction(
            txn_id=str(uuid.uuid4()),
            sender_vpa=sender_vpa,
            receiver_vpa=receiver_vpa,
            amount=amount,
            state=TransactionState.CREATED,
            created_at=_now(),
            updated_at=_now(),
            idempotency_key=idempotency_key
        )
        # Persist BEFORE any bank call
        await self.db.save(txn)
        await self.audit.log("TXN_CREATED", txn.txn_id)

        try:
            # STEP 2: Initiate debit
            await self._set_state(txn, TransactionState.DEBIT_INITIATED)
            debit_result = await self.gateway.debit(
                vpa=sender_vpa,
                amount=amount,
                txn_ref=txn.txn_id,
                timeout=self.debit_timeout
            )
            if not debit_result.success:
                if debit_result.status == "TIMEOUT":
                    # Outcome unknown: the debit may have gone through.
                    # Stay in DEBIT_INITIATED and let a later check decide.
                    await self.audit.log("DEBIT_TIMEOUT", txn.txn_id)
                    await self.reversal_queue.schedule_check(
                        txn.txn_id,
                        check_at=_now() + timedelta(hours=24)
                    )
                    return txn
                # Debit failed cleanly: no money moved
                await self._set_state(txn, TransactionState.FAILED)
                await self.audit.log("DEBIT_FAILED", txn.txn_id,
                                     debit_result.error_code)
                return txn

            # STEP 3: Debit succeeded, record it
            txn.sender_bank_ref = debit_result.bank_reference
            await self._set_state(txn, TransactionState.DEBITED)
            await self.audit.log("DEBIT_SUCCESS", txn.txn_id)

            # STEP 4: Initiate credit
            # CRITICAL: From this point, we MUST either complete or reverse
            await self._set_state(txn, TransactionState.CREDIT_INITIATED)
            credit_result = await self.gateway.credit(
                vpa=receiver_vpa,
                amount=amount,
                txn_ref=txn.txn_id,
                timeout=self.credit_timeout
            )
            if credit_result.success:
                # SUCCESS! Transaction complete
                txn.receiver_bank_ref = credit_result.bank_reference
                await self._set_state(txn, TransactionState.COMPLETED)
                await self.audit.log("TXN_COMPLETED", txn.txn_id)
                return txn
            elif credit_result.status == "TIMEOUT":
                # UNCERTAINTY: We don't know if credit happened
                # Mark as DEEMED_SUCCESS; settlement will clarify
                await self._set_state(txn, TransactionState.DEEMED_SUCCESS)
                await self.audit.log("TXN_DEEMED_SUCCESS", txn.txn_id)
                # Schedule reconciliation check
                await self.reversal_queue.schedule_check(
                    txn.txn_id,
                    check_at=_now() + timedelta(hours=24)
                )
                return txn
            else:
                # Credit FAILED: must reverse the debit
                await self._initiate_reversal(txn)
                return txn
        except Exception as e:
            # Unexpected error: check state and recover
            await self.audit.log("TXN_ERROR", txn.txn_id, str(e))
            await self._handle_error(txn, e)
            raise

    async def _handle_error(self, txn: Transaction, error: Exception):
        """
        Recover from an unexpected error using the last persisted state.
        Minimal sketch: if money may have left the sender, schedule a check;
        otherwise the transaction can safely be marked FAILED.
        """
        money_may_have_moved = txn.state in (
            TransactionState.DEBIT_INITIATED,
            TransactionState.DEBITED,
            TransactionState.CREDIT_INITIATED,
        )
        if money_may_have_moved:
            await self.reversal_queue.schedule_check(
                txn.txn_id,
                check_at=_now() + timedelta(hours=24)
            )
        else:
            await self._set_state(txn, TransactionState.FAILED)

    async def _initiate_reversal(self, txn: Transaction):
        """
        Reverse a failed transaction.
        Credit the debited amount back to sender.
        """
        txn.reversal_initiated = True
        txn.updated_at = _now()
        await self.db.save(txn)
        await self.audit.log("REVERSAL_INITIATED", txn.txn_id)
        # Queue for reversal (handled by separate process)
        await self.reversal_queue.enqueue(txn.txn_id)

    async def process_reversal(self, txn_id: str):
        """
        Execute reversal: credit money back to sender.
        Called by reversal worker.
        """
        txn = await self.db.get(txn_id)
        if txn.reversal_completed:
            return  # Already reversed
        # Credit back to sender. The fixed REV- reference lets the bank
        # deduplicate if this call is retried.
        reversal_result = await self.gateway.credit(
            vpa=txn.sender_vpa,
            amount=txn.amount,
            txn_ref=f"REV-{txn.txn_id}",
            timeout=self.credit_timeout
        )
        if reversal_result.success:
            txn.reversal_completed = True
            await self._set_state(txn, TransactionState.REVERSED)
            await self.audit.log("REVERSAL_COMPLETED", txn.txn_id)
        else:
            # Reversal failed: retry later
            # This is a critical alert scenario
            await self.audit.log("REVERSAL_FAILED", txn.txn_id,
                                 reversal_result.error_code)
            await self.reversal_queue.schedule_retry(
                txn.txn_id,
                retry_at=_now() + timedelta(minutes=15)
            )
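The idempotency check in `process` only prevents double debits if "look up the key" and "record the key" happen as one atomic step. Here is a minimal sketch of that insert-if-absent idea, using a hypothetical in-memory `IdempotencyStore`. In a real database the same effect would typically come from a unique constraint on the idempotency key.

```python
from typing import Dict


class IdempotencyStore:
    """In-memory stand-in for a table with a unique idempotency_key column."""

    def __init__(self) -> None:
        self._txn_by_key: Dict[str, str] = {}

    def put_if_absent(self, idempotency_key: str, txn_id: str) -> str:
        """Record txn_id for the key unless one exists; return the winner.

        dict.setdefault does the lookup and the insert in a single step,
        so two callers with the same key always get the same txn_id back.
        """
        return self._txn_by_key.setdefault(idempotency_key, txn_id)


store = IdempotencyStore()

# The user taps "Pay" twice; both requests carry the same key.
first = store.put_if_absent("alice-order-42", "txn-A")
second = store.put_if_absent("alice-order-42", "txn-B")

print(first, second)  # txn-A txn-A
```

The second request learns that `txn-A` already owns the key and can return that transaction's result instead of starting a second debit.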
Interviewer: "What about the 'deemed success' state? That seems risky."
You: "Great catch. Here's how reconciliation handles it..."
DEEMED SUCCESS RECONCILIATION
Scenario: We debited Alice, tried to credit Bob, got TIMEOUT
At NPCI level:
├── Transaction marked DEEMED_SUCCESS
├── We don't know if Bob got money
└── Settlement file sent to banks includes this transaction
At Bank level (T+1 reconciliation):
├── Bank compares settlement file with actual credits
├── If credit happened: Mark as COMPLETED
└── If credit NOT happened: Mark as FAILED → Auto-reversal
Timing:
├── T+0: Transaction happens, deemed success
├── T+1: Banks reconcile, report actual status
├── T+2: NPCI updates final status
└── T+2: If failed, reversal initiated automatically
This is why UPI guidelines say:
"If money is debited but not credited, it will be
automatically reversed within 5 business days"
In practice, it's usually resolved within 24-48 hours.
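The T+1 comparison can be sketched as a simple set lookup. The function name, the outcome labels and the idea that the bank reports a set of credited transaction IDs are all illustrative assumptions; real settlement files carry far more detail.

```python
from typing import Dict, Iterable, Set


def reconcile(deemed_txn_ids: Iterable[str],
              credited_txn_ids: Set[str]) -> Dict[str, str]:
    """Decide the final outcome of each DEEMED_SUCCESS transaction.

    deemed_txn_ids:   transactions whose credit timed out at T+0
    credited_txn_ids: transactions the beneficiary bank confirms it credited
    """
    outcomes: Dict[str, str] = {}
    for txn_id in deemed_txn_ids:
        if txn_id in credited_txn_ids:
            outcomes[txn_id] = "COMPLETED"          # credit did happen
        else:
            outcomes[txn_id] = "REVERSAL_REQUIRED"  # refund the sender
    return outcomes


outcomes = reconcile(
    deemed_txn_ids=["txn-1", "txn-2", "txn-3"],
    credited_txn_ids={"txn-1", "txn-3", "txn-9"},
)
print(outcomes)
# {'txn-1': 'COMPLETED', 'txn-2': 'REVERSAL_REQUIRED', 'txn-3': 'COMPLETED'}
```

Note that `txn-9` appears in the bank's report but was never deemed successful. In a fuller version, unexpected credits like that would be flagged for dispute handling rather than ignored.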
Deep Dive 3: Bank Integration at Scale
Week 2 concepts: Timeouts, circuit breakers. Week 3 concepts: Message queues.
You: "With 680+ banks, each with different legacy systems, bank integration is a massive challenge."
BANK INTEGRATION ARCHITECTURE

NPCI → BANK GATEWAY

  BANK ADAPTER LAYER
    Every bank exposes standard UPI APIs, but internal
    implementations vary wildly. The adapter handles this.
    Adapters: SBI Adapter | HDFC Adapter | Axis Adapter | ...
    Each adapter handles:
    - That bank's quirks
    - Retry logic
    - Timeouts
      │
      ▼
  CIRCUIT BREAKER LAYER
    Per-bank circuit breakers prevent cascade failures
    SBI:   [CLOSED] (healthy)
    HDFC:  [CLOSED] (healthy)
    Axis:  [OPEN]   (failing, skip for 30s)
    ICICI: [HALF]   (testing recovery)
      │
      ▼
  SECURE COMMUNICATION LAYER
    - HTTPS with mutual TLS
    - Request/Response signing
    - Encryption of sensitive data
    - IP whitelisting
    - Dedicated leased lines to major banks
# bank/gateway.py
"""
Bank Gateway: Unified interface to 680+ banks.
Each bank is different. This gateway provides a
consistent interface while handling per-bank quirks.
"""
import asyncio
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from enum import Enum
from typing import Dict, Optional


class CircuitState(Enum):
    CLOSED = "closed"   # Normal operation
    OPEN = "open"       # Bank failing, don't try
    HALF_OPEN = "half"  # Testing if bank recovered


@dataclass
class BankConfig:
    """Configuration for a bank."""
    bank_code: str
    endpoint: str
    timeout_ms: int = 30000
    # Circuit breaker settings
    failure_threshold: int = 5
    recovery_timeout_s: int = 30
    # Bank-specific quirks
    requires_padding: bool = False
    amount_in_rupees: bool = False  # Some banks want rupees, not paise
    legacy_xml_format: bool = False


@dataclass
class BankResponse:
    """Normalized result of a bank call."""
    success: bool
    bank_reference: Optional[str] = None
    status: Optional[str] = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None


class CircuitBreaker:
    """
    Circuit breaker for bank connections.
    Prevents cascade failures when a bank is down.
    """

    def __init__(self, config: BankConfig):
        self.config = config
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure: Optional[float] = None  # time.monotonic() reading
        self.success_count = 0

    def can_execute(self) -> bool:
        """Check if we can make a request to this bank."""
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # Check if recovery timeout passed
            if self.last_failure is not None:
                elapsed = time.monotonic() - self.last_failure
                if elapsed > self.config.recovery_timeout_s:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                    return True
            return False
        # HALF_OPEN: let probe requests through
        return True

    def record_success(self):
        """Record a successful request."""
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= 3:  # 3 successes to close
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0

    def record_failure(self):
        """Record a failed request."""
        self.failure_count += 1
        self.last_failure = time.monotonic()
        if self.state == CircuitState.HALF_OPEN:
            # Back to open
            self.state = CircuitState.OPEN
            self.success_count = 0
        elif self.failure_count >= self.config.failure_threshold:
            self.state = CircuitState.OPEN


class BankGateway:
    """
    Gateway for all bank operations.
    """

    def __init__(
        self,
        http_client,
        bank_configs: Dict[str, BankConfig],
        metrics
    ):
        self.http = http_client
        self.configs = bank_configs
        self.metrics = metrics
        # Circuit breakers per bank
        self.circuits: Dict[str, CircuitBreaker] = {
            code: CircuitBreaker(config)
            for code, config in bank_configs.items()
        }

    async def debit(
        self,
        bank_code: str,
        account_ref: str,
        amount: int,
        txn_ref: str
    ) -> BankResponse:
        """
        Debit an account at a bank.
        Amount is in paise (smallest unit).
        """
        config = self.configs.get(bank_code)
        if not config:
            return BankResponse(
                success=False,
                error_code="UNKNOWN_BANK"
            )
        # Check circuit breaker
        circuit = self.circuits[bank_code]
        if not circuit.can_execute():
            self.metrics.increment("bank_circuit_open", bank_code)
            return BankResponse(
                success=False,
                error_code="BANK_CIRCUIT_OPEN"
            )
        try:
            # Build request (handle bank-specific formats)
            request = self._build_debit_request(
                config, account_ref, amount, txn_ref
            )
            # Make the call
            start = time.perf_counter()
            response = await self.http.post(
                config.endpoint + "/debit",
                json=request,
                timeout=config.timeout_ms / 1000
            )
            latency = (time.perf_counter() - start) * 1000
            # Record metrics
            self.metrics.record_latency("bank_debit", bank_code, latency)
            # Parse response
            result = self._parse_response(config, response)
            # NOTE: a production breaker should count only technical
            # failures here, not business declines (e.g. insufficient funds).
            if result.success:
                circuit.record_success()
            else:
                circuit.record_failure()
            return result
        except (TimeoutError, asyncio.TimeoutError):
            circuit.record_failure()
            self.metrics.increment("bank_timeout", bank_code)
            return BankResponse(
                success=False,
                error_code="TIMEOUT",
                status="TIMEOUT"
            )
        except Exception as e:
            circuit.record_failure()
            self.metrics.increment("bank_error", bank_code)
            return BankResponse(
                success=False,
                error_code="BANK_ERROR",
                error_message=str(e)
            )

    def _build_debit_request(
        self,
        config: BankConfig,
        account_ref: str,
        amount: int,
        txn_ref: str
    ) -> dict:
        """Build bank-specific request format."""
        # Handle amount format (paise vs rupees). Rupees are sent as a
        # decimal string so no float rounding touches money.
        if config.amount_in_rupees:
            amount_value = f"{Decimal(amount) / 100:.2f}"
        else:
            amount_value = amount
        if config.legacy_xml_format:
            # Some old banks still use XML
            return {
                "xml_payload": self._build_xml(
                    account_ref, amount_value, txn_ref
                )
            }
        return {
            "account_reference": account_ref,
            "amount": amount_value,
            "transaction_reference": txn_ref,
            "timestamp": datetime.now(timezone.utc).isoformat()
        }

    def _parse_response(self, config: BankConfig, response) -> BankResponse:
        """Translate a bank's reply into a BankResponse (bank-specific)."""
        raise NotImplementedError("response parsing is not shown here")

    def _build_xml(self, account_ref: str, amount_value, txn_ref: str) -> str:
        """Serialize a request for banks on the legacy XML format."""
        raise NotImplementedError("XML serialization is not shown here")
Deep Dive 4: Security - The Trust Foundation
Week 9 concepts: Security, authentication, fraud detection.
You: "UPI handles ₹20+ trillion monthly. Security isn't optional; it's existential."
UPI SECURITY ARCHITECTURE
────────────────────────────────────────────────────────────────────────

  MULTI-LAYER SECURITY

  LAYER 1: DEVICE BINDING
  ───────────────────────
  • UPI PIN is bound to a specific device
  • Device fingerprint (IMEI, hardware ID)
  • SIM binding (mobile number verification)
  • If the device changes, re-registration is required

  LAYER 2: TWO-FACTOR AUTHENTICATION
  ──────────────────────────────────
  Factor 1: Something you HAVE
  • The registered mobile device
  • The SIM card with the registered number

  Factor 2: Something you KNOW
  • 4-6 digit UPI PIN (set by user)
  • PIN encrypted on device, never transmitted in clear

  LAYER 3: ENCRYPTION
  ───────────────────
  • HTTPS/TLS for all communication
  • UPI PIN protected using PBKDF2 key derivation (600,000 iterations)
  • PIN verification in bank's HSM (Hardware Security Module)
  • End-to-end encryption for sensitive data

  LAYER 4: TRANSACTION SIGNING
  ────────────────────────────
  • Each transaction signed with a digital signature
  • Prevents tampering in transit
  • Non-repudiation for disputes

  LAYER 5: REAL-TIME FRAUD DETECTION
  ──────────────────────────────────
  • Velocity checks (too many transactions too fast)
  • Amount anomaly detection
  • Geo-location checks (impossible travel)
  • Behavioral analysis
  • Block list matching

────────────────────────────────────────────────────────────────────────
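Layer 3 mentions PBKDF2 with 600,000 iterations. To make that concrete, here is a minimal sketch of a PBKDF2 derivation using Python's standard `hashlib.pbkdf2_hmac`. The function names, the 16-byte salt, and the choice of SHA-256 are assumptions for illustration only; this is not NPCI's actual PIN-handling scheme, which relies on verification inside the bank's HSM.

```python
import hashlib
import hmac
import os
from typing import Tuple

# Only the iteration count comes from the text above; the rest is assumed.
ITERATIONS = 600_000
SALT_BYTES = 16


def derive_pin_key(pin: str, salt: bytes, iterations: int = ITERATIONS) -> bytes:
    """Stretch a short PIN into a 32-byte key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", pin.encode("utf-8"), salt, iterations)


def new_pin_record(pin: str, iterations: int = ITERATIONS) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key) for a freshly chosen PIN."""
    salt = os.urandom(SALT_BYTES)
    return salt, derive_pin_key(pin, salt, iterations)


def verify_pin(pin: str, salt: bytes, expected: bytes,
               iterations: int = ITERATIONS) -> bool:
    """Re-derive the key and compare in constant time."""
    candidate = derive_pin_key(pin, salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

A high iteration count slows down offline guessing, but a 4-6 digit PIN has a small keyspace, so attempt limits and device binding still carry much of the weight.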
INFRASTRUCTURE SECURITY
────────────────────────────────────────────────────────────────────────

  NPCI DATA CENTERS
  ─────────────────

  • Tier-IV certified (99.995% uptime)
  • Chennai + Hyderabad (geographically separated)
  • FIPS 140-2 Level 3 certified HSMs
  • Active-Active configuration
  • N+N redundancy

  • Physical security:
    ├── Biometric access control
    ├── 24/7 security personnel
    ├── CCTV surveillance
    └── Man-trap entries

  BANK CONNECTIONS
  ────────────────
  • Dedicated leased lines (not public internet)
  • Mutual TLS authentication
  • IP whitelisting
  • Regular security audits

────────────────────────────────────────────────────────────────────────
# security/fraud_detector.py
"""
Real-Time Fraud Detection for UPI.
Must decide in < 50ms whether to allow a transaction.
"""
from dataclasses import dataclass
from typing import List, Tuple
from datetime import datetime, timedelta


@dataclass
class FraudSignals:
    """Signals used for fraud detection."""
    user_id: str
    device_id: str
    amount: int  # in paise
    receiver_vpa: str
    # Velocity
    txn_count_1hr: int
    txn_count_24hr: int
    total_amount_24hr: int
    # Device
    is_new_device: bool
    device_age_days: int
    # Behavioral
    is_new_receiver: bool
    typical_amount: int
    typical_time_of_day: List[int]
    current_hour: int
    # Location
    device_location: str
    usual_locations: List[str]


class FraudDetector:
    """
    Real-time fraud detection.
    Must be FAST (< 50ms) and ACCURATE (low false positives).
    """

    def __init__(self, ml_model, rules_engine, blocklist):
        self.model = ml_model
        self.rules = rules_engine
        self.blocklist = blocklist

    async def evaluate(
        self,
        signals: FraudSignals
    ) -> Tuple[str, float, List[str]]:
        """
        Evaluate fraud risk.
        Returns: (decision, confidence, triggered_rules)
        decision: "ALLOW", "BLOCK", "STEP_UP"
        """
        triggered_rules = []
        # RULE 1: Blocklist check (instant)
        if await self.blocklist.is_blocked(signals.user_id):
            return "BLOCK", 1.0, ["USER_BLOCKED"]
        if await self.blocklist.is_blocked(signals.device_id):
            return "BLOCK", 1.0, ["DEVICE_BLOCKED"]
        # RULE 2: Velocity checks
        if signals.txn_count_1hr > 10:
            triggered_rules.append("HIGH_VELOCITY_1HR")
        if signals.txn_count_24hr > 50:
            triggered_rules.append("HIGH_VELOCITY_24HR")
        if signals.total_amount_24hr > 100000_00:  # ₹1 lakh in paise
            triggered_rules.append("HIGH_AMOUNT_24HR")
        # RULE 3: Amount anomaly (skipped when there is no history yet,
        # otherwise a typical_amount of 0 would flag every transaction)
        if signals.typical_amount > 0 and signals.amount > signals.typical_amount * 10:
            triggered_rules.append("AMOUNT_ANOMALY")
        # RULE 4: New device
        if signals.is_new_device:
            triggered_rules.append("NEW_DEVICE")
            if signals.amount > 10000_00:  # > ₹10,000 on new device
                triggered_rules.append("HIGH_AMOUNT_NEW_DEVICE")
        # RULE 5: Unusual time
        if signals.current_hour not in signals.typical_time_of_day:
            triggered_rules.append("UNUSUAL_TIME")
        # RULE 6: Location check
        if signals.device_location not in signals.usual_locations:
            triggered_rules.append("UNUSUAL_LOCATION")
        # ML model for complex patterns
        ml_score = await self.model.predict(signals)
        # Decision logic
        if ml_score > 0.9 or len(triggered_rules) > 3:
            return "BLOCK", ml_score, triggered_rules
        if ml_score > 0.7 or len(triggered_rules) > 1:
            # Step-up: require additional verification
            return "STEP_UP", ml_score, triggered_rules
        return "ALLOW", 1 - ml_score, triggered_rules
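The velocity fields `txn_count_1hr` and `txn_count_24hr` have to be computed somewhere before the detector runs. One possible way to do that is a per-user sliding window, sketched below. `VelocityTracker` and its methods are hypothetical names, and the in-memory dictionary is only for illustration; a real deployment would more likely keep these counters in a shared low-latency store so the 50ms budget holds across many nodes.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class VelocityTracker:
    """In-memory sliding-window transaction counter per user (illustrative)."""

    def __init__(self, max_window_s: int = 24 * 3600):
        self.max_window_s = max_window_s
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, user_id: str, ts: float) -> None:
        """Record a transaction at epoch-seconds `ts` (assumed non-decreasing)."""
        events = self._events[user_id]
        events.append(ts)
        self._evict(events, ts)

    def count(self, user_id: str, now: float, window_s: int) -> int:
        """Count transactions in the interval (now - window_s, now]."""
        events = self._events[user_id]
        self._evict(events, now)
        cutoff = now - window_s
        return sum(1 for ts in events if ts > cutoff)

    def _evict(self, events: Deque[float], now: float) -> None:
        # Drop anything older than the largest window we ever query.
        cutoff = now - self.max_window_s
        while events and events[0] <= cutoff:
            events.popleft()
```

With this in place, `txn_count_1hr` would be `tracker.count(user_id, now, 3600)` and `txn_count_24hr` would be `tracker.count(user_id, now, 24 * 3600)`.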
Phase 5: Scaling and Edge Cases
Interviewer: "What happens during Diwali when everyone is sending money?"
You: "UPI handles 5x spikes during festivals. Here's how..."
Festival Traffic Management
DIWALI SCALE (5X NORMAL TRAFFIC)
Normal day:
├── ~640 million transactions
├── ~7,400 average TPS
└── ~25,000 peak TPS
Diwali:
├── ~3 billion transactions
├── ~35,000 average TPS
├── ~150,000+ peak TPS
└── Concentrated in evening hours (7 PM - 11 PM)
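The average TPS figures follow directly from the daily volumes. A quick back-of-the-envelope check, using only the numbers quoted above (the helper names are made up for this sketch):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_tps(transactions_per_day: float) -> float:
    """Average transactions per second over a full day."""
    return transactions_per_day / SECONDS_PER_DAY


def peak_to_average(peak_tps: float, transactions_per_day: float) -> float:
    """How many times higher the peak is than the daily average."""
    return peak_tps / average_tps(transactions_per_day)


print(f"normal avg      ~{average_tps(640e6):,.0f} TPS")
print(f"diwali avg      ~{average_tps(3e9):,.0f} TPS")
print(f"normal peak/avg ~{peak_to_average(25_000, 640e6):.1f}x")
print(f"diwali peak/avg ~{peak_to_average(150_000, 3e9):.1f}x")
```

This gives roughly 7,400 and 34,700 average TPS. The peak-to-average ratio also rises on Diwali (about 3.4x on a normal day versus about 4.3x), which matches the note that festival traffic is concentrated in the evening, so capacity has to be planned against the peak and not the daily average.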
PREPARATION (Weeks Before):
├── Pre-scale infrastructure to 3x capacity
├── Warm up caches with popular VPAs
├── Notify banks to scale their systems
├── Extended support staff on standby
└── Runbooks reviewed and tested
DURING THE EVENT:
├── Auto-scaling triggers at 60% capacity
├── Non-critical features disabled (promotional notifications)
├── Enhanced monitoring (5-second alert intervals)
├── War room with all bank representatives
└── Direct escalation paths to bank CTOs
GRACEFUL DEGRADATION:
If overwhelmed:
├── Prioritize smaller transactions (more users served)
├── Rate limit per-user (max 5 txn/minute)
├── Queue non-urgent operations (mandate registrations)
└── Return "Try again in a few minutes" instead of a hard failure
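The "prioritize smaller transactions" rule can be sketched as an admission step that runs only while the system is overloaded. This is one possible reading, not a description of what NPCI does: `admit_smallest_first` is a hypothetical helper, and treating requests as a batch with a fixed capacity is a simplifying assumption.

```python
import heapq
from typing import List, Tuple


def admit_smallest_first(
    pending: List[Tuple[int, str]], capacity: int
) -> Tuple[List[str], List[str]]:
    """
    pending: (amount_paise, txn_id) pairs waiting for a processing slot.
    Admit up to `capacity` transactions, smallest amounts first; the rest
    are deferred and should receive a soft "try again" response.
    """
    smallest = heapq.nsmallest(capacity, pending)
    admitted = [txn_id for _, txn_id in smallest]
    admitted_ids = set(admitted)
    deferred = [txn_id for _, txn_id in pending if txn_id not in admitted_ids]
    return admitted, deferred
```

Deferred transactions are not failed outright; the caller would map them to the retry-later response from the list above.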
Critical Edge Cases
EDGE CASE 1: Bank System Down
Problem: SBI (largest bank, 30% market share) goes down
Impact: 30% of transactions fail
Solution:
├── Circuit breaker opens for SBI immediately
├── Return clear error: "SBI temporarily unavailable"
├── Pending transactions queued (if bank supports retry)
├── Status page updated
├── Auto-retry when circuit closes
└── Transactions involving SBI gracefully rejected
EDGE CASE 2: NPCI Switch Partial Failure
Problem: One NPCI data center fails
Impact: 50% capacity lost
Solution:
├── Active-Active setup in Chennai and Hyderabad
├── Traffic automatically routes to healthy DC
├── DNS TTL is low (60 seconds) for fast failover
├── Data replicated synchronously between DCs
└── RPO: 0 (no data loss), RTO: < 30 seconds
EDGE CASE 3: Duplicate Transaction Request
Problem: User's app times out, they retry, but first request succeeded
Impact: Double debit
Solution:
├── Every transaction has an idempotency key
├── Generated once on the client per payment attempt (device_id + timestamp + amount + receiver) and reused on every retry
├── NPCI checks the idempotency key before processing
├── If duplicate: return original result
└── The same key is never processed twice
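A minimal sketch of the idempotency check in Edge Case 3, assuming the key is derived once per payment attempt and replayed unchanged on retries. The class and function names are hypothetical, and the in-memory dictionary stands in for what would need to be a durable store with an atomic insert-if-absent in a real switch.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TxnResult:
    txn_ref: str
    status: str


def make_idempotency_key(device_id: str, created_at_ms: int,
                         amount_paise: int, receiver_vpa: str) -> str:
    """Build the key once per payment attempt; retries must reuse it."""
    raw = f"{device_id}|{created_at_ms}|{amount_paise}|{receiver_vpa}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IdempotentProcessor:
    """Runs `process` at most once per key and replays the stored result."""

    def __init__(self, process: Callable[[str], TxnResult]):
        self._process = process
        self._results: Dict[str, TxnResult] = {}

    def submit(self, key: str) -> TxnResult:
        if key in self._results:
            # Duplicate request: return the original outcome, do no work.
            return self._results[key]
        result = self._process(key)
        self._results[key] = result
        return result
```

Note that the timestamp in the key is the time the payment was first created, not the time of the retry. If the client regenerated it on each retry, every retry would look like a new payment and the check would not help. This sketch is also not safe under concurrent duplicates; that is what the atomic insert-if-absent in a real store is for.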
EDGE CASE 4: April 2025 Outage (Real Incident)
What happened:
├── Banks were calling "Check Transaction Status" API excessively
├── Some banks called for old transactions repeatedly
├── NPCI didn't enforce rate limits on this API
└── API flooded, entire system degraded
Lesson learned:
├── Rate limit ALL APIs, not just transaction APIs
├── Enforce guidelines at NPCI firewall, not just bank side
├── Separate critical path APIs from status check APIs
└── Circuit breaker for misbehaving banks
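The first and third lessons can be combined into one mechanism: a separate rate limit per bank and per API, so a flood of status checks from one bank cannot consume the budget of the payment path. Below is a token-bucket sketch of that idea. The class names, API names, and limit values are all assumptions for illustration, not NPCI's actual configuration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TokenBucket:
    rate_per_s: float   # tokens added per second
    capacity: float     # maximum burst size
    tokens: float
    updated_at: float

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class ApiRateLimiter:
    """One bucket per (bank_code, api_name)."""

    def __init__(self, limits: Dict[str, Tuple[float, float]]):
        # limits maps api_name -> (rate_per_s, burst capacity)
        self.limits = limits
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, bank_code: str, api_name: str, now: float) -> bool:
        key = (bank_code, api_name)
        bucket = self._buckets.get(key)
        if bucket is None:
            rate, capacity = self.limits[api_name]
            bucket = TokenBucket(rate, capacity, tokens=capacity, updated_at=now)
            self._buckets[key] = bucket
        return bucket.allow(now)
```

Because buckets are keyed by both bank and API, a bank that exhausts its status-check budget is throttled on that API only; its payment calls and every other bank are unaffected.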
Phase 6: Monitoring and Operations
You: "For a system processing ₹20 trillion monthly, monitoring isn't optional."
Key Metrics Dashboard
UPI OPERATIONS DASHBOARD
────────────────────────────────────────────────────────────────────────

  REAL-TIME HEALTH
  ────────────────

  Transaction Rate     Success Rate       Latency (p99)
  ────────────────     ────────────       ─────────────
  7,842 TPS            99.2%              847 ms
  (Target: 10K)        (Target: 99.0%)    (Target: 1000ms)

  ────────────────────────────────────────────────────────────────────

  BANK HEALTH MATRIX
  ──────────────────

  Bank      TPS     Success   Latency   Circuit     Issues
  ────────────────────────────────────────────────────────────────
  SBI       2,341   98.7%     423ms     CLOSED      Minor lag
  HDFC      1,856   99.8%     287ms     CLOSED      Healthy
  ICICI     1,234   99.5%     312ms     CLOSED      Healthy
  Axis      987     94.2%     892ms     HALF-OPEN   HIGH LATENCY
  Kotak     654     99.1%     345ms     CLOSED      Healthy
  Yes Bank  543     99.4%     298ms     CLOSED      Healthy
  ...

  ────────────────────────────────────────────────────────────────────

  ALERTS (Last 1 hour)
  ────────────────────

  🔴 14:32 - Axis Bank latency > 800ms (CRITICAL)
  🟡 14:28 - SBI error rate 1.3% (Warning)
  🟢 14:15 - Axis Bank circuit half-open (Info)
  🟢 13:45 - Traffic spike +20% (Auto-scaled)

────────────────────────────────────────────────────────────────────────
SLOs for UPI
UPI SERVICE LEVEL OBJECTIVES
────────────────────────────────────────────────────────────────────────

  SLO 1: TRANSACTION SUCCESS RATE
  ───────────────────────────────
  Target: 99.0% of transactions succeed
  Measurement: Successful / Total (excluding user errors)
  Current: 99.2%

  Exclusions:
  • Insufficient balance (user error)
  • Wrong PIN (user error)
  • Account blocked (compliance)

  ────────────────────────────────────────────────────────────────────

  SLO 2: END-TO-END LATENCY
  ─────────────────────────
  Target: 99% of transactions complete in < 2 seconds
  Measurement: Time from request received to response sent
  Current: p99 = 1.2 seconds

  Breakdown:
  • NPCI processing: < 300ms
  • Bank response (each): < 800ms
  • Network overhead: < 200ms

  ────────────────────────────────────────────────────────────────────

  SLO 3: AVAILABILITY
  ───────────────────
  Target: 99.9% uptime
  Measurement: (Total time - Downtime) / Total time
  Allowed downtime: ~8.76 hours/year

  ────────────────────────────────────────────────────────────────────

  SLO 4: MONEY SAFETY
  ───────────────────
  Target: 100% of debited amounts credited or reversed
  Measurement: No money stuck > 5 business days
  Current: 99.99% resolved within 24 hours

  This is NON-NEGOTIABLE. Error budget = 0

────────────────────────────────────────────────────────────────────────
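The SLO targets translate into concrete budgets with a little arithmetic. The sketch below uses only the targets quoted above plus the 640 million daily transactions from earlier in the case study. The helper names are invented, and the assumption that a payment makes two sequential bank calls (debit, then credit) is mine.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime permitted by an availability target over a period."""
    return (1.0 - availability) * period_hours


def failure_budget(success_target: float, total_transactions: int) -> int:
    """Transactions that may fail before the success-rate SLO is breached."""
    return int(round((1.0 - success_target) * total_transactions))


def sequential_latency_budget_ms(npci_ms: int, bank_ms: int,
                                 bank_calls: int, network_ms: int) -> int:
    """Worst-case sum of per-hop budgets when bank calls run one after another."""
    return npci_ms + bank_ms * bank_calls + network_ms


print(allowed_downtime_hours(0.999))                   # about 8.76 hours/year
print(failure_budget(0.99, 640_000_000))               # failures allowed per day
print(sequential_latency_budget_ms(300, 800, 2, 200))  # per-hop budgets summed
```

Two things stand out. A 99.0% success target on 640 million daily transactions still allows 6.4 million failures a day, which is why SLO 4 is tracked separately with a zero error budget. And if the debit and credit calls are sequential, the per-hop budgets add up to 2,100ms, slightly above the 2-second end-to-end target. Percentile budgets do not add linearly, so this is not a contradiction, but the hops cannot all sit at their limits at once.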
Interview Conclusion
Interviewer: "Impressive depth. A few rapid-fire questions:"
Interviewer: "Why didn't India just adopt an existing system like Visa/Mastercard?"
You: "Three reasons:
- Cost: Card networks charge 1.5-3% per transaction. UPI is nearly free.
- Inclusion: Cards need credit checks, plastic production, POS terminals. UPI needs only a phone.
- Control: Critical financial infrastructure shouldn't depend on foreign companies.
The result: UPI enabled the chai vendor to accept digital payments for a ₹10 tea."
Interviewer: "What's the biggest technical achievement of UPI?"
You: "Interoperability without centralized money holding. NPCI routes transactions but never touches the money. This means:
- No counterparty risk (NPCI can't go bankrupt with your money)
- Banks remain the regulated entities
- Scales more easily (NPCI only routes messages; it never holds balances)
- Any app works with any bank
This architecture is why countries worldwide are studying UPI."
Interviewer: "If you were to improve UPI today, what would you change?"
You: "Based on the April 2025 outage:
- Stricter rate limiting at NPCI level, not trusting banks to self-regulate
- Better isolation between critical transaction APIs and status check APIs
- More granular circuit breakers: per-API, not just per-bank
- Chaos engineering: regularly test failure scenarios in production"
Summary: Concepts Applied from 10-Week Course
────────────────────────────────────────────────────────────────────────

  CONCEPTS FROM 10-WEEK COURSE IN UPI DESIGN

  WEEK 1: DATA AT SCALE
  ├── Partitioning: VPAs partitioned by bank suffix
  ├── Replication: Multi-DC active-active setup
  └── Read optimization: VPA caching at NPCI level

  WEEK 2: FAILURE-FIRST DESIGN
  ├── Timeouts: Strict timeouts for bank calls (30s)
  ├── Circuit breakers: Per-bank failure isolation
  ├── Idempotency: Transaction idempotency keys
  └── Retries: Smart retry with exponential backoff

  WEEK 3: MESSAGING & ASYNC
  ├── Transactional outbox: Audit logging
  ├── Dead letter queues: Failed reversal handling
  └── Event streaming: Transaction events for reconciliation

  WEEK 4: CACHING
  ├── VPA resolution caching
  ├── Bank configuration caching
  └── Negative caching: Non-existent VPAs

  WEEK 5: CONSISTENCY & COORDINATION
  ├── Distributed transactions: Debit-then-credit with rollback
  ├── State machine: Transaction lifecycle management
  └── Exactly-once semantics: Idempotency guarantees

  WEEK 9: SECURITY & COMPLIANCE
  ├── Multi-factor authentication: Device + PIN
  ├── Encryption: PBKDF2, HSM-based PIN verification
  ├── Fraud detection: Real-time ML scoring
  └── Audit trail: Complete transaction logging

  WEEK 10: PRODUCTION READINESS
  ├── SLOs: Success rate, latency, availability targets
  ├── Observability: Per-bank health dashboards
  ├── Capacity planning: Festival traffic handling
  └── Incident management: April 2025 outage learnings

────────────────────────────────────────────────────────────────────────
Why UPI Matters
────────────────────────────────────────────────────────────────────────

  WHY UPI IS A MARVEL OF ENGINEERING

  SCALE
  ─────
  • 50% of world's digital transactions
  • More than Visa + Mastercard combined (in India)
  • 640+ million transactions DAILY

  INCLUSION
  ─────────
  • Works on ₹3,000 smartphones
  • Works on 2G networks
  • ₹1 transactions viable (no minimums)
  • Enabled 300 million+ previously unbanked Indians

  COST
  ────
  • Zero cost to consumers
  • Near-zero cost to small merchants
  • Saved billions in card network fees

  INNOVATION
  ──────────
  • VPA system (email for money)
  • Interoperable (any app, any bank)
  • Open standard (countries can adopt)
  • Built on existing bank infrastructure

  GLOBAL IMPACT
  ─────────────
  • 8+ countries accepting UPI
  • 10+ countries studying UPI for adoption
  • Model for BIS cross-border payment initiatives

  ────────────────────────────────────────────────────────────────────

  "UPI proved that with the right architecture, a developing nation
  can leapfrog decades of financial infrastructure and build
  something the developed world envies."

────────────────────────────────────────────────────────────────────────
Self-Assessment Checklist
After studying this case study, you should be able to:
Architecture:
- Explain the three-layer UPI architecture (Apps → NPCI → Banks)
- Design a VPA resolution system with caching
- Implement atomic transactions with rollback capability
Distributed Systems:
- Handle partial failures in multi-party transactions
- Implement circuit breakers for unreliable dependencies
- Design idempotency for payment systems
Scale:
- Calculate infrastructure needs for billion-transaction systems
- Plan for bursty traffic (festivals, events)
- Implement graceful degradation under load
Security:
- Design multi-factor authentication for payments
- Implement real-time fraud detection
- Understand HSM-based PIN verification
Operations:
- Define meaningful SLOs for payment systems
- Monitor multi-party systems (NPCI + 680 banks)
- Learn from production incidents (April 2025 outage)
Sources
Statistics and Data:
- NPCI Official UPI Statistics: https://www.npci.org.in/product/upi/product-statistics
- Business Standard - UPI December 2024 Data: https://www.business-standard.com/finance/news/upi-transactions-surge-to-record-16-73-bn-in-dec-value-at-rs-23-25-trn-125010100457_1.html
- DemandSage UPI Statistics 2025: https://www.demandsage.com/upi-statistics/
- Meetanshi UPI Statistics: https://meetanshi.com/blog/upi-statistics/
- GrabOn UPI Statistics: https://www.grabon.in/indulge/tech/upi-statistics/
Architecture and Technical Details:
- Wikipedia - Unified Payments Interface: https://en.wikipedia.org/wiki/Unified_Payments_Interface
- ByteByteGo - UPI Architecture: https://bytebytego.com/guides/unified-payments-interface-upi-in-india/
- GeeksforGeeks - Designing UPI System Design: https://www.geeksforgeeks.org/designing-upi-system-design/
- Medium - Deep Dive System Design of UPI: https://medium.com/@avinashkariya05910/deep-dive-system-design-of-upi-unified-payments-interface-eff3b0334b0d
- Brickendon Consulting - UPI Technical Overview: https://www.brickendon.com/insights/unified-payments-interface-upi/
- Dev.to - System Design UPI: https://dev.to/zeeshanali0704/system-design-upi-unified-payment-interface-2ng3
Infrastructure and Security:
- ITNews Asia - NPCI Data Center Modernization: https://www.itnews.asia/news/indias-npci-modernises-data-centres-using-kyndryls-cloud-services-592061
- NPCI Smart Data Center Press Release: https://www.npci.org.in/PDF/npci/press-releases/2020/NPCI_Press_Release-NPCI_to_launch_Smart_Data_Center_in_Hyderabad.pdf
- The420.in - UPI Security Infrastructure: https://the420.in/upi-digital-payments-infrastructure-security-npci-cpt-analysis/
- Blog - UPI Security Architecture Deep Dive: https://blog.akshanshjaiswal.com/the-upi-architecture-a-security-look
Settlement and Operations:
- BillCut - Settlement Latency Benchmarks: https://www.billcut.com/blogs/settlement-latency-benchmarks-whos-fastest/
- Razorpay - UPI Payout Processing: https://razorpay.com/blog/business-banking/payout-processing-imps-upi-transactions-deemed-success-npci/
- Inc42 - NPCI Real-Time Fix for UPI Failures: https://inc42.com/buzz/ncpi-working-on-real-time-fix-for-upi-transaction-failures/
- BIS Papers - Faster Digital Payments India: https://www.bis.org/publ/bppdf/bispap152_e_rh.pdf
Outage Analysis (April 2025):
- Wikipedia - UPI Outage Details: https://en.wikipedia.org/wiki/Unified_Payments_Interface
Further Reading
Official Documentation:
- NPCI Official Website: https://www.npci.org.in/
- NPCI UPI Product Page: https://www.npci.org.in/what-we-do/upi/product-overview
- RBI Payment Systems: https://www.rbi.org.in/Scripts/PaymentSystems_UM.aspx
- UPI Procedural Guidelines: https://yashada.org/yashada_2019/pdfs/e_library_cit/edpri_UPI_Procedural_Guidelines.pdf
Engineering Blogs and Technical Deep Dives:
- Razorpay Engineering Blog: https://razorpay.com/blog/ (Multiple articles on UPI integration)
- Paytm Engineering: https://paytm.com/blog/ (UPI transaction insights)
- ByteByteGo Newsletter: https://blog.bytebytego.com/ (System design breakdowns)
- LinkedIn Engineering Posts: Search "UPI Architecture" for practitioner insights
Research Papers and Reports:
- BIS Papers No. 152: Faster Digital Payments - Global and Regional Perspectives (India Chapter)
- NPCI White Papers: Available on NPCI website
- RBI Annual Reports: Digital payments statistics and trends
News and Industry Analysis:
- Medianama: https://www.medianama.com/ (Digital payments coverage)
- Economic Times Tech: https://economictimes.indiatimes.com/tech (Fintech news)
- Inc42: https://inc42.com/ (Startup and fintech coverage)
- LiveMint: https://www.livemint.com/ (Financial news)
Video Resources:
- NPCI YouTube Channel: Official explainers and announcements
- System Design Interview Videos: Search "UPI System Design" on YouTube
Books:
- "Designing Data-Intensive Applications" by Martin Kleppmann - Foundational concepts
- "System Design Interview" by Alex Xu - Interview preparation with similar patterns
Related Systems to Study:
- PIX (Brazil): Similar instant payment system
- FedNow (USA): US real-time payment system
- SEPA Instant (Europe): European instant payments
End of Bonus Problem 1: India's UPI
"A payment system that serves a billion people, handles trillions in transactions, and costs nothing to use. This is what engineering at scale looks like."