Bonus Problem 1: India's UPI
The World's Largest Real-Time Payment System
🇮🇳 A Revolution That Changed a Nation
In 2016, India launched an experiment. Could a country of 1.4 billion people, many unbanked, leapfrog decades of payment infrastructure and go directly to real-time digital payments?
Nine years later, the answer is extraordinary.
THE NUMBERS THAT DEFINE UPI (2025)
────────────────────────────────────────────────────────────────────────
  Metric                     Value
  Daily transactions         640+ million
  Monthly transactions       20+ billion
  Annual transactions        250+ billion
  Annual value               $3.4+ trillion (₹247 lakh crore)
  Participating banks        680+
  Active users               500+ million
  Average latency            ~270 milliseconds
  Success rate               99.2%
  Global share               50% of the world's digital transactions
  Countries accepting UPI    8+ (Singapore, UAE, France, Nepal, Bhutan, Sri Lanka...)
────────────────────────────────────────────────────────────────────────
For context:
• UPI processes MORE transactions than Visa and Mastercard COMBINED in India
• A tea vendor in rural India uses the same system as a Fortune 500 company
• Transactions as small as ₹1 (about 1 US cent) work with the same reliability as ₹1 Crore
• The system operates 24/7/365 with no maintenance windows
This is the system we'll design today.
The Interview Begins
You're interviewing at a fintech company. The principal architect draws on the whiteboard:
Interviewer: "India's UPI is considered one of the greatest achievements in financial technology. Countries around the world are trying to replicate it. Today, I want you to design a system like UPI from scratch."
────────────────────────────────────────────────────────────────────────
  Design a Real-Time Inter-Bank Payment System

  Build the infrastructure that enables instant money transfers
  between ANY two bank accounts in a country using just a phone.

  Requirements:
  • Support 500+ banks with different legacy systems
  • Handle 600+ million transactions per day
  • Complete each transaction in < 2 seconds end-to-end
  • 99.9% availability (< 9 hours downtime/year)
  • Zero tolerance for money loss (atomic transactions)
  • Work on basic smartphones with 2G/3G connectivity
  • Support both P2P (person-to-person) and P2M (person-to-merchant)
────────────────────────────────────────────────────────────────────────
Interviewer: "This is arguably one of the hardest system design problems: you're building financial infrastructure for a nation. Take your time."
Phase 1: Requirements Clarification
You: "Before I start, let me understand the constraints better."
Your Questions
You: "First, what's the relationship between banks and this central system? Do banks connect directly, or through intermediaries?"
Interviewer: "Banks connect through the central switch, which is NPCI in India's case. They don't talk to each other directly. The central system orchestrates everything."
You: "What about the mobile apps? Can any app connect to the system?"
Interviewer: "Apps must be approved and must partner with a bank. We call these Payment Service Providers or PSPs. PhonePe partners with Yes Bank, Google Pay with multiple banks. The app itself doesn't hold money; it's just an interface."
You: "How do users identify each other? Bank account numbers are long and error-prone."
Interviewer: "Great observation. UPI solved this with Virtual Payment Addresses, which work like an email address for money: username@bankname. The system maps this to actual bank accounts."
You: "What happens if a transaction fails mid-way? Say money is debited but not credited?"
Interviewer: "This is critical. The system MUST be atomic. Either the full transaction succeeds, or it's completely rolled back. Users cannot lose money due to technical failures."
You: "What's the peak traffic pattern? Is it bursty?"
Interviewer: "Very bursty. Evening hours see 2-3x average traffic. Festival seasons like Diwali can see 5x normal load. The system must handle these gracefully."
Requirements Summary
Functional Requirements:
1. USER MANAGEMENT
   • Register users via mobile number + bank account
   • Create and manage Virtual Payment Addresses (VPAs)
   • Link multiple bank accounts to one app
   • Two-factor authentication (device binding + PIN)
2. PAYMENT OPERATIONS
   • Push payments (I send money to you)
   • Pull payments / Collect requests (I request money from you)
   • QR code payments (scan and pay)
   • Recurring payments (autopay/mandates)
3. TRANSACTION PROCESSING
   • Real-time debit from sender's bank
   • Real-time credit to receiver's bank
   • Transaction status tracking
   • Refund/reversal handling
4. BANK INTEGRATION
   • Standardized APIs for all banks
   • VPA to account resolution
   • Balance inquiry (with consent)
   • Account validation
5. SETTLEMENT
   • Net settlement between banks (periodic)
   • Reconciliation and dispute handling
   • Audit trail for compliance
Non-Functional Requirements:
SCALE
   • 600+ million transactions/day
   • 20+ billion transactions/month
   • 500+ million active users
   • 680+ participating banks
LATENCY
   • End-to-end: < 2 seconds (p99)
   • NPCI switch processing: < 300ms
   • Bank response time: < 1 second
AVAILABILITY
   • 99.9% uptime (8.76 hours downtime/year max)
   • 24/7/365 operation
   • No scheduled maintenance windows
CONSISTENCY
   • ACID transactions (money can't be lost)
   • Exactly-once semantics
   • Atomic debit-credit operations
SECURITY
   • End-to-end encryption
   • Device binding
   • Multi-factor authentication
   • Fraud detection in real-time
Phase 2: Back of the Envelope Estimation
You: "Let me work through the numbers..."
Traffic Calculations
TRANSACTIONS PER SECOND
Daily transactions: 640,000,000
Seconds per day: 86,400
Average TPS: ~7,400 TPS
Peak multiplier: 3-5x (evenings, festivals)
Peak TPS: ~25,000-35,000 TPS
Per transaction, multiple operations:
├── VPA resolution: 1 lookup
├── Sender bank call: 1 API call
├── Receiver bank call: 1 API call
├── Audit logging: 1-2 writes
└── Notifications: 2 pushes
Effective operations/second: ~150,000+ at peak (6-7 operations × 25,000 TPS)
Data Volume
STORAGE REQUIREMENTS
Per transaction record:
├── Transaction ID: 36 bytes (UUID)
├── Sender VPA: 50 bytes
├── Receiver VPA: 50 bytes
├── Amount: 8 bytes
├── Timestamps: 16 bytes
├── Status: 4 bytes
├── Bank references: 100 bytes
├── Metadata: 200 bytes
└── Total: ~500 bytes
Daily storage:
├── Transactions: 640M × 500B = 320 GB/day
├── Audit logs: ~500 GB/day
└── Total: ~800 GB/day
Annual storage: ~300 TB/year
7-year retention: ~2 PB
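The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `estimate` and its parameters are illustrative, and the inputs are the figures quoted in this section (640 million transactions per day, ~500 bytes per record, ~500 GB of audit logs per day, a 3-5x peak multiplier).

```python
SECONDS_PER_DAY = 86_400


def estimate(daily_txns: int, peak_low: float, peak_high: float,
             txn_bytes: int, audit_gb_per_day: float) -> dict:
    """Back-of-the-envelope traffic and storage figures (decimal GB/TB/PB)."""
    avg_tps = daily_txns / SECONDS_PER_DAY
    txn_gb_per_day = daily_txns * txn_bytes / 1e9
    total_gb_per_day = txn_gb_per_day + audit_gb_per_day
    return {
        "avg_tps": avg_tps,
        "peak_tps": (avg_tps * peak_low, avg_tps * peak_high),
        "txn_gb_per_day": txn_gb_per_day,
        "total_gb_per_day": total_gb_per_day,
        "tb_per_year": total_gb_per_day * 365 / 1e3,
        "pb_7_years": total_gb_per_day * 365 * 7 / 1e6,
    }


result = estimate(
    daily_txns=640_000_000,
    peak_low=3, peak_high=5,
    txn_bytes=500,
    audit_gb_per_day=500,
)
for name, value in result.items():
    print(name, value)
```

The raw results (about 7,400 average TPS, 820 GB/day, just under 300 TB/year, roughly 2.1 PB over seven years) line up with the rounded numbers used in the estimate.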
Infrastructure Estimates
COMPUTE REQUIREMENTS
At 25,000 TPS peak, assuming each server handles 1,000 TPS:
├── API servers: 25+ servers (with redundancy: 50+)
├── Database: Clustered, sharded
├── Cache: Distributed Redis cluster
└── Message queues: High-throughput Kafka cluster
NETWORK
├── Connections to 680+ banks
├── Each bank: dedicated secure link
├── Geographic distribution: Multiple data centers
└── Bandwidth: Several Gbps
Phase 3: High-Level Architecture
You: "Let me draw how UPI actually works. It's a beautiful layered architecture."
The Three-Layer Cake
UPI ARCHITECTURE: THE THREE-LAYER CAKE

LAYER 1: USER INTERFACE
────────────────────────────────────────────────────────────────────────
  Apps:   PhonePe  |  Google Pay  |  Paytm  |  BHIM
             │
             ▼
  PSP (Payment Service Providers)
    • Apps must partner with a bank (PSP Bank) to access UPI
    • PhonePe → Yes Bank, Google Pay → Multiple banks
    • The PSP handles: User onboarding, VPA creation, UI/UX
             │
             ▼
LAYER 2: NPCI SWITCH
────────────────────────────────────────────────────────────────────────
  NPCI UPI PLATFORM
    [ VPA Mapper ]          [ Transaction Router ]   [ Fraud Detection ]
    [ Settlement Engine ]   [ Audit Trail ]          [ Dispute Resolution ]

    The brain of UPI: Routes transactions between banks
    NPCI NEVER holds money, it only routes information
             │
             ▼
LAYER 3: BANKING LAYER
────────────────────────────────────────────────────────────────────────
  Banks:  SBI  |  HDFC  |  ICICI  |  Axis  |  ... 680+ banks
             │
             ▼
  IMPS (Immediate Payment Service)
    • The settlement rail that actually moves money between banks
    • UPI transactions settle via IMPS under the hood
────────────────────────────────────────────────────────────────────────
KEY INSIGHT:
• Money flows: Bank → Bank (through IMPS)
• Information flows: App → PSP → NPCI → Banks
• NPCI is the orchestrator, not a money holder
Transaction Flow
You: "Let me trace a ₹500 payment from Alice to Bob..."
TRANSACTION FLOW: ALICE PAYS BOB ₹500

Participants: Alice's Phone (PhonePe App) | NPCI Switch | Banks

  Step  From → To              Message            Payload
  ①     Phone → NPCI           INITIATE           "Pay ₹500 to bob@okaxis" + Alice's VPA + Encrypted PIN
  ②     NPCI → Banks           RESOLVE VPA        "Who is bob@okaxis?"
  ③     Banks → NPCI           VPA RESPONSE       "Bob's account at Axis Bank: XXXX1234"
  ④     NPCI → HDFC            DEBIT REQUEST      "Debit ₹500 from Alice at HDFC" (Alice's Bank)
  ⑤     HDFC → NPCI            DEBIT RESPONSE     "Debited. Ref: ABC123"
  ⑥     NPCI → Axis            CREDIT REQUEST     "Credit ₹500 to Bob at Axis" (Bob's Bank)
  ⑦     Axis → NPCI            CREDIT RESPONSE    "Credited. Ref: XYZ789"
  ⑧     NPCI → Phone           SUCCESS            "Payment complete! Ref: TXN123456"

TOTAL TIME: < 2 seconds
PARALLEL ACTIONS:
• Audit log written at each step
• Fraud check runs during step ①
• Push notifications sent to both Alice and Bob
• Settlement record created for bank reconciliation
Phase 4: Deep Dives
Deep Dive 1: Virtual Payment Address (VPA) Resolution
Week 1 concepts: Partitioning, lookup optimization. Week 4 concepts: Caching.
You: "VPA resolution is called for EVERY transaction. With 640 million daily transactions, this lookup must be blazing fast."
The Challenge:
VPA RESOLUTION CHALLENGE
500+ million VPAs like:
├── alice@okhdfc
├── bob@okaxis
├── merchant@paytm
├── 9876543210@ybl
└── ...
Each VPA maps to:
├── Bank code
├── Account number (encrypted)
├── Account holder name
├── Status (active/blocked)
└── Metadata
Requirements:
├── Lookup latency: < 10ms
├── 100% accuracy (wrong mapping = money to wrong person!)
├── Real-time updates (user changes bank)
└── Handle 50,000+ lookups/second at peak
The Solution:
VPA MAPPER ARCHITECTURE

VPA RESOLUTION FLOW
────────────────────────────────────────────────────────────────────────
  VPA: bob@okaxis
      │
      ▼
  PARSE HANDLE        Extract: handle="bob", suffix="okaxis"
      │
      ▼
  SUFFIX LOOKUP       "okaxis" → Axis Bank Code, using the BANK REGISTRY:
                          okhdfc → HDFC
                          okaxis → Axis
                          paytm  → Paytm
                          ybl    → Yes Bank
      │
      ▼
  CACHE CHECK (Redis)
      ├── Cache hit:  go straight to RETURN ACCOUNT DETAILS
      └── Cache miss: QUERY BANK (Real-time) → Axis Bank VPA Database,
                      then update cache
      │
      ▼
  RETURN ACCOUNT DETAILS
      Bank:    Axis
      Account: ***234
      Name:    Bob Kumar
────────────────────────────────────────────────────────────────────────
KEY DESIGN DECISIONS:
1. DISTRIBUTED OWNERSHIP
   Each bank owns their VPA namespace (suffix)
   NPCI doesn't store all VPAs; banks do
   This scales naturally as banks handle their own data
2. CACHING STRATEGY
   Hot VPAs (frequently used) cached at NPCI
   TTL: 15-30 minutes
   Invalidation: Banks push updates for VPA changes
3. HANDLE UNIQUENESS
   handle@suffix is globally unique
   Banks ensure uniqueness within their namespace
   Cross-bank uniqueness handled by suffix differentiation
# vpa/resolver.py
"""
VPA Resolution Service
Maps Virtual Payment Addresses to actual bank accounts.
This is the most critical lookup in the entire system.
"""
from dataclasses import dataclass
from typing import Optional, Tuple
from datetime import datetime, timedelta


class BankTimeoutError(Exception):
    """Raised by the bank gateway when a bank does not answer in time."""


class BankUnavailableError(Exception):
    """Raised by the bank gateway when a bank cannot be reached."""


@dataclass
class AccountDetails:
    """Resolved account information."""
    bank_code: str
    account_number_masked: str  # Only last 4 digits visible
    account_holder_name: str
    ifsc_code: str
    is_active: bool
    verified_at: datetime


@dataclass
class VPAResolutionResult:
    """Result of VPA resolution."""
    success: bool
    account: Optional[AccountDetails] = None
    error_code: Optional[str] = None
    resolution_time_ms: float = 0
    cache_hit: bool = False


class VPAResolver:
    """
    Resolves VPAs to bank account details.
    Design principles:
    - Cache aggressively (VPAs don't change often)
    - Fail fast on invalid formats
    - Banks are the source of truth
    """

    def __init__(
        self,
        cache,          # Redis cluster
        bank_registry,  # Bank code -> Bank API mapping
        bank_gateway,   # Gateway to call bank APIs
        metrics
    ):
        self.cache = cache
        self.registry = bank_registry
        self.gateway = bank_gateway
        self.metrics = metrics
        # Cache settings
        self.cache_ttl = timedelta(minutes=30)
        self.negative_cache_ttl = timedelta(minutes=5)

    async def resolve(self, vpa: str) -> VPAResolutionResult:
        """
        Resolve a VPA to account details.
        VPA format: handle@suffix
        Example: alice@okhdfc, 9876543210@ybl
        """
        start_time = datetime.utcnow()

        # Step 1: Parse and validate VPA format
        parsed = self._parse_vpa(vpa)
        if not parsed:
            return VPAResolutionResult(
                success=False,
                error_code="INVALID_VPA_FORMAT"
            )
        handle, suffix = parsed

        # Step 2: Get bank code from suffix
        bank_code = self.registry.get_bank_for_suffix(suffix)
        if not bank_code:
            return VPAResolutionResult(
                success=False,
                error_code="UNKNOWN_VPA_SUFFIX"
            )

        # Step 3: Check cache
        # Build the key from the parsed (lowercased, stripped) parts so that
        # " Bob@OKAXIS " and "bob@okaxis" share one cache entry.
        cache_key = f"vpa:{handle}@{suffix}"
        cached = await self.cache.get(cache_key)
        if cached:
            if cached == "NOT_FOUND":
                return VPAResolutionResult(
                    success=False,
                    error_code="VPA_NOT_FOUND",
                    resolution_time_ms=self._elapsed_ms(start_time),
                    cache_hit=True
                )
            account = AccountDetails(**cached)
            return VPAResolutionResult(
                success=True,
                account=account,
                resolution_time_ms=self._elapsed_ms(start_time),
                cache_hit=True
            )

        # Step 4: Query the bank
        try:
            account = await self.gateway.resolve_vpa(
                bank_code=bank_code,
                handle=handle,
                suffix=suffix
            )
            if account:
                # Cache the result
                await self.cache.set(
                    cache_key,
                    account.__dict__,
                    ttl=self.cache_ttl
                )
                return VPAResolutionResult(
                    success=True,
                    account=account,
                    resolution_time_ms=self._elapsed_ms(start_time),
                    cache_hit=False
                )
            else:
                # Cache negative result (VPA doesn't exist)
                await self.cache.set(
                    cache_key,
                    "NOT_FOUND",
                    ttl=self.negative_cache_ttl
                )
                return VPAResolutionResult(
                    success=False,
                    error_code="VPA_NOT_FOUND",
                    resolution_time_ms=self._elapsed_ms(start_time)
                )
        except BankTimeoutError:
            return VPAResolutionResult(
                success=False,
                error_code="BANK_TIMEOUT"
            )
        except BankUnavailableError:
            return VPAResolutionResult(
                success=False,
                error_code="BANK_UNAVAILABLE"
            )

    def _parse_vpa(self, vpa: str) -> Optional[Tuple[str, str]]:
        """Parse VPA into handle and suffix."""
        if not vpa or '@' not in vpa:
            return None
        parts = vpa.lower().strip().split('@')
        if len(parts) != 2:
            return None
        handle, suffix = parts
        # Validate handle length (3-50 chars)
        if not handle or len(handle) < 3 or len(handle) > 50:
            return None
        # Validate suffix length (2-20 chars); the registry lookup in
        # resolve() checks that it is a registered bank suffix
        if not suffix or len(suffix) < 2 or len(suffix) > 20:
            return None
        return handle, suffix

    def _elapsed_ms(self, start: datetime) -> float:
        return (datetime.utcnow() - start).total_seconds() * 1000
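Design decision 2 mentions that banks push updates when a VPA changes, while the resolver relies only on TTL expiry. The sketch below shows one way the two could fit together. It uses an in-memory stand-in for Redis, and `TTLCache`, `on_bank_vpa_update`, and the injectable `clock` are hypothetical names introduced purely for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for the Redis cache (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: Dict[str, Tuple[Any, float]] = {}

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        # Store the value together with its absolute expiry time
        self._store[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss
            del self._store[key]
            return None
        return value

    def invalidate(self, key: str) -> None:
        self._store.pop(key, None)


def on_bank_vpa_update(cache: TTLCache, vpa: str) -> None:
    """Hypothetical handler for a bank's 'this VPA changed' push."""
    cache.invalidate(f"vpa:{vpa.strip().lower()}")


# Usage with a fake clock so expiry can be shown without sleeping
now = [0.0]
cache = TTLCache(clock=lambda: now[0])

cache.set("vpa:bob@okaxis", {"bank_code": "AXIS"}, ttl_seconds=1800)
hit_before = cache.get("vpa:bob@okaxis")
on_bank_vpa_update(cache, "Bob@OKAXIS")      # bank pushes a change
hit_after_push = cache.get("vpa:bob@okaxis")

cache.set("vpa:alice@okhdfc", {"bank_code": "HDFC"}, ttl_seconds=1800)
now[0] = 1801.0                              # 30 minutes and 1 second later
hit_after_ttl = cache.get("vpa:alice@okhdfc")
```

The TTL bounds how stale an entry can get if a push is lost; the push keeps the common case fresh without waiting for expiry.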
Deep Dive 2: Atomic Transactions, the Heart of Trust
Week 2 concepts: Idempotency, failure handling. Week 5 concepts: Distributed transactions, Saga pattern.
You: "The most critical requirement: money cannot disappear. If I debit Alice but fail to credit Bob, Alice must get her money back. Always."
The Challenge:
THE ATOMICITY CHALLENGE
Happy path:
  ① Debit Alice (HDFC): ₹500 ✓
  ② Credit Bob (Axis): ₹500 ✓
  → Success!
Failure scenarios:
SCENARIO A: Credit fails after debit
  ① Debit Alice: ₹500 ✓ (money left Alice's account)
  ② Credit Bob: TIMEOUT ✗ (did it go through or not?)
  → UNCERTAINTY! Alice lost ₹500?
SCENARIO B: Network partition
  ① Debit Alice: ✓
  ② Credit Bob: Request sent...
  ③ Network dies
  ④ We don't know the outcome!
  → UNCERTAINTY!
SCENARIO C: Duplicate request
  ① User clicks "Pay" twice quickly
  ② Two debit requests sent
  → DOUBLE DEBIT! Alice loses ₹1000?
A payment system CANNOT allow any of these outcomes.
The Solution:
UPI'S TRANSACTION STATE MACHINE

TRANSACTION STATE MACHINE
────────────────────────────────────────────────────────────────────────
  CREATED
     │  Validation passed
     ▼
  PENDING
     │  Send to remitter bank
     ▼
  DEBIT INITIATED ──────▶ FAILED (No debit happened)
     │
     ▼
  DEBITED (Money left sender)
     │  Send to beneficiary bank
     ▼
  CREDIT INITIATED ─────▶ COMPLETED (Money received)
     │             ─────▶ REVERSED (Money back to sender)
     │  Credit timeout/failure
     ▼
  DEEMED SUCCESS (Uncertain state)
     │  Auto-reverse after T+2
     ▼
  REVERSED (Money back to sender)

  DEEMED SUCCESS: Bank didn't respond in time.
  Settlement happens; if the credit actually failed, auto-reversal at T+2.
────────────────────────────────────────────────────────────────────────
# transaction/processor.py
"""
Transaction Processing with Atomic Guarantees.
This is the most critical code in the entire system.
Money cannot be lost under any circumstances.
"""
from dataclasses import dataclass
from typing import Optional
from enum import Enum
from datetime import datetime, timedelta
import uuid


class TransactionState(Enum):
    CREATED = "created"
    PENDING = "pending"
    DEBIT_INITIATED = "debit_initiated"
    DEBITED = "debited"
    CREDIT_INITIATED = "credit_initiated"
    COMPLETED = "completed"
    FAILED = "failed"
    DEEMED_SUCCESS = "deemed_success"
    REVERSED = "reversed"


@dataclass
class Transaction:
    """A UPI transaction record."""
    txn_id: str
    sender_vpa: str
    receiver_vpa: str
    amount: int  # In paise (smallest unit)
    state: TransactionState
    created_at: datetime
    updated_at: datetime
    # Bank references
    sender_bank_ref: Optional[str] = None
    receiver_bank_ref: Optional[str] = None
    # For idempotency
    idempotency_key: Optional[str] = None
    # Reversal tracking
    reversal_initiated: bool = False
    reversal_completed: bool = False


class TransactionProcessor:
    """
    Processes UPI transactions with atomic guarantees.
    Key principles:
    1. IDEMPOTENCY: Same request = same result (no double-debit)
    2. ATOMICITY: Either complete success or complete rollback
    3. DURABILITY: State persisted before any bank call
    4. RECOVERABILITY: Can resume from any failure point

    NOTE: the gateway used here is assumed to accept a VPA and a timeout.
    The lower-level BankGateway.debit in bank/gateway.py takes
    bank_code/account_ref instead, so a thin facade is implied.
    """

    def __init__(
        self,
        db,              # Transaction database
        bank_gateway,    # Bank API gateway
        reversal_queue,  # Queue for failed transactions
        audit_log
    ):
        self.db = db
        self.gateway = bank_gateway
        self.reversal_queue = reversal_queue
        self.audit = audit_log
        # Timeouts
        self.debit_timeout = timedelta(seconds=30)
        self.credit_timeout = timedelta(seconds=30)

    async def process(
        self,
        sender_vpa: str,
        receiver_vpa: str,
        amount: int,
        idempotency_key: str
    ) -> Transaction:
        """
        Process a payment transaction.
        CRITICAL: This method must be idempotent.
        Same idempotency_key = same result, always.
        """
        # STEP 0: Check idempotency
        existing = await self.db.get_by_idempotency_key(idempotency_key)
        if existing:
            # Return existing result (no reprocessing)
            await self.audit.log("IDEMPOTENT_RETURN", existing.txn_id)
            return existing

        # STEP 1: Create transaction record FIRST
        txn = Transaction(
            txn_id=str(uuid.uuid4()),
            sender_vpa=sender_vpa,
            receiver_vpa=receiver_vpa,
            amount=amount,
            state=TransactionState.CREATED,
            created_at=datetime.utcnow(),
            updated_at=datetime.utcnow(),
            idempotency_key=idempotency_key
        )
        # Persist BEFORE any bank call
        await self.db.save(txn)
        await self.audit.log("TXN_CREATED", txn.txn_id)

        try:
            # STEP 2: Initiate debit
            txn.state = TransactionState.DEBIT_INITIATED
            txn.updated_at = datetime.utcnow()
            await self.db.save(txn)
            debit_result = await self.gateway.debit(
                vpa=sender_vpa,
                amount=amount,
                txn_ref=txn.txn_id,
                timeout=self.debit_timeout
            )
            if not debit_result.success:
                if getattr(debit_result, "status", None) == "TIMEOUT":
                    # Outcome unknown: the bank may have applied the debit.
                    # Stay in DEBIT_INITIATED and let reconciliation decide.
                    await self.audit.log("DEBIT_TIMEOUT", txn.txn_id)
                    await self.reversal_queue.schedule_check(
                        txn.txn_id,
                        check_at=datetime.utcnow() + timedelta(hours=24)
                    )
                    return txn
                # Debit failed cleanly: no money moved
                txn.state = TransactionState.FAILED
                txn.updated_at = datetime.utcnow()
                await self.db.save(txn)
                await self.audit.log("DEBIT_FAILED", txn.txn_id,
                                     debit_result.error)
                return txn

            # STEP 3: Debit succeeded, record it
            txn.state = TransactionState.DEBITED
            txn.sender_bank_ref = debit_result.bank_reference
            txn.updated_at = datetime.utcnow()
            await self.db.save(txn)
            await self.audit.log("DEBIT_SUCCESS", txn.txn_id)

            # STEP 4: Initiate credit
            # CRITICAL: From this point, we MUST either complete or reverse
            txn.state = TransactionState.CREDIT_INITIATED
            txn.updated_at = datetime.utcnow()
            await self.db.save(txn)
            credit_result = await self.gateway.credit(
                vpa=receiver_vpa,
                amount=amount,
                txn_ref=txn.txn_id,
                timeout=self.credit_timeout
            )
            if credit_result.success:
                # SUCCESS! Transaction complete
                txn.state = TransactionState.COMPLETED
                txn.receiver_bank_ref = credit_result.bank_reference
                txn.updated_at = datetime.utcnow()
                await self.db.save(txn)
                await self.audit.log("TXN_COMPLETED", txn.txn_id)
                return txn
            elif credit_result.status == "TIMEOUT":
                # UNCERTAINTY: We don't know if credit happened
                # Mark as DEEMED_SUCCESS; settlement will clarify
                txn.state = TransactionState.DEEMED_SUCCESS
                txn.updated_at = datetime.utcnow()
                await self.db.save(txn)
                await self.audit.log("TXN_DEEMED_SUCCESS", txn.txn_id)
                # Schedule reconciliation check
                await self.reversal_queue.schedule_check(
                    txn.txn_id,
                    check_at=datetime.utcnow() + timedelta(hours=24)
                )
                return txn
            else:
                # Credit FAILED: must reverse the debit
                await self._initiate_reversal(txn)
                return txn
        except Exception as e:
            # Unexpected error: check state and recover
            await self.audit.log("TXN_ERROR", txn.txn_id, str(e))
            await self._handle_error(txn, e)
            raise

    async def _handle_error(self, txn: Transaction, error: Exception):
        """
        Minimal recovery sketch for unexpected exceptions.
        The last persisted state tells us whether money may have moved.
        """
        if txn.state == TransactionState.CREATED:
            # No bank call was made, so it is safe to fail outright
            txn.state = TransactionState.FAILED
            txn.updated_at = datetime.utcnow()
            await self.db.save(txn)
        else:
            # A bank call may have gone through: never guess, reconcile
            await self.reversal_queue.schedule_check(
                txn.txn_id,
                check_at=datetime.utcnow() + timedelta(hours=24)
            )

    async def _initiate_reversal(self, txn: Transaction):
        """
        Reverse a failed transaction.
        Credit the debited amount back to sender.
        """
        txn.reversal_initiated = True
        txn.updated_at = datetime.utcnow()
        await self.db.save(txn)
        await self.audit.log("REVERSAL_INITIATED", txn.txn_id)
        # Queue for reversal (handled by separate process)
        await self.reversal_queue.enqueue(txn.txn_id)

    async def process_reversal(self, txn_id: str):
        """
        Execute reversal: credit money back to sender.
        Called by reversal worker.
        """
        txn = await self.db.get(txn_id)
        if txn.reversal_completed:
            return  # Already reversed
        # Credit back to sender. The fixed "REV-" reference lets the bank
        # deduplicate if this worker retries.
        reversal_result = await self.gateway.credit(
            vpa=txn.sender_vpa,
            amount=txn.amount,
            txn_ref=f"REV-{txn.txn_id}",
            timeout=self.credit_timeout
        )
        if reversal_result.success:
            txn.state = TransactionState.REVERSED
            txn.reversal_completed = True
            txn.updated_at = datetime.utcnow()
            await self.db.save(txn)
            await self.audit.log("REVERSAL_COMPLETED", txn.txn_id)
        else:
            # Reversal failed: retry later
            # This is a critical alert scenario
            await self.audit.log("REVERSAL_FAILED", txn.txn_id,
                                 reversal_result.error)
            await self.reversal_queue.schedule_retry(
                txn.txn_id,
                retry_at=datetime.utcnow() + timedelta(minutes=15)
            )
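One subtlety in the idempotency check: `process` reads by key first and writes the record later, so two concurrent requests carrying the same key could both pass the check. A common remedy is to let the database enforce uniqueness and insert first. The sketch below illustrates the idea with the standard-library `sqlite3` module; the `transactions` table layout and the function name `claim_idempotency_key` are assumptions made for this example, not part of the processor above.

```python
import sqlite3
import uuid
from typing import Tuple


def open_db() -> sqlite3.Connection:
    """Create an in-memory table with a UNIQUE idempotency key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        " txn_id TEXT PRIMARY KEY,"
        " idempotency_key TEXT NOT NULL UNIQUE,"
        " state TEXT NOT NULL)"
    )
    return conn


def claim_idempotency_key(conn: sqlite3.Connection, key: str) -> Tuple[str, bool]:
    """
    Try to create a transaction row for this key.
    Returns (txn_id, created); created is False when the key was seen before.
    """
    txn_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO transactions (txn_id, idempotency_key, state) "
                "VALUES (?, ?, 'created')",
                (txn_id, key),
            )
        return txn_id, True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected the insert: return the original row
        row = conn.execute(
            "SELECT txn_id FROM transactions WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        return row[0], False


conn = open_db()
first_id, first_created = claim_idempotency_key(conn, "pay-alice-bob-001")
second_id, second_created = claim_idempotency_key(conn, "pay-alice-bob-001")
row_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```

Only the caller that gets `created == True` goes on to call the bank; every other caller receives the existing transaction id and can return its stored result.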
Interviewer: "What about the 'deemed success' state? That seems risky."
You: "Great catch. Here's how reconciliation handles it..."
DEEMED SUCCESS RECONCILIATION
Scenario: We debited Alice, tried to credit Bob, got TIMEOUT
At NPCI level:
├── Transaction marked DEEMED_SUCCESS
├── We don't know if Bob got money
└── Settlement file sent to banks includes this transaction
At Bank level (T+1 reconciliation):
├── Bank compares settlement file with actual credits
├── If credit happened: Mark as COMPLETED
└── If credit NOT happened: Mark as FAILED → Auto-reversal
Timing:
├── T+0: Transaction happens, deemed success
├── T+1: Banks reconcile, report actual status
├── T+2: NPCI updates final status
└── T+2: If failed, reversal initiated automatically
This is why UPI guidelines say:
"If money is debited but not credited, it will be
automatically reversed within 5 business days"
In practice, it's usually resolved within 24-48 hours.
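The T+1 comparison described above can be sketched as a simple set lookup. This assumes the beneficiary bank's report can be reduced to the set of transaction references it actually credited; the function name `reconcile` and the `COMPLETED` / `REVERSE` labels are illustrative only.

```python
from typing import Dict, Set


def reconcile(deemed_success: Set[str],
              bank_confirmed_credits: Set[str]) -> Dict[str, str]:
    """
    Decide the final action for every DEEMED_SUCCESS transaction.

    deemed_success:         txn ids NPCI marked as deemed success at T+0
    bank_confirmed_credits: txn ids the beneficiary bank reports as credited
    """
    outcome: Dict[str, str] = {}
    for txn_id in sorted(deemed_success):
        if txn_id in bank_confirmed_credits:
            outcome[txn_id] = "COMPLETED"   # credit did happen
        else:
            outcome[txn_id] = "REVERSE"     # credit never landed: refund sender
    return outcome


actions = reconcile(
    deemed_success={"TXN-1", "TXN-2", "TXN-3"},
    bank_confirmed_credits={"TXN-1", "TXN-3", "TXN-9"},
)
```

Here `TXN-2` is the only one queued for reversal; `TXN-9` is ignored because it was never in the deemed-success set.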
Deep Dive 3: Bank Integration at Scale
Week 2 concepts: Timeouts, circuit breakers. Week 3 concepts: Message queues.
You: "With 680+ banks, each with different legacy systems, bank integration is a massive challenge."
BANK INTEGRATION ARCHITECTURE

NPCI ↔ BANK GATEWAY
────────────────────────────────────────────────────────────────────────
  BANK ADAPTER LAYER
    Every bank exposes standard UPI APIs, but internal
    implementations vary wildly. The adapter handles this.
    ├── SBI Adapter:  handles SBI's quirks, retry logic, timeouts
    ├── HDFC Adapter: handles HDFC's quirks, retry logic, timeouts
    ├── Axis Adapter: handles Axis's quirks, retry logic, timeouts
    └── ...
        │
        ▼
  CIRCUIT BREAKER LAYER
    Per-bank circuit breakers prevent cascade failures
      SBI:   [CLOSED]  (healthy)
      HDFC:  [CLOSED]  (healthy)
      Axis:  [OPEN]    (failing, skip for 30s)
      ICICI: [HALF]    (testing recovery)
        │
        ▼
  SECURE COMMUNICATION LAYER
    - HTTPS with mutual TLS
    - Request/Response signing
    - Encryption of sensitive data
    - IP whitelisting
    - Dedicated leased lines to major banks
────────────────────────────────────────────────────────────────────────
# bank/gateway.py
"""
Bank Gateway: Unified interface to 680+ banks.
Each bank is different. This gateway provides a
consistent interface while handling per-bank quirks.
"""
from dataclasses import dataclass
from typing import Dict, Optional
from datetime import datetime, timedelta
from enum import Enum
class CircuitState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Bank failing, don't try
HALF_OPEN = "half" # Testing if bank recovered
@dataclass
class BankConfig:
"""Configuration for a bank."""
bank_code: str
endpoint: str
timeout_ms: int = 30000
# Circuit breaker settings
failure_threshold: int = 5
recovery_timeout_s: int = 30
# Bank-specific quirks
requires_padding: bool = False
amount_in_rupees: bool = False # Some banks want rupees, not paise
legacy_xml_format: bool = False
class BankGateway:
"""
Gateway for all bank operations.
"""
def __init__(
self,
http_client,
bank_configs: Dict[str, BankConfig],
metrics
):
self.http = http_client
self.configs = bank_configs
self.metrics = metrics
# Circuit breakers per bank
self.circuits: Dict[str, CircuitBreaker] = {
code: CircuitBreaker(config)
for code, config in bank_configs.items()
}
async def debit(
self,
bank_code: str,
account_ref: str,
amount: int,
txn_ref: str
) -> 'BankResponse':
"""
Debit an account at a bank.
Amount is in paise (smallest unit).
"""
config = self.configs.get(bank_code)
if not config:
return BankResponse(
success=False,
error_code="UNKNOWN_BANK"
)
# Check circuit breaker
circuit = self.circuits[bank_code]
if not circuit.can_execute():
self.metrics.increment("bank_circuit_open", bank_code)
return BankResponse(
success=False,
error_code="BANK_CIRCUIT_OPEN"
)
try:
# Build request (handle bank-specific formats)
request = self._build_debit_request(
config, account_ref, amount, txn_ref
)
# Make the call
start = datetime.utcnow()
response = await self.http.post(
config.endpoint + "/debit",
json=request,
timeout=config.timeout_ms / 1000
)
latency = (datetime.utcnow() - start).total_seconds() * 1000
# Record metrics
self.metrics.record_latency("bank_debit", bank_code, latency)
# Parse response
result = self._parse_response(config, response)
if result.success:
circuit.record_success()
else:
circuit.record_failure()
return result
except TimeoutError:
circuit.record_failure()
self.metrics.increment("bank_timeout", bank_code)
return BankResponse(
success=False,
error_code="TIMEOUT",
status="TIMEOUT"
)
except Exception as e:
circuit.record_failure()
self.metrics.increment("bank_error", bank_code)
return BankResponse(
success=False,
error_code="BANK_ERROR",
error_message=str(e)
)
def _build_debit_request(
self,
config: BankConfig,
account_ref: str,
amount: int,
txn_ref: str
) -> dict:
"""Build bank-specific request format."""
# Handle amount format (paise vs rupees)
if config.amount_in_rupees:
amount_value = amount / 100
else:
amount_value = amount
if config.legacy_xml_format:
# Some old banks still use XML
return {
"xml_payload": self._build_xml(
account_ref, amount_value, txn_ref
)
}
return {
"account_reference": account_ref,
"amount": amount_value,
"transaction_reference": txn_ref,
"timestamp": datetime.utcnow().isoformat()
}
class CircuitBreaker:
    """
    Circuit breaker for bank connections.
    Prevents cascade failures when a bank is down.
    """

    def __init__(self, config: BankConfig):
        self.config = config
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure: Optional[datetime] = None
        self.success_count = 0

    def can_execute(self) -> bool:
        """Check if we can make a request to this bank."""
        if self.state == CircuitState.CLOSED:
            return True

        if self.state == CircuitState.OPEN:
            # Check if recovery timeout passed
            if self.last_failure:
                elapsed = (datetime.utcnow() - self.last_failure).total_seconds()
                if elapsed > self.config.recovery_timeout_s:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                    return True
            return False

        if self.state == CircuitState.HALF_OPEN:
            return True

        return False

    def record_success(self):
        """Record a successful request."""
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= 3:  # 3 successes to close
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0

    def record_failure(self):
        """Record a failed request."""
        self.failure_count += 1
        self.last_failure = datetime.utcnow()

        if self.state == CircuitState.HALF_OPEN:
            # Back to open
            self.state = CircuitState.OPEN
            self.success_count = 0
        elif self.failure_count >= self.config.failure_threshold:
            self.state = CircuitState.OPEN

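To watch the CLOSED, OPEN, HALF_OPEN cycle play out, here is a minimal self-contained sketch. It is not the class above: `MiniBreaker`, the explicit `now` argument, and the thresholds (3 failures to open, 30 seconds to recover, 3 successes to close) are assumptions chosen for illustration. Passing the clock in explicitly only serves to make the timing easy to test.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class MiniBreaker:
    """Illustrative breaker; the caller supplies the current time in seconds."""

    def __init__(self, failure_threshold=3, recovery_timeout_s=30.0, close_after=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self.close_after = close_after
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    def can_execute(self, now):
        if self.state is State.OPEN:
            # After the cool-down, let probe requests through.
            if now - self.opened_at > self.recovery_timeout_s:
                self.state = State.HALF_OPEN
                self.successes = 0
                return True
            return False
        return True  # CLOSED and HALF_OPEN both allow calls

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.close_after:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self, now):
        self.failures += 1
        # Any failure while probing, or too many while closed, opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = now
            self.successes = 0


def simulate():
    breaker = MiniBreaker()
    trace = []
    for t in (0.0, 1.0, 2.0):                     # three consecutive failures
        breaker.record_failure(now=t)
    trace.append(breaker.state)                   # circuit is now open
    trace.append(breaker.can_execute(now=10.0))   # still cooling down
    trace.append(breaker.can_execute(now=40.0))   # cool-down elapsed, probe allowed
    trace.append(breaker.state)
    for _ in range(3):                            # three successful probes
        breaker.record_success()
    trace.append(breaker.state)
    return trace
```

Calling `simulate()` walks one bank through a full outage and recovery. The key design point is that a single failure during HALF_OPEN sends the circuit straight back to OPEN, so a flapping bank cannot drag down the switch.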
Deep Dive 4: Security - The Trust Foundation
Week 9 concepts: Security, authentication, fraud detection.
You: "UPI handles ₹20+ trillion monthly. Security isn't optional; it's existential."
UPI SECURITY ARCHITECTURE

MULTI-LAYER SECURITY

LAYER 1: DEVICE BINDING
• UPI PIN is bound to a specific device
• Device fingerprint (IMEI, hardware ID)
• SIM binding (mobile number verification)
• If the device changes, re-registration is required

LAYER 2: TWO-FACTOR AUTHENTICATION
Factor 1: Something you HAVE
• The registered mobile device
• The SIM card with the registered number

Factor 2: Something you KNOW
• 4-6 digit UPI PIN (set by user)
• PIN encrypted on device, never transmitted in clear

LAYER 3: ENCRYPTION
• HTTPS/TLS for all communication
• UPI PIN protected using PBKDF2 key derivation (600,000 iterations)
• PIN verification in bank's HSM (Hardware Security Module)
• End-to-end encryption for sensitive data

LAYER 4: TRANSACTION SIGNING
• Each transaction signed with a digital signature
• Prevents tampering in transit
• Non-repudiation for disputes

LAYER 5: REAL-TIME FRAUD DETECTION
• Velocity checks (too many transactions too fast)
• Amount anomaly detection
• Geo-location checks (impossible travel)
• Behavioral analysis
• Block list matching

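Layers 3 and 4 can be illustrated with the Python standard library. The sketch below is a teaching aid under stated assumptions, not how NPCI or any bank implements these layers: `derive_key`, `sign`, and `verify` are hypothetical helper names, the payload fields are invented, and HMAC-SHA256 is used as a symmetric stand-in for a real digital signature. An HMAC detects tampering but does not give non-repudiation, which needs an asymmetric scheme and keys held in secure hardware.

```python
import hashlib
import hmac
import json

PBKDF2_ITERATIONS = 600_000  # iteration count quoted in the layer list above


def derive_key(pin: str, salt: bytes) -> bytes:
    """Stretch a short PIN into a 32-byte key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", pin.encode(), salt, PBKDF2_ITERATIONS)


def canonical_bytes(txn: dict) -> bytes:
    """Serialize with sorted keys so both sides authenticate identical bytes."""
    return json.dumps(txn, sort_keys=True, separators=(",", ":")).encode()


def sign(txn: dict, key: bytes) -> str:
    """Return a hex tag over the canonical payload (HMAC stand-in for a signature)."""
    return hmac.new(key, canonical_bytes(txn), hashlib.sha256).hexdigest()


def verify(txn: dict, key: bytes, tag: str) -> bool:
    # compare_digest runs in constant time, so it does not leak where a mismatch occurs.
    return hmac.compare_digest(sign(txn, key), tag)
```

The high iteration count is the point of PBKDF2 here: a 4-6 digit PIN has at most a million possibilities, so each guess must be made expensive. Canonical serialization matters just as much, because a tag computed over differently ordered JSON would fail verification even for an untouched transaction.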
INFRASTRUCTURE SECURITY

NPCI DATA CENTERS
• Tier-IV certified (99.995% uptime)
• Chennai + Hyderabad (geographically separated)
• FIPS 140-2 Level 3 certified HSMs
• Active-Active configuration
• N+N redundancy
• Physical security:
  ├── Biometric access control
  ├── 24/7 security personnel
  ├── CCTV surveillance
  └── Man-trap entries

BANK CONNECTIONS
• Dedicated leased lines (not public internet)
• Mutual TLS authentication
• IP whitelisting
• Regular security audits

# security/fraud_detector.py
"""
Real-Time Fraud Detection for UPI.
Must decide in < 50ms whether to allow a transaction.
"""
from dataclasses import dataclass
from typing import List, Tuple
from datetime import datetime, timedelta


@dataclass
class FraudSignals:
    """Signals used for fraud detection."""
    user_id: str
    device_id: str
    amount: int
    receiver_vpa: str
    # Velocity
    txn_count_1hr: int
    txn_count_24hr: int
    total_amount_24hr: int
    # Device
    is_new_device: bool
    device_age_days: int
    # Behavioral
    is_new_receiver: bool
    typical_amount: int
    typical_time_of_day: List[int]
    current_hour: int
    # Location
    device_location: str
    usual_locations: List[str]


class FraudDetector:
    """
    Real-time fraud detection.
    Must be FAST (< 50ms) and ACCURATE (low false positives).
    """

    def __init__(self, ml_model, rules_engine, blocklist):
        self.model = ml_model
        self.rules = rules_engine
        self.blocklist = blocklist

    async def evaluate(
        self,
        signals: FraudSignals
    ) -> Tuple[str, float, List[str]]:
        """
        Evaluate fraud risk.
        Returns: (decision, confidence, triggered_rules)
        decision: "ALLOW", "BLOCK", "STEP_UP"
        """
        triggered_rules = []

        # RULE 1: Blocklist check (instant)
        if await self.blocklist.is_blocked(signals.user_id):
            return "BLOCK", 1.0, ["USER_BLOCKED"]
        if await self.blocklist.is_blocked(signals.device_id):
            return "BLOCK", 1.0, ["DEVICE_BLOCKED"]

        # RULE 2: Velocity checks
        if signals.txn_count_1hr > 10:
            triggered_rules.append("HIGH_VELOCITY_1HR")
        if signals.txn_count_24hr > 50:
            triggered_rules.append("HIGH_VELOCITY_24HR")
        if signals.total_amount_24hr > 100000_00:  # ₹1 lakh in paise
            triggered_rules.append("HIGH_AMOUNT_24HR")

        # RULE 3: Amount anomaly
        if signals.amount > signals.typical_amount * 10:
            triggered_rules.append("AMOUNT_ANOMALY")

        # RULE 4: New device
        if signals.is_new_device:
            triggered_rules.append("NEW_DEVICE")
            if signals.amount > 10000_00:  # > ₹10,000 on new device
                triggered_rules.append("HIGH_AMOUNT_NEW_DEVICE")

        # RULE 5: Unusual time
        if signals.current_hour not in signals.typical_time_of_day:
            triggered_rules.append("UNUSUAL_TIME")

        # RULE 6: Location check
        if signals.device_location not in signals.usual_locations:
            triggered_rules.append("UNUSUAL_LOCATION")

        # ML model for complex patterns
        ml_score = await self.model.predict(signals)

        # Decision logic
        if ml_score > 0.9 or len(triggered_rules) > 3:
            return "BLOCK", ml_score, triggered_rules
        if ml_score > 0.7 or len(triggered_rules) > 1:
            # Step-up: require additional verification
            return "STEP_UP", ml_score, triggered_rules
        return "ALLOW", 1 - ml_score, triggered_rules

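The detector above assumes that velocity fields such as `txn_count_1hr` and `total_amount_24hr` arrive precomputed. One way to produce them is a per-user sliding window. The sketch below is an in-memory illustration only: `VelocityTracker` and its method names are hypothetical, timestamps are plain seconds supplied by the caller, and a production system would more likely keep these counters in a shared low-latency store.

```python
from collections import defaultdict, deque

HOUR_S = 3600
DAY_S = 86400


class VelocityTracker:
    """Per-user sliding-window counters (in-memory, illustrative only)."""

    def __init__(self):
        # user_id -> deque of (timestamp_seconds, amount_paise), oldest first
        self._events = defaultdict(deque)

    def record(self, user_id: str, ts: float, amount_paise: int) -> None:
        self._events[user_id].append((ts, amount_paise))

    def _prune(self, user_id: str, now: float) -> None:
        # Drop events older than 24 hours so memory stays bounded per user.
        events = self._events[user_id]
        while events and events[0][0] <= now - DAY_S:
            events.popleft()

    def signals(self, user_id: str, now: float) -> dict:
        """Return the three velocity fields used by the rules above."""
        self._prune(user_id, now)
        events = self._events[user_id]
        last_hour = [amount for ts, amount in events if ts > now - HOUR_S]
        return {
            "txn_count_1hr": len(last_hour),
            "txn_count_24hr": len(events),
            "total_amount_24hr": sum(amount for _, amount in events),
        }
```

Because events are appended in time order, pruning only ever touches the left end of the deque, which keeps each lookup cheap enough to fit inside a tight latency budget.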
Phase 5: Scaling and Edge Cases
Interviewer: "What happens during Diwali when everyone is sending money?"
You: "UPI handles 5x spikes during festivals. Here's how..."
Festival Traffic Management
DIWALI SCALE (5X NORMAL TRAFFIC)
Normal day:
├── ~640 million transactions
├── ~7,400 average TPS
└── ~25,000 peak TPS
Diwali:
├── ~3 billion transactions
├── ~35,000 average TPS
├── ~150,000+ peak TPS
└── Concentrated in evening hours (7 PM - 11 PM)
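The average-TPS figures above follow directly from dividing daily volume by the 86,400 seconds in a day. The short sketch below reproduces that arithmetic and also computes the peak-to-average ratio, which is the number capacity planning actually has to cover. The function names are illustrative, and the inputs are the rounded figures quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_tps(transactions_per_day: int) -> float:
    """Average transactions per second if load were spread evenly over a day."""
    return transactions_per_day / SECONDS_PER_DAY


def peak_to_average(peak_tps: float, transactions_per_day: int) -> float:
    """How many times higher the peak is than the daily average."""
    return peak_tps / average_tps(transactions_per_day)


normal_avg = average_tps(640_000_000)       # about 7,400 TPS
diwali_avg = average_tps(3_000_000_000)     # about 34,700 TPS, quoted above as ~35,000
normal_ratio = peak_to_average(25_000, 640_000_000)
diwali_ratio = peak_to_average(150_000, 3_000_000_000)
```

With these inputs the peak is roughly 3.4 times the average on a normal day and roughly 4.3 times the average on Diwali, so festival load is not just larger but also more concentrated.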
PREPARATION (Weeks Before):
├── Pre-scale infrastructure to 3x capacity
├── Warm up caches with popular VPAs
├── Notify banks to scale their systems
├── Extended support staff on standby
└── Runbooks reviewed and tested
DURING THE EVENT:
├── Auto-scaling triggers at 60% capacity
├── Non-critical features disabled (promotional notifications)
├── Enhanced monitoring (5-second alert intervals)
├── War room with all bank representatives
└── Direct escalation paths to bank CTOs
GRACEFUL DEGRADATION:
If overwhelmed:
├── Prioritize smaller transactions (more users served)
├── Rate limit per user (max 5 txn/minute)
├── Queue non-urgent operations (mandate registrations)
└── Return "Try again in a few minutes" instead of a hard failure
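The per-user limit of 5 transactions per minute, combined with a "try again" response rather than a hard failure, can be sketched as a sliding-window limiter that also reports how long the caller should wait. `PerUserLimiter` and its `check` method are hypothetical names, state is kept in process memory for simplicity, and the caller supplies the clock.

```python
from collections import defaultdict, deque


class PerUserLimiter:
    """Sliding-window limiter: at most max_requests per window_s for each user."""

    def __init__(self, max_requests: int = 5, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def check(self, user_id: str, now: float):
        """Return (allowed, retry_after_seconds)."""
        hits = self._hits[user_id]
        # Forget accepted requests that have aged out of the window.
        while hits and hits[0] <= now - self.window_s:
            hits.popleft()
        if len(hits) < self.max_requests:
            hits.append(now)
            return True, 0.0
        # The oldest accepted request leaves the window at hits[0] + window_s.
        return False, hits[0] + self.window_s - now
```

Returning a retry-after value lets the app show a concrete wait time and keeps clients from retrying immediately, which would otherwise add load at exactly the wrong moment.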
Critical Edge Cases
EDGE CASE 1: Bank System Down
Problem: SBI (largest bank, 30% market share) goes down
Impact: Transactions involving SBI fail (roughly 30% of volume)
Solution:
├── Circuit breaker opens for SBI immediately
├── Return clear error: "SBI temporarily unavailable"
├── Pending transactions queued (if bank supports retry)
├── Status page updated
├── Auto-retry when circuit closes
└── Transactions involving SBI gracefully rejected
EDGE CASE 2: NPCI Switch Partial Failure
Problem: One NPCI data center fails
Impact: 50% capacity lost
Solution:
├── Active-Active setup in Chennai and Hyderabad
├── Traffic automatically routes to healthy DC
├── DNS TTL is low (60 seconds) for fast failover
├── Data replicated synchronously between DCs
└── RPO: 0 (no data loss), RTO: < 30 seconds
EDGE CASE 3: Duplicate Transaction Request
Problem: User's app times out and they retry, but the first request succeeded
Impact: Double debit
Solution:
├── Every transaction has an idempotency key
├── Generated on client: device_id + timestamp + amount + receiver
│   (timestamp is fixed when the payment is initiated, so a retry reuses the same key)
├── NPCI checks idempotency before processing
├── If duplicate: return original result
└── The same request is never processed twice
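The idempotency flow above can be sketched in a few lines. This is a minimal illustration, assuming the key is a SHA-256 hash of the four fields listed and that results are kept in an in-process dictionary; `idempotency_key` and `IdempotentProcessor` are hypothetical names. A real switch would need an atomic check-and-store in shared storage so that two concurrent duplicates cannot both pass the check.

```python
import hashlib


def idempotency_key(device_id: str, initiated_at_ms: int,
                    amount_paise: int, receiver_vpa: str) -> str:
    """Derive a stable key from fields fixed when the payment is initiated."""
    raw = f"{device_id}|{initiated_at_ms}|{amount_paise}|{receiver_vpa}"
    return hashlib.sha256(raw.encode()).hexdigest()


class IdempotentProcessor:
    """Runs the handler at most once per key and replays the stored result."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # key -> result of the first successful processing

    def process(self, key: str, request: dict) -> dict:
        if key in self._results:
            # Duplicate: return the original outcome without touching the bank again.
            return self._results[key]
        result = self._handler(request)
        self._results[key] = result
        return result
```

The client must compute the key once, before the first attempt, and resend that same key on every retry. If the timestamp were regenerated per attempt, each retry would look like a new payment and the protection would be lost.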
EDGE CASE 4: April 2025 Outage (Real Incident)
What happened:
├── Banks were calling "Check Transaction Status" API excessively
├── Some banks called for old transactions repeatedly
├── NPCI didn't enforce rate limits on this API
└── API flooded, entire system degraded
Lesson learned:
├── Rate limit ALL APIs, not just transaction APIs
├── Enforce guidelines at NPCI firewall, not just bank side
├── Separate critical path APIs from status check APIs
└── Circuit breaker for misbehaving banks
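The first and third lessons suggest enforcing a separate budget for every (bank, API) pair, so that one bank flooding the status API cannot consume capacity meant for payments or for other banks. The sketch below uses a token bucket for this. The class names, API names, and limits are illustrative assumptions, not NPCI's actual values, and the clock is passed in by the caller.

```python
class TokenBucket:
    """Token bucket: refills at rate_per_s, holds at most burst tokens."""

    def __init__(self, rate_per_s: float, burst: int, now: float = 0.0):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at the burst size.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ApiRateLimiter:
    """Keeps one bucket per (bank_code, api) so budgets are fully isolated."""

    def __init__(self, limits: dict):
        self._limits = limits   # api name -> (rate_per_s, burst)
        self._buckets = {}      # (bank_code, api) -> TokenBucket

    def allow(self, bank_code: str, api: str, now: float) -> bool:
        key = (bank_code, api)
        if key not in self._buckets:
            rate, burst = self._limits[api]
            self._buckets[key] = TokenBucket(rate, burst, now)
        return self._buckets[key].allow(now)
```

A limiter configured with a small budget for `check_status` and a large one for `pay` would reject a bank's excess status polls while leaving that bank's payment calls, and every other bank's calls, unaffected.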
Phase 6: Monitoring and Operations
You: "For a system processing ₹20 trillion monthly, monitoring isn't optional."
Key Metrics Dashboard
UPI OPERATIONS DASHBOARD

REAL-TIME HEALTH

Metric             Current      Target
Transaction Rate   7,842 TPS    10K
Success Rate       99.2%        99.0%
Latency (p99)      847 ms       1000 ms

BANK HEALTH MATRIX

Bank        TPS      Success   Latency   Circuit     Issues
SBI         2,341    98.7%     423ms     CLOSED      Minor lag
HDFC        1,856    99.8%     287ms     CLOSED      Healthy
ICICI       1,234    99.5%     312ms     CLOSED      Healthy
Axis        987      94.2%     892ms     HALF-OPEN   HIGH LATENCY
Kotak       654      99.1%     345ms     CLOSED      Healthy
Yes Bank    543      99.4%     298ms     CLOSED      Healthy
...

ALERTS (Last 1 hour)

🔴 14:32 - Axis Bank latency > 800ms (CRITICAL)
🟡 14:28 - SBI error rate 1.3% (Warning)
🟢 14:15 - Axis Bank circuit half-open (Info)
🟢 13:45 - Traffic spike +20% (Auto-scaled)

SLOs for UPI
UPI SERVICE LEVEL OBJECTIVES

SLO 1: TRANSACTION SUCCESS RATE
Target: 99.0% of transactions succeed
Measurement: Successful / Total (excluding user errors)
Current: 99.2%

Exclusions:
• Insufficient balance (user error)
• Wrong PIN (user error)
• Account blocked (compliance)

SLO 2: END-TO-END LATENCY
Target: 99% of transactions complete in < 2 seconds
Measurement: Time from request received to response sent
Current: p99 = 1.2 seconds

Breakdown:
• NPCI processing: < 300ms
• Bank response (each): < 800ms
• Network overhead: < 200ms

SLO 3: AVAILABILITY
Target: 99.9% uptime
Measurement: (Total time - Downtime) / Total time
Allowed downtime: about 8.76 hours/year

SLO 4: MONEY SAFETY
Target: 100% of debited amounts credited or reversed
Measurement: No money stuck > 5 business days
Current: 99.99% resolved within 24 hours

This is NON-NEGOTIABLE. Error budget = 0

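The SLO figures above can be checked with a few lines of arithmetic. The helpers below are illustrative names, not part of any monitoring library. The latency helper additionally assumes two sequential bank calls per payment (debit at the payer's bank, then credit at the payee's bank), which is one reading of "each" in the breakdown.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored


def allowed_downtime_hours(availability_target: float) -> float:
    """Yearly downtime budget implied by an availability target such as 0.999."""
    return (1.0 - availability_target) * HOURS_PER_YEAR


def success_rate(successful: int, total: int, excluded: int) -> float:
    """Success rate over eligible transactions, with excluded ones removed from the total."""
    eligible = total - excluded
    if eligible <= 0:
        raise ValueError("no eligible transactions")
    return successful / eligible


def worst_case_latency_ms(npci_ms: int = 300, bank_ms: int = 800,
                          bank_calls: int = 2, network_ms: int = 200) -> int:
    """Sum of per-component upper bounds, assuming sequential bank calls."""
    return npci_ms + bank_ms * bank_calls + network_ms
```

Under the two-call assumption the component limits add up to 2,100 ms, slightly above the 2-second target. That is not necessarily a contradiction, since all components rarely hit their limits at once, but it shows why the end-to-end SLO is stated as a percentile rather than derived by summing per-component caps.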
Interview Conclusion
Interviewer: "Impressive depth. A few rapid-fire questions:"
Interviewer: "Why didn't India just adopt an existing system like Visa/Mastercard?"
You: "Three reasons:
- Cost: Card networks charge 1.5-3% per transaction. UPI is nearly free.
- Inclusion: Cards need credit checks, plastic production, POS terminals. UPI needs only a phone.
- Control: Critical financial infrastructure shouldn't depend on foreign companies.
The result: UPI enabled the chai vendor to accept digital payments for a ₹10 tea."
Interviewer: "What's the biggest technical achievement of UPI?"
You: "Interoperability without centralized money holding. NPCI routes transactions but never touches the money. This means:
- No counterparty risk (NPCI can't go bankrupt with your money)
- Banks remain the regulated entities
- Scales horizontally (NPCI is just a switch that routes messages, not a ledger holding balances)
- Any app works with any bank
This architecture is why countries worldwide are studying UPI."
Interviewer: "If you were to improve UPI today, what would you change?"
You: "Based on the April 2025 outage:
- Stricter rate limiting at NPCI level, not trusting banks to self-regulate
- Better isolation between critical transaction APIs and status check APIs
- More granular circuit breakers: per-API, not just per-bank
- Chaos engineering: regularly test failure scenarios in production"
Summary: Concepts Applied from 10-Week Course

CONCEPTS FROM 10-WEEK COURSE IN UPI DESIGN

WEEK 1: DATA AT SCALE
├── Partitioning: VPAs partitioned by bank suffix
├── Replication: Multi-DC active-active setup
└── Read optimization: VPA caching at NPCI level

WEEK 2: FAILURE-FIRST DESIGN
├── Timeouts: Strict timeouts for bank calls (30s)
├── Circuit breakers: Per-bank failure isolation
├── Idempotency: Transaction idempotency keys
└── Retries: Smart retry with exponential backoff

WEEK 3: MESSAGING & ASYNC
├── Transactional outbox: Audit logging
├── Dead letter queues: Failed reversal handling
└── Event streaming: Transaction events for reconciliation

WEEK 4: CACHING
├── VPA resolution caching
├── Bank configuration caching
└── Negative caching: Non-existent VPAs

WEEK 5: CONSISTENCY & COORDINATION
├── Distributed transactions: Debit-then-credit with rollback
├── State machine: Transaction lifecycle management
└── Exactly-once semantics: Idempotency guarantees

WEEK 9: SECURITY & COMPLIANCE
├── Multi-factor authentication: Device + PIN
├── Encryption: PBKDF2, HSM-based PIN verification
├── Fraud detection: Real-time ML scoring
└── Audit trail: Complete transaction logging

WEEK 10: PRODUCTION READINESS
├── SLOs: Success rate, latency, availability targets
├── Observability: Per-bank health dashboards
├── Capacity planning: Festival traffic handling
└── Incident management: April 2025 outage learnings

Why UPI Matters

WHY UPI IS A MARVEL OF ENGINEERING

SCALE
• 50% of world's digital transactions
• More than Visa + Mastercard combined (in India)
• 640+ million transactions DAILY

INCLUSION
• Works on ₹3,000 smartphones
• Works on 2G networks
• ₹1 transactions viable (no minimums)
• Enabled 300 million+ previously unbanked Indians

COST
• Zero cost to consumers
• Near-zero cost to small merchants
• Saved billions in card network fees

INNOVATION
• VPA system (email for money)
• Interoperable (any app, any bank)
• Open standard (countries can adopt)
• Built on existing bank infrastructure

GLOBAL IMPACT
• 8+ countries accepting UPI
• 10+ countries studying UPI for adoption
• Model for BIS cross-border payment initiatives

"UPI proved that with the right architecture, a developing nation
can leapfrog decades of financial infrastructure and build
something the developed world envies."

Self-Assessment Checklist
After studying this case study, you should be able to:
Architecture:
- Explain the three-layer UPI architecture (Apps → NPCI → Banks)
- Design a VPA resolution system with caching
- Implement atomic transactions with rollback capability
Distributed Systems:
- Handle partial failures in multi-party transactions
- Implement circuit breakers for unreliable dependencies
- Design idempotency for payment systems
Scale:
- Calculate infrastructure needs for billion-transaction systems
- Plan for bursty traffic (festivals, events)
- Implement graceful degradation under load
Security:
- Design multi-factor authentication for payments
- Implement real-time fraud detection
- Understand HSM-based PIN verification
Operations:
- Define meaningful SLOs for payment systems
- Monitor multi-party systems (NPCI + 680 banks)
- Learn from production incidents (April 2025 outage)
Sources
Statistics and Data:
- NPCI Official UPI Statistics: https://www.npci.org.in/product/upi/product-statistics
- Business Standard - UPI December 2024 Data: https://www.business-standard.com/finance/news/upi-transactions-surge-to-record-16-73-bn-in-dec-value-at-rs-23-25-trn-125010100457_1.html
- DemandSage UPI Statistics 2025: https://www.demandsage.com/upi-statistics/
- Meetanshi UPI Statistics: https://meetanshi.com/blog/upi-statistics/
- GrabOn UPI Statistics: https://www.grabon.in/indulge/tech/upi-statistics/
Architecture and Technical Details:
- Wikipedia - Unified Payments Interface: https://en.wikipedia.org/wiki/Unified_Payments_Interface
- ByteByteGo - UPI Architecture: https://bytebytego.com/guides/unified-payments-interface-upi-in-india/
- GeeksforGeeks - Designing UPI System Design: https://www.geeksforgeeks.org/designing-upi-system-design/
- Medium - Deep Dive System Design of UPI: https://medium.com/@avinashkariya05910/deep-dive-system-design-of-upi-unified-payments-interface-eff3b0334b0d
- Brickendon Consulting - UPI Technical Overview: https://www.brickendon.com/insights/unified-payments-interface-upi/
- Dev.to - System Design UPI: https://dev.to/zeeshanali0704/system-design-upi-unified-payment-interface-2ng3
Infrastructure and Security:
- ITNews Asia - NPCI Data Center Modernization: https://www.itnews.asia/news/indias-npci-modernises-data-centres-using-kyndryls-cloud-services-592061
- NPCI Smart Data Center Press Release: https://www.npci.org.in/PDF/npci/press-releases/2020/NPCI_Press_Release-NPCI_to_launch_Smart_Data_Center_in_Hyderabad.pdf
- The420.in - UPI Security Infrastructure: https://the420.in/upi-digital-payments-infrastructure-security-npci-cpt-analysis/
- Blog - UPI Security Architecture Deep Dive: https://blog.akshanshjaiswal.com/the-upi-architecture-a-security-look
Settlement and Operations:
- BillCut - Settlement Latency Benchmarks: https://www.billcut.com/blogs/settlement-latency-benchmarks-whos-fastest/
- Razorpay - UPI Payout Processing: https://razorpay.com/blog/business-banking/payout-processing-imps-upi-transactions-deemed-success-npci/
- Inc42 - NPCI Real-Time Fix for UPI Failures: https://inc42.com/buzz/ncpi-working-on-real-time-fix-for-upi-transaction-failures/
- BIS Papers - Faster Digital Payments India: https://www.bis.org/publ/bppdf/bispap152_e_rh.pdf
Outage Analysis (April 2025):
- Wikipedia - UPI Outage Details: https://en.wikipedia.org/wiki/Unified_Payments_Interface
Further Reading
Official Documentation:
- NPCI Official Website: https://www.npci.org.in/
- NPCI UPI Product Page: https://www.npci.org.in/what-we-do/upi/product-overview
- RBI Payment Systems: https://www.rbi.org.in/Scripts/PaymentSystems_UM.aspx
- UPI Procedural Guidelines: https://yashada.org/yashada_2019/pdfs/e_library_cit/edpri_UPI_Procedural_Guidelines.pdf
Engineering Blogs and Technical Deep Dives:
- Razorpay Engineering Blog: https://razorpay.com/blog/ (Multiple articles on UPI integration)
- Paytm Engineering: https://paytm.com/blog/ (UPI transaction insights)
- ByteByteGo Newsletter: https://blog.bytebytego.com/ (System design breakdowns)
- LinkedIn Engineering Posts: Search "UPI Architecture" for practitioner insights
Research Papers and Reports:
- BIS Papers No. 152: Faster Digital Payments - Global and Regional Perspectives (India Chapter)
- NPCI White Papers: Available on NPCI website
- RBI Annual Reports: Digital payments statistics and trends
News and Industry Analysis:
- Medianama: https://www.medianama.com/ (Digital payments coverage)
- Economic Times Tech: https://economictimes.indiatimes.com/tech (Fintech news)
- Inc42: https://inc42.com/ (Startup and fintech coverage)
- LiveMint: https://www.livemint.com/ (Financial news)
Video Resources:
- NPCI YouTube Channel: Official explainers and announcements
- System Design Interview Videos: Search "UPI System Design" on YouTube
Books:
- "Designing Data-Intensive Applications" by Martin Kleppmann - Foundational concepts
- "System Design Interview" by Alex Xu - Interview preparation with similar patterns
Related Systems to Study:
- PIX (Brazil): Similar instant payment system
- FedNow (USA): US real-time payment system
- SEPA Instant (Europe): European instant payments
End of Bonus Problem 1: India's UPI
"A payment system that serves a billion people, handles trillions in transactions, and costs nothing to use. This is what engineering at scale looks like."