Week 7 Preview: Designing a Search System
System Design Mastery Series — Building Blocks Week
Week Overview
Theme: "Making data findable at scale"
This week we design search infrastructure from the ground up. Search is deceptively complex — it's not just "put data in Elasticsearch." Understanding search deeply means understanding inverted indexes, relevance scoring, query optimization, and the operational challenges of keeping indexes fresh and fast.
The System: A product search platform for an e-commerce site with 50M products, 10K queries per second, and sub-200ms latency requirements.
Why Search Systems Matter
Most applications eventually need search. Compare the same query with and without dedicated search infrastructure:
WITHOUT PROPER SEARCH
User: "red running shoes nike size 10"
SELECT * FROM products
WHERE name LIKE '%red%'
AND name LIKE '%running%'
AND name LIKE '%shoes%'
AND name LIKE '%nike%'
AND size = 10;
Problems:
├── Full table scan (50M rows)
├── No relevance ranking
├── No typo tolerance ("nikee" → 0 results)
├── No synonym handling ("sneakers" vs "shoes")
├── 30+ second query time
└── Database falls over
WITH SEARCH INFRASTRUCTURE
User: "red running shoes nike size 10"
Search engine returns in 50ms:
├── Nike Air Zoom Pegasus (red, size 10) — Score: 0.95
├── Nike Free Run (red, size 10) — Score: 0.91
├── Nike Revolution (crimson, size 10) — Score: 0.87
└── ... 234 more results
Features:
├── Instant results (inverted index)
├── Relevance ranking (BM25 + boosts)
├── Typo tolerance (fuzzy matching)
├── Synonym expansion
├── Faceted filtering
└── Autocomplete suggestions
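The "instant results" above come from the inverted index: instead of scanning every row for a substring, the engine looks up each query term and intersects the matching document IDs. A minimal sketch, assuming whitespace tokenization and a three-product toy catalog (both are illustrative simplifications, not how a production analyzer works):

```python
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; real analyzers do much more."""
    return text.lower().split()


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index: dict[str, set[int]], query: str) -> set[int]:
    """AND semantics: intersect the posting sets of every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings)


# Toy catalog (hypothetical data for illustration only).
docs = {
    1: "Nike Air Zoom Pegasus red running shoes",
    2: "Nike Free Run red running shoes",
    3: "Adidas Ultraboost blue running shoes",
}
index = build_inverted_index(docs)
print(sorted(search_all_terms(index, "red running shoes nike")))  # [1, 2]
```

The lookup cost depends on the number of query terms and the size of their posting sets, not on the total number of products. Note that this sketch still returns nothing for "nikee"; typo tolerance needs fuzzy matching on top.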
What Makes Search Hard?
1. The Indexing Problem
WRITE PATH CHALLENGES
Source of Truth: PostgreSQL (50M products)
Search Index: Elasticsearch
Challenges:
├── How do you keep them in sync?
├── Product updated → How fast must search reflect it?
├── Bulk imports (1M new products) → Don't overwhelm the index
├── Schema changes → Reindex without downtime?
└── Deleted products → Must disappear from search immediately
2. The Relevance Problem
RELEVANCE IS SUBJECTIVE
Query: "apple"
User A (shopping for fruit):
Wants: Fresh apples, apple juice, apple pie
User B (shopping for electronics):
Wants: iPhone, MacBook, AirPods
Same query, completely different intent.
How do you rank results?
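One common way to handle ambiguous intent is to blend the text-match score with a per-user signal. The sketch below is only an illustration: `category_affinity`, `text_score`, and `weight` are hypothetical names, and the multiplicative boost is one of several possible blending choices.

```python
def personalized_rank(candidates, category_affinity, weight=0.5):
    """Re-rank candidates by text score boosted by the user's category affinity.

    candidates: list of dicts with "name", "category", "text_score" (assumed shape).
    category_affinity: dict mapping category -> affinity in [0, 1] (assumed signal).
    """
    def final_score(item):
        boost = 1.0 + weight * category_affinity.get(item["category"], 0.0)
        return item["text_score"] * boost

    return sorted(candidates, key=final_score, reverse=True)


# Both products match the query "apple" equally well on text alone.
candidates = [
    {"name": "Fresh Gala apples", "category": "grocery", "text_score": 1.0},
    {"name": "Apple iPhone", "category": "electronics", "text_score": 1.0},
]

user_a = {"grocery": 0.9}       # mostly buys food
user_b = {"electronics": 0.8}   # mostly buys gadgets

print(personalized_rank(candidates, user_a)[0]["name"])  # Fresh Gala apples
print(personalized_rank(candidates, user_b)[0]["name"])  # Apple iPhone
```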
3. The Scale Problem
SCALE CHALLENGES
50M documents × 20 fields = 1B field values to index
10K queries/sec × 100ms avg latency ≈ 1,000 queries in flight at any moment
Index size: 500GB+
Hot data should fit in RAM (OS page cache) for speed
Questions:
├── How many shards?
├── How many replicas?
├── How do you handle hot queries?
└── What happens during reindexing?
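Little's law gives a quick estimate of concurrency: requests in flight equal arrival rate multiplied by average time in the system. A minimal sketch using this section's numbers, assuming 100 ms is the average query latency (the function name is illustrative):

```python
def in_flight_queries(queries_per_sec: float, avg_latency_ms: float) -> float:
    """Little's law: L = arrival rate * average time in system."""
    return queries_per_sec * avg_latency_ms / 1000


documents = 50_000_000
fields_per_document = 20

total_field_values = documents * fields_per_document
concurrent_queries = in_flight_queries(10_000, 100)

print(total_field_values)   # 1000000000
print(concurrent_queries)   # 1000.0
```

The concurrency figure matters for sizing search thread pools and connection limits: if latency degrades to 1 second under load, the same traffic holds ten times as many queries open.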
Week 7 Learning Objectives
By the end of this week, you will be able to:
SEARCH FUNDAMENTALS
├── Explain how inverted indexes work
├── Understand tokenization, stemming, and analysis
├── Design document schemas for search
└── Choose between Elasticsearch, OpenSearch, Solr, or custom
INDEXING PIPELINES
├── Design CDC-based indexing (Debezium pattern)
├── Weigh batch vs streaming indexing trade-offs
├── Handle schema evolution without downtime
└── Build reindexing strategies
QUERY OPTIMIZATION
├── Understand BM25 and TF-IDF scoring
├── Implement query-time boosting
├── Choose between filter and query clauses
└── Optimize for latency at scale
ADVANCED FEATURES
├── Build autocomplete with edge n-grams
├── Implement "did you mean" suggestions
├── Design faceted search and aggregations
└── Handle multi-language search
OPERATIONS
├── Size clusters appropriately
├── Monitor search health
├── Handle index corruption and recovery
└── Plan capacity for growth
The System We're Building
Product Search Platform
┌─────────────────────────────────────────────────────────────────────────┐
│ PRODUCT SEARCH PLATFORM │
│ │
│ SCALE │
│ ├── Products: 50 million │
│ ├── Queries: 10,000/second (peak: 30K during sales) │
│ ├── Index size: ~500GB │
│ └── Freshness: Products searchable within 5 minutes of update │
│ │
│ FEATURES │
│ ├── Full-text search with relevance ranking │
│ ├── Filtered search (category, brand, price range, ratings) │
│ ├── Faceted navigation (show counts per filter) │
│ ├── Autocomplete (as-you-type suggestions) │
│ ├── Typo tolerance ("iphoen" → "iphone") │
│ ├── Synonyms ("couch" = "sofa", "tv" = "television") │
│ └── Personalized ranking (user preferences boost) │
│ │
│ REQUIREMENTS │
│ ├── Latency: p99 < 200ms │
│ ├── Availability: 99.9% │
│ ├── Consistency: Eventually consistent (5 min max) │
│ └── Accuracy: High relevance for top 10 results │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Architecture Preview
┌─────────────────────────────────────────────────────────────────────────┐
│ HIGH-LEVEL ARCHITECTURE │
│ │
│ ┌─────────────┐ │
│ │ Client │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Search API │ │
│ └──────┬──────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ Query │ │ Search │ │ Suggest │ │
│ │ Parser │ │ Router │ │ Service │ │
│ └─────┬─────┘ └──────┬──────┘ └─────┬─────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Elasticsearch │ │
│ │ Cluster │ │
│ │ ┌───┬───┬───┐ │ │
│ │ │ S1│ S2│ S3│ │ (Shards) │
│ │ └───┴───┴───┘ │ │
│ └────────┬────────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Indexing │ │ Kafka │ │ PostgreSQL │ │
│ │ Pipeline │◄─────│ (CDC) │◄──────│ (Source) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Daily Breakdown
Day 1: Search Fundamentals & Architecture
Theme: "How search engines think"
TOPICS
├── Inverted indexes: The core data structure
├── Text analysis: Tokenization, stemming, normalization
├── Document modeling: Designing for search
├── Elasticsearch vs alternatives: When to use what
└── Initial architecture and component design
KEY QUESTIONS
├── Why can't we just use SQL LIKE?
├── What's an inverted index and why is it fast?
├── How do you decide what fields to index?
└── Elasticsearch vs OpenSearch vs Solr vs Algolia?
DELIVERABLES
├── Inverted index implementation (conceptual)
├── Product document schema design
├── Text analyzer configuration
└── Cluster sizing estimation
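The text analysis step (tokenization, normalization, stemming) decides which terms end up in the inverted index. The sketch below is deliberately crude: the stop-word list is a small sample and `crude_stem` is a toy suffix stripper, not the Porter or Snowball stemmer a real analyzer would use.

```python
import re

# Tiny sample list; real analyzers ship per-language stop-word sets.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}


def tokenize(text: str) -> list[str]:
    """Normalize to lowercase and extract alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token: str) -> str:
    """Strip a few common suffixes, keeping at least three characters."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Tokenize, drop stop words, then stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(analyze("The Red Running Shoes for Men"))  # ['red', 'runn', 'shoe', 'men']
```

The key property is that the same analysis runs at index time and at query time, so "Shoes" in a product title and "shoe" in a query reduce to the same term. The output "runn" shows why naive suffix stripping is not enough for production use.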
Day 2: Indexing Pipeline
Theme: "Keeping search fresh"
TOPICS
├── Change Data Capture (CDC) with Debezium
├── Streaming vs batch indexing trade-offs
├── Bulk indexing strategies
├── Schema evolution and reindexing
└── Handling deletes and updates
KEY QUESTIONS
├── How do you sync PostgreSQL → Elasticsearch?
├── What happens if indexing falls behind?
├── How do you reindex 50M documents without downtime?
└── What if a product is deleted but still in the index?
DELIVERABLES
├── CDC pipeline implementation
├── Bulk indexing service
├── Zero-downtime reindex strategy
└── Index aliasing pattern
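Two of these deliverables can be sketched as plain data transformations. The first turns change events into the newline-delimited body expected by the Elasticsearch `_bulk` API, including deletes. The second builds an `_aliases` request body that moves an alias from the old index to the new one in a single call. This assumes events are already unwrapped into Debezium-style `op`/`before`/`after` payloads and that the primary key column is named `id`; the index and alias names are hypothetical.

```python
import json


def to_bulk_body(events: list[dict], index_name: str) -> str:
    """Convert change events into an NDJSON body for the _bulk API."""
    lines = []
    for event in events:
        op = event["op"]
        if op in ("c", "u", "r"):  # create, update, snapshot read
            row = event["after"]
            action = {"index": {"_index": index_name, "_id": str(row["id"])}}
            lines.append(json.dumps(action))
            lines.append(json.dumps(row))  # document source follows the action
        elif op == "d":  # delete: action line only, no source line
            row = event["before"]
            action = {"delete": {"_index": index_name, "_id": str(row["id"])}}
            lines.append(json.dumps(action))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases request that repoints the alias in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Nike Free Run"}},
    {"op": "d", "before": {"id": 2, "name": "Discontinued shoe"}, "after": None},
]
print(to_bulk_body(events, "products_v2"), end="")
print(json.dumps(alias_swap_body("products", "products_v1", "products_v2")))
```

Because queries go to the alias rather than a concrete index name, a full reindex can build `products_v2` in the background and switch traffic only once it has caught up.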
Day 3: Query Processing & Relevance
Theme: "Finding the needle and ranking the haystack"
TOPICS
├── Query parsing and understanding
├── BM25 scoring algorithm
├── Boosting and custom scoring
├── Filters vs queries performance
└── Query optimization techniques
KEY QUESTIONS
├── How does Elasticsearch rank results?
├── When do you use filter vs query?
├── How do you boost certain products (sponsored, popular)?
└── How do you handle "no results" gracefully?
DELIVERABLES
├── Query parser implementation
├── Relevance tuning configuration
├── Function score queries
└── Query performance optimization
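To make BM25 concrete, here is a small sketch of the textbook formula with a Lucene-style IDF. It assumes documents are already tokenized and uses k1 = 1.2 and b = 0.75, which are the Elasticsearch defaults. Lucene's production implementation differs in details such as length encoding, so treat this as an illustration of the idea rather than a reproduction of engine scores.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by length (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["nike", "red", "running", "shoes"],
    ["adidas", "blue", "running", "shoes"],
    ["nike", "nike", "socks"],
]
scores = bm25_scores(["nike", "red"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```

Document 0 wins because it matches both terms and "red" is rare. Document 2 repeats "nike" but gains less than double from the repetition, which is the saturation effect that separates BM25 from raw TF-IDF.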
Day 4: Advanced Features
Theme: "From good to great search"
TOPICS
├── Autocomplete with edge n-grams
├── "Did you mean" spelling correction
├── Faceted search and aggregations
├── Synonym and multi-language handling
└── Personalized search ranking
KEY QUESTIONS
├── How does Google suggest completions so fast?
├── How do you detect and correct typos?
├── How do you show "Brand: Nike (1,234)" counts?
└── How do you handle "color" vs "colour"?
DELIVERABLES
├── Autocomplete service
├── Spelling suggestion service
├── Facet aggregation queries
└── Synonym configuration
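Edge n-grams make autocomplete a plain term lookup: every prefix of every token is indexed ahead of time, so the text typed so far needs no wildcard scan. A minimal in-memory sketch, where `min_gram` and `max_gram` mirror the parameter names of the Elasticsearch `edge_ngram` tokenizer and the titles are hypothetical:

```python
from collections import defaultdict


def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 10) -> list[str]:
    """Return the prefixes of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_prefix_index(titles: list[str]) -> dict[str, set[str]]:
    """Index every edge n-gram of every token back to its title."""
    index: dict[str, set[str]] = defaultdict(set)
    for title in titles:
        for token in title.lower().split():
            for gram in edge_ngrams(token):
                index[gram].add(title)
    return index


def suggest(index: dict[str, set[str]], typed: str, limit: int = 5) -> list[str]:
    """Exact lookup of the typed prefix; sorted only for stable output."""
    return sorted(index.get(typed.lower(), set()))[:limit]


titles = ["iPhone 15 case", "iPad Air", "Nike Air Max"]
index = build_prefix_index(titles)
print(suggest(index, "ip"))   # ['iPad Air', 'iPhone 15 case']
print(suggest(index, "air"))  # ['Nike Air Max', 'iPad Air']
```

The trade-off is index size: each token produces up to `max_gram - min_gram + 1` entries, which is why autocomplete fields are usually kept separate from the main search fields. A real service would also rank suggestions by popularity rather than alphabetically.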
Day 5: Operations & Scale
Theme: "Search that stays up"
TOPICS
├── Cluster sizing and shard strategy
├── Monitoring and alerting
├── Handling traffic spikes
├── Disaster recovery
└── Performance tuning
KEY QUESTIONS
├── How many shards for 500GB?
├── What metrics matter for search?
├── What happens during Black Friday (3x traffic)?
└── How do you recover from index corruption?
DELIVERABLES
├── Cluster architecture diagram
├── Monitoring dashboard
├── Capacity planning model
└── Disaster recovery runbook
Concepts from Previous Weeks Applied
This week builds heavily on previous concepts:
| Previous Week | Concept | Application in Search |
|---|---|---|
| Week 1 | Partitioning | Elasticsearch sharding strategy |
| Week 1 | Replication | Search replica configuration |
| Week 2 | Timeouts | Query timeout handling |
| Week 2 | Circuit Breakers | Search degradation during overload |
| Week 3 | Message Queues | CDC pipeline via Kafka |
| Week 3 | Backpressure | Indexing rate limiting |
| Week 4 | Caching | Query result caching |
| Week 4 | Cache Invalidation | Index freshness |
| Week 5 | Eventual Consistency | Search index lag |
| Week 6 | Async Processing | Background indexing |
Key Technologies
Primary: Elasticsearch/OpenSearch
WHY ELASTICSEARCH
Pros:
├── Purpose-built for search
├── Distributed and scalable
├── Rich query DSL
├── Built-in aggregations
├── Active community
└── Well-documented
Cons:
├── Resource intensive (memory hungry)
├── Complex cluster management
├── Near-real-time only (writes become searchable after a refresh)
├── Can be expensive at scale
└── Learning curve for tuning
Alternatives Considered
| Technology | Best For | Not For |
|---|---|---|
| Elasticsearch | General search, analytics | Strict consistency |
| OpenSearch | AWS-managed option, open source (Apache 2.0) | Teams relying on Elastic-only features |
| Solr | Enterprise, mature | Modern cloud-native |
| Algolia | SaaS simplicity | Cost-sensitive, large scale |
| Meilisearch | Simplicity, typo tolerance | Very large datasets |
| Typesense | Simple, fast | Complex queries |
| PostgreSQL FTS | Small datasets | Scale, features |
Estimation Preview
Scale Numbers
SEARCH SYSTEM SCALE
Documents:
├── Total products: 50,000,000
├── Avg document size: 10 KB
├── Total index size: 500 GB
├── With replicas (1): 1 TB
└── Fields indexed: 20 per document
Traffic:
├── Search queries: 10,000/sec average
├── Peak (Black Friday): 30,000/sec
├── Autocomplete: 50,000/sec (5 per search)
├── Indexing: 100 updates/sec average
└── Bulk imports: 1M/day (batch)
Latency targets:
├── Search p50: 50ms
├── Search p99: 200ms
├── Autocomplete p99: 50ms
└── Index freshness: 5 minutes
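A few derived figures follow directly from the numbers above and are worth having at hand. This is a minimal arithmetic sketch; the variable names are illustrative, and the "daily churn" ratio is an added convenience metric, not a stated requirement.

```python
SECONDS_PER_DAY = 24 * 60 * 60

total_products = 50_000_000
avg_search_qps = 10_000
peak_search_qps = 30_000
autocomplete_per_search = 5
updates_per_sec = 100
bulk_docs_per_day = 1_000_000

autocomplete_qps = avg_search_qps * autocomplete_per_search
peak_multiplier = peak_search_qps / avg_search_qps
streaming_updates_per_day = updates_per_sec * SECONDS_PER_DAY
total_writes_per_day = streaming_updates_per_day + bulk_docs_per_day
daily_churn = total_writes_per_day / total_products  # upper bound on fraction touched

print(autocomplete_qps)            # 50000
print(peak_multiplier)             # 3.0
print(streaming_updates_per_day)   # 8640000
print(total_writes_per_day)        # 9640000
print(round(daily_churn, 3))       # 0.193
```

Reads outnumber writes by roughly two orders of magnitude (10,000 queries/sec against 100 updates/sec), which supports tuning the cluster for query latency and treating indexing as a throttled background workload.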
Infrastructure Estimate
CLUSTER SIZING (rough)
Shards:
├── 500 GB / 30 GB per shard = ~17 shards
├── Round up to 20 shards for growth
└── 1 replica = 40 total shard copies
Nodes:
├── 3 master nodes (dedicated)
├── 6 data nodes (for 40 shards)
├── 2 coordinating nodes (query routing)
└── Total: 11 nodes
Per data node:
├── RAM: 64 GB (~31 GB heap, below the compressed-pointer limit + rest for OS cache)
├── Storage: 500 GB SSD
├── CPU: 16 cores
└── Network: 10 Gbps
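The sizing arithmetic above can be captured in a small function so the inputs are easy to vary. This is a rough sketch: the 30 GB target shard size and the 6-node count come from this estimate, and the function and key names are illustrative.

```python
import math


def size_cluster(index_gb, target_shard_gb, primaries, replicas, data_nodes):
    """Derive shard counts and per-node load from a few sizing inputs."""
    min_primaries = math.ceil(index_gb / target_shard_gb)
    if primaries < min_primaries:
        raise ValueError(f"need at least {min_primaries} primary shards")
    total_copies = primaries * (1 + replicas)
    stored_gb = index_gb * (1 + replicas)
    return {
        "min_primaries": min_primaries,
        "total_shard_copies": total_copies,
        "shard_copies_per_node": total_copies / data_nodes,
        "data_gb_per_node": stored_gb / data_nodes,
    }


plan = size_cluster(index_gb=500, target_shard_gb=30,
                    primaries=20, replicas=1, data_nodes=6)
print(plan["min_primaries"])                    # 17
print(plan["total_shard_copies"])               # 40
print(round(plan["shard_copies_per_node"], 1))  # 6.7
print(round(plan["data_gb_per_node"]))          # 167
```

About 167 GB of data per node against a 500 GB SSD leaves room for segment merges, growth, and shard relocation when a node fails. Re-running with `data_nodes=5` shows what each survivor must hold after losing one node.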
Interview Angle
Search system design is common in interviews:
COMMON INTERVIEW QUESTIONS
1. "Design a search system for an e-commerce site"
→ Full system design (this week's focus)
2. "How would you implement autocomplete?"
→ Day 4: Edge n-grams, prefix queries
3. "How do you keep search in sync with the database?"
→ Day 2: CDC, indexing pipeline
4. "How would you handle a product going viral?"
→ Hot key problem, caching
5. "Design a search system for 1B documents"
→ Sharding strategy, cluster sizing
WHAT INTERVIEWERS LOOK FOR
├── Understanding of inverted indexes
├── Awareness of consistency trade-offs
├── Practical experience with relevance tuning
├── Knowledge of operational concerns
└── Ability to estimate and size systems
Week 7 Success Criteria
By Friday, you should be able to:
DESIGN
├── [ ] Architect a search system from scratch
├── [ ] Choose appropriate technologies with justification
├── [ ] Design document schemas for search
└── [ ] Estimate cluster size and capacity
BUILD
├── [ ] Implement CDC-based indexing pipeline
├── [ ] Configure text analyzers for your use case
├── [ ] Write effective search queries
└── [ ] Build autocomplete and suggestions
OPERATE
├── [ ] Monitor search cluster health
├── [ ] Handle reindexing without downtime
├── [ ] Tune for query performance
└── [ ] Plan for traffic spikes
INTERVIEW
├── [ ] Explain inverted indexes clearly
├── [ ] Discuss relevance tuning trade-offs
├── [ ] Handle "design a search system" questions
└── [ ] Know when NOT to use Elasticsearch
Connections to Future Weeks
WEEK 7 → WEEK 8 CONNECTION
This Week (Search):
- Indexing pipelines from source databases
- Query-time processing
- Real-time index updates
Next Week (Analytics):
- Similar pipeline patterns
- But optimized for aggregations, not search
- Batch-heavy instead of real-time
- Different storage (columnar vs inverted index)
Key insight: Search and Analytics are both
"derived data systems" — they take source data
and transform it for specific query patterns.
Resources
Documentation
- Elasticsearch Guide: https://www.elastic.co/guide/
- OpenSearch Docs: https://opensearch.org/docs/
- Lucene Internals: https://lucene.apache.org/
Engineering Blogs
- Airbnb Search Architecture
- Etsy Search Infrastructure
- LinkedIn Search at Scale
- Uber Search Platform
Books
- "Relevant Search" by Doug Turnbull and John Berryman
- "Elasticsearch: The Definitive Guide" by Clinton Gormley and Zachary Tong
- "Introduction to Information Retrieval" by Manning, Raghavan, and Schütze (Stanford)
Ready to Start
Tomorrow we begin with Day 1: Search Fundamentals & Architecture.
We'll answer the fundamental question: "Why is LIKE '%term%' so slow, and what do search engines do differently?"
DAY 1 PREVIEW
You'll learn:
├── How inverted indexes work (with implementation)
├── Text analysis pipeline (tokenization → stemming → indexing)
├── Document modeling for search
├── Elasticsearch cluster architecture
└── Initial system design for product search
You'll build:
├── Conceptual inverted index
├── Product document schema
├── Text analyzer configuration
└── Cluster sizing estimate
End of Week 7 Preview
Tomorrow: Day 1 — Search Fundamentals: How search engines think