Himanshu Kukreja

Week 7 Preview: Designing a Search System

System Design Mastery Series — Building Blocks Week


Week Overview

Theme: "Making data findable at scale"

This week we design search infrastructure from the ground up. Search is deceptively complex — it's not just "put data in Elasticsearch." Understanding search deeply means understanding inverted indexes, relevance scoring, query optimization, and the operational challenges of keeping indexes fresh and fast.

The System: A product search platform for an e-commerce site with 50M products, 10K queries per second, and sub-200ms latency requirements.


Why Search Systems Matter

Every application eventually needs search:

WITHOUT PROPER SEARCH

User: "red running shoes nike size 10"

SELECT * FROM products 
WHERE name LIKE '%red%' 
  AND name LIKE '%running%'
  AND name LIKE '%shoes%'
  AND name LIKE '%nike%'
  AND size = 10;

Problems:
├── Full table scan (50M rows)
├── No relevance ranking
├── No typo tolerance ("nikee" → 0 results)
├── No synonym handling ("sneakers" vs "shoes")
├── 30+ second query time
└── Database falls over

WITH SEARCH INFRASTRUCTURE

User: "red running shoes nike size 10"

Search engine returns in 50ms:
├── Nike Air Zoom Pegasus (red, size 10) — Score: 0.95
├── Nike Free Run (red, size 10) — Score: 0.91
├── Nike Revolution (crimson, size 10) — Score: 0.87
└── ... 234 more results

Features:
├── Instant results (inverted index)
├── Relevance ranking (BM25 + boosts)
├── Typo tolerance (fuzzy matching)
├── Synonym expansion
├── Faceted filtering
└── Autocomplete suggestions
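
To make the "instant results (inverted index)" point concrete, here is a minimal sketch of the data structure in Python. It is an illustration only, not how Elasticsearch stores postings internally; the class name `InvertedIndex` and the sample products are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document IDs that contain it."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> set[int]:
        """AND semantics: return documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        # Intersect the smallest posting lists first to keep work small.
        lists = sorted((self.postings.get(t, set()) for t in terms), key=len)
        result = set(lists[0])
        for postings in lists[1:]:
            result &= postings
        return result


index = InvertedIndex()
index.add(1, "Nike Air Zoom Pegasus red running shoes")
index.add(2, "Nike Free Run red running shoes")
index.add(3, "Adidas Ultraboost blue running shoes")

print(index.search("red running shoes nike"))  # {1, 2}
```

A lookup touches only the posting lists for the query terms, so its cost depends on how many documents match each term rather than on the 50M rows a `LIKE '%term%'` scan has to read.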

What Makes Search Hard?

1. The Indexing Problem

WRITE PATH CHALLENGES

Source of Truth: PostgreSQL (50M products)
Search Index: Elasticsearch

Challenges:
├── How do you keep them in sync?
├── Product updated → How fast must search reflect it?
├── Bulk imports (1M new products) → Don't overwhelm the index
├── Schema changes → Reindex without downtime?
└── Deleted products → Must disappear from search immediately
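
One way to picture the sync problem is as a translation step: each database change becomes an index or delete action for the search engine's bulk endpoint. The sketch below assumes a Debezium-style change event with `op`, `before`, and `after` fields, already unwrapped from any schema envelope, and a primary key column named `id`. The index name `products` is hypothetical.

```python
import json


def to_bulk_lines(event: dict, index: str = "products") -> list[str]:
    """Translate one change event into Elasticsearch bulk-API NDJSON lines.

    Assumed event shape: "op" is c/u/r (create, update, snapshot read)
    or d (delete); row images live under "after" and "before".
    """
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        action = {"index": {"_index": index, "_id": str(row["id"])}}
        return [json.dumps(action), json.dumps(row)]
    if op == "d":
        row = event["before"]
        action = {"delete": {"_index": index, "_id": str(row["id"])}}
        return [json.dumps(action)]  # delete actions carry no document body
    raise ValueError(f"unsupported op: {op!r}")


events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "name": "Nike Free Run", "price": 89.0}},
    {"op": "u", "before": {"id": 1, "name": "Nike Free Run", "price": 89.0},
     "after": {"id": 1, "name": "Nike Free Run", "price": 79.0}},
    {"op": "d", "before": {"id": 2, "name": "Old Shoe", "price": 10.0},
     "after": None},
]

lines = [line for e in events for line in to_bulk_lines(e)]
body = "\n".join(lines) + "\n"  # the bulk API expects a trailing newline
print(body)
```

Using the primary key as `_id` makes replays idempotent: re-sending the same event overwrites the same document instead of creating a duplicate.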

2. The Relevance Problem

RELEVANCE IS SUBJECTIVE

Query: "apple"

User A (shopping for fruit):
  Wants: Fresh apples, apple juice, apple pie

User B (shopping for electronics):  
  Wants: iPhone, MacBook, AirPods

Same query, completely different intent.
How do you rank results?

3. The Scale Problem

SCALE CHALLENGES

50M documents × 20 fields = 1B field values to index
10K queries/sec × 100ms = ~1,000 queries in flight at any moment (Little's Law)
Index size: 500GB+
Hot portion of the index must fit in memory (OS page cache) for speed

Questions:
├── How many shards?
├── How many replicas?
├── How do you handle hot queries?
└── What happens during reindexing?
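
The scale figures can be sanity-checked with a few lines of arithmetic. This sketch applies Little's Law (average concurrency = arrival rate × time in system) to the query load. The function name `in_flight` is made up for the example, and latency is passed in milliseconds to keep the arithmetic exact.

```python
def in_flight(queries_per_sec: int, latency_ms: int) -> float:
    """Little's Law: average number of queries being served at once."""
    return queries_per_sec * latency_ms / 1000


docs, fields_per_doc = 50_000_000, 20

print(docs * fields_per_doc)    # 1000000000 field values to index
print(in_flight(10_000, 100))   # 1000.0 queries in flight on average
print(in_flight(30_000, 200))   # 6000.0 at peak rate and p99 latency
```

A 10K queries/sec load at 100ms therefore keeps about a thousand queries in flight, which is the number to hold against thread-pool and connection limits.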

Week 7 Learning Objectives

By the end of this week, you will be able to:

SEARCH FUNDAMENTALS
├── Explain how inverted indexes work
├── Understand tokenization, stemming, and analysis
├── Design document schemas for search
└── Choose between Elasticsearch, OpenSearch, Solr, or custom

INDEXING PIPELINES
├── Design CDC-based indexing (Debezium pattern)
├── Implement batch vs streaming indexing trade-offs
├── Handle schema evolution without downtime
└── Build reindexing strategies

QUERY OPTIMIZATION
├── Understand BM25 and TF-IDF scoring
├── Implement query-time boosting
├── Design filter vs query clauses
└── Optimize for latency at scale

ADVANCED FEATURES
├── Build autocomplete with edge n-grams
├── Implement "did you mean" suggestions
├── Design faceted search and aggregations
└── Handle multi-language search

OPERATIONS
├── Size clusters appropriately
├── Monitor search health
├── Handle index corruption and recovery
└── Plan capacity for growth

The System We're Building

Product Search Platform

┌─────────────────────────────────────────────────────────────────────────┐
│                    PRODUCT SEARCH PLATFORM                              │
│                                                                         │
│  SCALE                                                                  │
│  ├── Products: 50 million                                               │
│  ├── Queries: 10,000/second (peak: 30K during sales)                    │
│  ├── Index size: ~500GB                                                 │
│  └── Freshness: Products searchable within 5 minutes of update          │
│                                                                         │
│  FEATURES                                                               │
│  ├── Full-text search with relevance ranking                            │
│  ├── Filtered search (category, brand, price range, ratings)            │
│  ├── Faceted navigation (show counts per filter)                        │
│  ├── Autocomplete (as-you-type suggestions)                             │
│  ├── Typo tolerance ("iphoen" → "iphone")                               │
│  ├── Synonyms ("couch" = "sofa", "tv" = "television")                   │
│  └── Personalized ranking (user preferences boost)                      │
│                                                                         │
│  REQUIREMENTS                                                           │
│  ├── Latency: p99 < 200ms                                               │
│  ├── Availability: 99.9%                                                │
│  ├── Consistency: Eventually consistent (5 min max)                     │
│  └── Accuracy: High relevance for top 10 results                        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
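
Several of the features in the box above can share one request. The sketch below builds an Elasticsearch query DSL body in Python that combines a scored full-text clause with typo tolerance, unscored filters, and a terms aggregation for facet counts. The field names `name`, `brand`, and `price` are assumptions about the product schema, which the week has not yet defined.

```python
import json
from typing import Optional


def build_search_body(text: str, brand: Optional[str] = None,
                      max_price: Optional[float] = None,
                      size: int = 10) -> dict:
    """Build a bool query: scored full-text match plus unscored filters."""
    filters = []
    if brand is not None:
        filters.append({"term": {"brand": brand}})
    if max_price is not None:
        filters.append({"range": {"price": {"lte": max_price}}})
    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"name": {"query": text,
                                             "fuzziness": "AUTO"}}}],
                # "filter" clauses only include or exclude; they are cacheable.
                "filter": filters,
            }
        },
        # Terms aggregation yields per-brand counts for faceted navigation.
        "aggs": {"brands": {"terms": {"field": "brand", "size": 20}}},
    }


body = build_search_body("red running shoes", brand="nike", max_price=100)
print(json.dumps(body, indent=2))
```

Note that aggregations are computed over the filtered result set, so a facet that should keep showing other brands after one is selected usually needs a different arrangement, such as applying that filter with `post_filter`.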

Architecture Preview

┌─────────────────────────────────────────────────────────────────────────┐
│                    HIGH-LEVEL ARCHITECTURE                              │
│                                                                         │
│                         ┌─────────────┐                                 │
│                         │   Client    │                                 │
│                         └──────┬──────┘                                 │
│                                │                                        │
│                                ▼                                        │
│                         ┌─────────────┐                                 │
│                         │  Search API │                                 │
│                         └──────┬──────┘                                 │
│                                │                                        │
│              ┌─────────────────┼─────────────────┐                      │
│              ▼                 ▼                 ▼                      │
│       ┌───────────┐    ┌─────────────┐   ┌───────────┐                  │
│       │  Query    │    │   Search    │   │  Suggest  │                  │
│       │  Parser   │    │   Router    │   │  Service  │                  │
│       └─────┬─────┘    └──────┬──────┘   └─────┬─────┘                  │
│             │                 │                 │                       │
│             └─────────────────┼─────────────────┘                       │
│                               ▼                                         │
│                      ┌─────────────────┐                                │
│                      │  Elasticsearch  │                                │
│                      │    Cluster      │                                │
│                      │  ┌───┬───┬───┐  │                                │
│                      │  │ S1│ S2│ S3│  │  (Shards)                      │
│                      │  └───┴───┴───┘  │                                │
│                      └────────┬────────┘                                │
│                               │                                         │
│         ┌─────────────────────┼─────────────────────┐                   │
│         │                     │                     │                   │
│         ▼                     ▼                     ▼                   │
│  ┌─────────────┐      ┌─────────────┐       ┌─────────────┐             │
│  │  Indexing   │      │   Kafka     │       │ PostgreSQL  │             │
│  │  Pipeline   │◄─────│   (CDC)     │◄──────│  (Source)   │             │
│  └─────────────┘      └─────────────┘       └─────────────┘             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Daily Breakdown

Day 1: Search Fundamentals & Architecture

Theme: "How search engines think"

TOPICS
├── Inverted indexes: The core data structure
├── Text analysis: Tokenization, stemming, normalization
├── Document modeling: Designing for search
├── Elasticsearch vs alternatives: When to use what
└── Initial architecture and component design

KEY QUESTIONS
├── Why can't we just use SQL LIKE?
├── What's an inverted index and why is it fast?
├── How do you decide what fields to index?
└── Elasticsearch vs OpenSearch vs Solr vs Algolia?

DELIVERABLES
├── Inverted index implementation (conceptual)
├── Product document schema design
├── Text analyzer configuration
└── Cluster sizing estimation
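
As a preview of the text analysis topic, the sketch below chains the usual stages: tokenize, lowercase, drop stop words, stem. The stop-word list and the suffix rules are deliberately crude assumptions for illustration. A real analyzer would use a proper stemmer such as Porter or a language-specific one.

```python
import re

STOP_WORDS = {"a", "an", "and", "for", "of", "the", "with"}


def stem(token: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    if token.endswith("ing") and len(token) > 5:
        base = token[:-3]
        # "running" -> "runn" -> "run": undo a doubled final consonant.
        if len(base) >= 2 and base[-1] == base[-2]:
            base = base[:-1]
        return base
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text: str) -> list[str]:
    """Tokenize, lowercase, remove stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("Running Shoes for the Trails"))  # ['run', 'shoe', 'trail']
print(analyze("run shoe"))                      # ['run', 'shoe']
```

The same `analyze` function has to run at both index time and query time. Otherwise "Running" in a document and "run" in a query would never meet on the same term.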

Day 2: Indexing Pipeline

Theme: "Keeping search fresh"

TOPICS
├── Change Data Capture (CDC) with Debezium
├── Streaming vs batch indexing trade-offs
├── Bulk indexing strategies
├── Schema evolution and reindexing
└── Handling deletes and updates

KEY QUESTIONS
├── How do you sync PostgreSQL → Elasticsearch?
├── What happens if indexing falls behind?
├── How do you reindex 50M documents without downtime?
└── What if a product is deleted but still in the index?

DELIVERABLES
├── CDC pipeline implementation
├── Bulk indexing service
├── Zero-downtime reindex strategy
└── Index aliasing pattern
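
The index aliasing pattern can be summarized in one request body. Clients always query an alias. A reindex builds a new versioned index in the background, then a single `POST /_aliases` call repoints the alias. The sketch below only constructs that body. The names `products`, `products_v1`, and `products_v2` are hypothetical.

```python
import json


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves `alias` from old_index to new_index.

    Both actions are applied atomically, so searches against the alias
    never see a moment with zero or two backing indexes.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = alias_swap_actions("products", "products_v1", "products_v2")
print(json.dumps(swap, indent=2))
```

Rolling back is the same call with the two index names exchanged, provided the old index has not been deleted yet.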

Day 3: Query Processing & Relevance

Theme: "Finding the needle and ranking the haystack"

TOPICS
├── Query parsing and understanding
├── BM25 scoring algorithm
├── Boosting and custom scoring
├── Filters vs queries performance
└── Query optimization techniques

KEY QUESTIONS
├── How does Elasticsearch rank results?
├── When do you use filter vs query?
├── How do you boost certain products (sponsored, popular)?
└── How do you handle "no results" gracefully?

DELIVERABLES
├── Query parser implementation
├── Relevance tuning configuration
├── Function score queries
└── Query performance optimization
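
For a first look at BM25, the sketch below scores a single query term against a single document using the textbook formula. The defaults `k1 = 1.2` and `b = 0.75` match Elasticsearch's default BM25 settings. The document frequencies in the example are invented, and the exact constants in Lucene's implementation may differ by a factor that does not change ranking.

```python
import math


def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Score of one term in one document (textbook BM25).

    tf        occurrences of the term in the document
    doc_len   document length in tokens; avg_doc_len is the corpus average
    n_docs    total documents; doc_freq is how many contain the term
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates (k1) and is normalized by document length (b).
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


N = 50_000_000
rare = bm25_term_score(tf=1, doc_len=20, avg_doc_len=20, n_docs=N, doc_freq=1_000)
common = bm25_term_score(tf=1, doc_len=20, avg_doc_len=20, n_docs=N, doc_freq=10_000_000)
print(rare > common)  # True: a rare term is stronger evidence than a common one

for tf in (1, 5, 10):
    print(tf, bm25_term_score(tf, 20, 20, N, 1_000))
```

The loop shows the saturation effect: going from 5 to 10 occurrences adds much less score than going from 1 to 5. This is one reason keyword stuffing pays off less under BM25 than under raw TF-IDF.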

Day 4: Advanced Features

Theme: "From good to great search"

TOPICS
├── Autocomplete with edge n-grams
├── "Did you mean" spelling correction
├── Faceted search and aggregations
├── Synonym and multi-language handling
└── Personalized search ranking

KEY QUESTIONS
├── How does Google suggest completions so fast?
├── How do you detect and correct typos?
├── How do you show "Brand: Nike (1,234)" counts?
└── How do you handle "color" vs "colour"?

DELIVERABLES
├── Autocomplete service
├── Spelling suggestion service
├── Facet aggregation queries
└── Synonym configuration
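
The idea behind edge n-gram autocomplete is to do the prefix work at index time, so that a lookup becomes an exact term match. The sketch below is a plain-Python stand-in for an edge n-gram token filter. The class `Autocomplete`, the minimum gram length of 2, and the sample titles are assumptions for the example, and it handles only a single-word prefix.

```python
from collections import defaultdict


def edge_ngrams(token: str, min_len: int = 2, max_len: int = 10) -> list[str]:
    """Prefixes of a token: 'nike' -> ['ni', 'nik', 'nike']."""
    upper = min(len(token), max_len)
    return [token[:n] for n in range(min_len, upper + 1)]


class Autocomplete:
    """Indexes every prefix of every word so suggest() is one dict lookup."""

    def __init__(self) -> None:
        self.index: dict[str, set[str]] = defaultdict(set)

    def add(self, title: str) -> None:
        for token in title.lower().split():
            for gram in edge_ngrams(token):
                self.index[gram].add(title)

    def suggest(self, prefix: str, limit: int = 5) -> list[str]:
        # Alphabetical order here; a real service would rank by popularity.
        return sorted(self.index.get(prefix.lower(), set()))[:limit]


ac = Autocomplete()
ac.add("Nike Air Zoom Pegasus")
ac.add("Nikon D3500 Camera")
ac.add("Adidas Ultraboost")

print(ac.suggest("nik"))  # ['Nike Air Zoom Pegasus', 'Nikon D3500 Camera']
print(ac.suggest("peg"))  # ['Nike Air Zoom Pegasus']
```

The trade-off is index size: every word is stored once per prefix length. This is why the gram range is usually capped and why autocomplete often lives in its own smaller index.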

Day 5: Operations & Scale

Theme: "Search that stays up"

TOPICS
├── Cluster sizing and shard strategy
├── Monitoring and alerting
├── Handling traffic spikes
├── Disaster recovery
└── Performance tuning

KEY QUESTIONS
├── How many shards for 500GB?
├── What metrics matter for search?
├── What happens during Black Friday (3x traffic, 30K queries/sec)?
└── How do you recover from index corruption?

DELIVERABLES
├── Cluster architecture diagram
├── Monitoring dashboard
├── Capacity planning model
└── Disaster recovery runbook

Concepts from Previous Weeks Applied

This week builds heavily on previous concepts:

Previous Week   Concept                Application in Search
-------------   --------------------   ----------------------------------
Week 1          Partitioning           Elasticsearch sharding strategy
Week 1          Replication            Search replica configuration
Week 2          Timeouts               Query timeout handling
Week 2          Circuit Breakers       Search degradation during overload
Week 3          Message Queues         CDC pipeline via Kafka
Week 3          Backpressure           Indexing rate limiting
Week 4          Caching                Query result caching
Week 4          Cache Invalidation     Index freshness
Week 5          Eventual Consistency   Search index lag
Week 6          Async Processing       Background indexing

Key Technologies

Primary: Elasticsearch/OpenSearch

WHY ELASTICSEARCH

Pros:
├── Purpose-built for search
├── Distributed and scalable
├── Rich query DSL
├── Built-in aggregations
├── Active community
└── Well-documented

Cons:
├── Resource intensive (memory hungry)
├── Complex cluster management
├── Eventual consistency only
├── Can be expensive at scale
└── Learning curve for tuning

Alternatives Considered

Technology       Best For                     Not For
--------------   --------------------------   ---------------------------
Elasticsearch    General search, analytics    Strict consistency
OpenSearch       AWS-native, open source      Non-AWS environments
Solr             Enterprise, mature           Modern cloud-native
Algolia          SaaS simplicity              Cost-sensitive, large scale
Meilisearch      Simplicity, typo tolerance   Very large datasets
Typesense        Simple, fast                 Complex queries
PostgreSQL FTS   Small datasets               Scale, features

Estimation Preview

Scale Numbers

SEARCH SYSTEM SCALE

Documents:
├── Total products: 50,000,000
├── Avg document size: 10 KB
├── Total index size: 500 GB
├── With replicas (1): 1 TB
└── Fields indexed: 20 per document

Traffic:
├── Search queries: 10,000/sec average
├── Peak (Black Friday): 30,000/sec
├── Autocomplete: 50,000/sec (5 per search)
├── Indexing: 100 updates/sec average
└── Bulk imports: 1M/day (batch)

Latency targets:
├── Search p50: 50ms
├── Search p99: 200ms
├── Autocomplete p99: 50ms
└── Index freshness: 5 minutes

Infrastructure Estimate

CLUSTER SIZING (rough)

Shards:
├── 500 GB / 30 GB per shard = ~17 shards
├── Round up to 20 shards for growth
└── 1 replica = 40 total shard copies

Nodes:
├── 3 master nodes (dedicated)
├── 6 data nodes (for 40 shards)
├── 2 coordinating nodes (query routing)
└── Total: 11 nodes

Per data node:
├── RAM: 64 GB (~31 GB heap, kept under the compressed-pointer limit; rest for OS cache)
├── Storage: 500 GB SSD
├── CPU: 16 cores
└── Network: 10 Gbps
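
The sizing above can be reproduced as a small calculation, which makes it easy to rerun when an input changes. This is a rough sketch using only the figures in this estimate. The function name `size_cluster` is made up, and real sizing also depends on query mix and indexing load.

```python
import math


def size_cluster(index_gb: int, target_shard_gb: int, chosen_primaries: int,
                 replicas: int, data_nodes: int, disk_per_node_gb: int) -> dict:
    """Derive shard and storage figures from the rough inputs above."""
    min_primaries = math.ceil(index_gb / target_shard_gb)
    shard_copies = chosen_primaries * (1 + replicas)
    stored_gb = index_gb * (1 + replicas)
    return {
        "min_primaries": min_primaries,
        "shard_copies": shard_copies,
        "stored_gb": stored_gb,
        "copies_per_node": shard_copies / data_nodes,
        "gb_per_node": stored_gb / data_nodes,
        "disk_utilization": stored_gb / (data_nodes * disk_per_node_gb),
    }


plan = size_cluster(index_gb=500, target_shard_gb=30, chosen_primaries=20,
                    replicas=1, data_nodes=6, disk_per_node_gb=500)
for key, value in plan.items():
    print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

With these inputs each data node holds about 167 GB across 6 to 7 shard copies, roughly a third of its disk. That leaves room for growth, segment merges, and a second full copy of the index during reindexing.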

Interview Angle

Search system design is common in interviews:

COMMON INTERVIEW QUESTIONS

1. "Design a search system for an e-commerce site"
   → Full system design (this week's focus)

2. "How would you implement autocomplete?"
   → Day 4: Edge n-grams, prefix queries

3. "How do you keep search in sync with the database?"
   → Day 2: CDC, indexing pipeline

4. "How would you handle a product going viral?"
   → Hot key problem, caching

5. "Design a search system for 1B documents"
   → Sharding strategy, cluster sizing

WHAT INTERVIEWERS LOOK FOR
├── Understanding of inverted indexes
├── Awareness of consistency trade-offs
├── Practical experience with relevance tuning
├── Knowledge of operational concerns
└── Ability to estimate and size systems

Week 7 Success Criteria

By Friday, you should be able to:

DESIGN
├── [ ] Architect a search system from scratch
├── [ ] Choose appropriate technologies with justification
├── [ ] Design document schemas for search
└── [ ] Estimate cluster size and capacity

BUILD
├── [ ] Implement CDC-based indexing pipeline
├── [ ] Configure text analyzers for your use case
├── [ ] Write effective search queries
└── [ ] Build autocomplete and suggestions

OPERATE
├── [ ] Monitor search cluster health
├── [ ] Handle reindexing without downtime
├── [ ] Tune for query performance
└── [ ] Plan for traffic spikes

INTERVIEW
├── [ ] Explain inverted indexes clearly
├── [ ] Discuss relevance tuning trade-offs
├── [ ] Handle "design a search system" questions
└── [ ] Know when NOT to use Elasticsearch

Connections to Future Weeks

WEEK 7 → WEEK 8 CONNECTION

This Week (Search):
- Indexing pipelines from source databases
- Query-time processing
- Real-time index updates

Next Week (Analytics):
- Similar pipeline patterns
- But optimized for aggregations, not search
- Batch-heavy instead of real-time
- Different storage (columnar vs inverted index)

Key insight: Search and Analytics are both
"derived data systems" — they take source data
and transform it for specific query patterns.

Resources

Engineering Blogs

  • Airbnb Search Architecture
  • Etsy Search Infrastructure
  • LinkedIn Search at Scale
  • Uber Search Platform

Books

  • "Relevant Search" by Doug Turnbull
  • "Elasticsearch: The Definitive Guide"
  • "Introduction to Information Retrieval" (Stanford)

Ready to Start

Tomorrow we begin with Day 1: Search Fundamentals & Architecture.

We'll answer the fundamental question: "Why is LIKE '%term%' so slow, and what do search engines do differently?"

DAY 1 PREVIEW

You'll learn:
├── How inverted indexes work (with implementation)
├── Text analysis pipeline (tokenization → stemming → indexing)
├── Document modeling for search
├── Elasticsearch cluster architecture
└── Initial system design for product search

You'll build:
├── Conceptual inverted index
├── Product document schema
├── Text analyzer configuration
└── Cluster sizing estimate

End of Week 7 Preview

Tomorrow: Day 1 — Search Fundamentals: How search engines think