Himanshu Kukreja

Bonus Problem 5: Google Search

The World's Most Complex Information Retrieval System


πŸ” Answering 8.5 Billion Questions a Day β€” In Milliseconds

Imagine this challenge: Someone types three words into a search box.

Within 200 milliseconds, you must search through 400 billion documents (thousands of times the number of books ever written), find the most relevant results, rank them by quality and relevance, check for spam, personalize the results, and display them in a beautiful interface.

Now do that 100,000 times per second, every second, 24/7.

This is Google Search β€” the most sophisticated information retrieval system ever built.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                              β”‚
β”‚                       THE GOOGLE SEARCH SCALE (2025)                         β”‚
β”‚                                                                              β”‚
β”‚   QUERY VOLUME                                                               β”‚
β”‚   ────────────                                                               β”‚
β”‚   Daily searches:              8.5 - 9.1 Billion                             β”‚
β”‚   Searches per second:         ~100,000 peak                                 β”‚
β”‚   Annual searches:             5+ Trillion                                   β”‚
β”‚   Market share:                91.5% globally                                β”‚
β”‚                                                                              β”‚
β”‚   INDEX SIZE                                                                 β”‚
β”‚   ──────────                                                                 β”‚
β”‚   Indexed documents:           ~400 Billion                                  β”‚
β”‚   Known URLs:                  130+ Trillion                                 β”‚
β”‚   Index storage:               100+ Petabytes                                β”‚
β”‚   New URLs discovered:         Billions per day                              β”‚
β”‚                                                                              β”‚
β”‚   INFRASTRUCTURE                                                             β”‚
β”‚   ──────────────                                                             β”‚
β”‚   Data centers:                30+ worldwide                                 β”‚
β”‚   Servers (estimated):         2.5+ Million                                  β”‚
β”‚   Power consumption:           ~18.3 TWh/year                                β”‚
β”‚   Response time target:        <200ms (p99)                                  β”‚
β”‚                                                                              β”‚
β”‚   BUSINESS IMPACT                                                            β”‚
β”‚   ───────────────                                                            β”‚
β”‚   Search ad revenue (2024):    $175+ Billion                                 β”‚
β”‚   Revenue per search:          ~$0.03                                        β”‚
β”‚   Users worldwide:             5+ Billion                                    β”‚
β”‚                                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This is the system we'll design today β€” and understand how search engines achieve the impossible: sub-second retrieval from hundreds of billions of documents.


The Interview Begins

You walk into the interview room. The interviewer smiles and gestures to the whiteboard.

Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I want to see how you think about large-scale distributed systems. Feel free to ask questions β€” this should be collaborative."

They write on the whiteboard:

╔══════════════════════════════════════════════════════════════════════════════╗
β•‘                                                                              β•‘
β•‘                        Design a Web Search Engine                            β•‘
β•‘                                                                              β•‘
β•‘   Build a search engine that can:                                            β•‘
β•‘   - Index billions of web pages                                              β•‘
β•‘   - Return relevant results in <500ms                                        β•‘
β•‘   - Handle 10,000+ queries per second                                        β•‘
β•‘   - Keep the index fresh (new content discoverable within hours)             β•‘
β•‘   - Rank results by relevance and quality                                    β•‘
β•‘                                                                              β•‘
β•‘   Consider:                                                                  β•‘
β•‘   - How do you discover new pages?                                           β•‘
β•‘   - How do you store and query the index efficiently?                        β•‘
β•‘   - How do you rank billions of documents?                                   β•‘
β•‘   - How do you handle 100K QPS with sub-second latency?                      β•‘
β•‘                                                                              β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."


Phase 1: Requirements Clarification (5 minutes)

Before diving in, you take a breath and start asking questions.

Your Questions

You: "Before I start designing, I'd like to clarify a few requirements. First, what's the scale of the web we need to index? Are we talking millions or billions of pages?"

Interviewer: "Let's say we need to index around 50 billion pages initially, with the capability to scale to hundreds of billions."

You: "And what's our query volume? Peak QPS?"

Interviewer: "Target 10,000 QPS initially, scaling to 100,000 QPS."

You: "What about freshness requirements? How quickly should new content appear in search results?"

Interviewer: "Critical news should appear within minutes. Regular content within hours. Most content within a few days is acceptable."

You: "For ranking, should we implement something like PageRank, or simpler relevance scoring?"

Interviewer: "Start with basic relevance like TF-IDF, then layer in link-based authority scoring like PageRank."

You: "Perfect. Let me summarize the requirements."

Functional Requirements

1. WEB CRAWLING
   - Discover and download web pages continuously
   - Respect robots.txt and crawl rate limits
   - Handle various content types (HTML, PDF, etc.)
   - Track page freshness and update frequency

2. INDEXING
   - Parse and extract text content from pages
   - Build searchable index from page content
   - Store page metadata (title, URL, timestamp)
   - Support incremental index updates

3. QUERY PROCESSING
   - Parse and understand user queries
   - Retrieve matching documents efficiently
   - Rank results by relevance and quality
   - Support features: autocomplete, spell check, filters

4. RANKING
   - Text relevance scoring (TF-IDF, BM25)
   - Link-based authority (PageRank)
   - Freshness signals
   - Spam detection and filtering

Non-Functional Requirements

1. SCALE
   - 50+ billion indexed pages
   - 100+ petabytes of index data
   - 100,000 QPS at peak

2. LATENCY
   - Query response: <500ms p99
   - Index freshness: <24 hours for most content
   - Breaking news: <15 minutes

3. AVAILABILITY
   - 99.99% uptime
   - Graceful degradation under load
   - Multi-region deployment

4. CONSISTENCY
   - Eventually consistent index
   - Stale reads acceptable (within hours)
   - Strong consistency for user data only

Phase 2: Back-of-the-Envelope Estimation (5 minutes)

You: "Let me work through the numbers to understand the scale we're dealing with."

Index Size Estimation

DOCUMENT STORAGE

Pages to index:                 50 billion
Average page size (compressed): 50 KB
Raw page storage:               50B Γ— 50KB = 2.5 PB

INVERTED INDEX

Unique terms:                   ~10 billion
Average postings per term:      ~1 million
Posting entry size:             12 bytes (docID + position + metadata)
Inverted index size:            10B Γ— 1M Γ— 12B = 120 PB

With compression (5x):          ~25 PB
With replication (3x):          ~75 PB total

FORWARD INDEX (document β†’ terms)

Per-document metadata:          1 KB
Forward index:                  50B Γ— 1KB = 50 TB
With replication:               150 TB

Query Traffic Estimation

QUERY VOLUME

Peak QPS:                       100,000
Queries per day:                ~5 billion
Average query length:           3.4 words

QUERY PROCESSING

Documents to score per query:   ~10,000 candidates
Scoring time per doc:           ~10 microseconds
Total scoring time:             ~100ms
Results to return:              10-20 per page

BANDWIDTH

Query request size:             ~500 bytes
Response size:                  ~50 KB
Peak bandwidth:                 100K Γ— 50KB = 5 GB/s
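These figures are easy to sanity-check in a few lines of Python (every input is the rough assumption from this section, not a measured value):

```python
# Back-of-the-envelope check of the estimates above, using integer
# arithmetic so the comparisons are exact.
KB, PB = 10**3, 10**15

pages = 50 * 10**9                       # pages to index
raw_storage = pages * 50 * KB            # 50 KB average page, compressed
assert raw_storage == int(2.5 * PB)      # 2.5 PB raw page storage

terms = 10 * 10**9                       # unique terms
postings_per_term = 10**6
posting_bytes = 12                       # docID + position + metadata
inverted = terms * postings_per_term * posting_bytes
assert inverted == 120 * PB              # 120 PB, ~25 PB after 5x compression

peak_qps = 100_000
response_bytes = 50 * KB
assert peak_qps * response_bytes == 5 * 10**9   # 5 GB/s peak bandwidth
```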

Infrastructure Estimation

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       INFRASTRUCTURE SUMMARY                                 β”‚
β”‚                                                                              β”‚
β”‚   STORAGE                                                                    β”‚
β”‚   β”œβ”€β”€ Index storage:          ~100 PB                                        β”‚
β”‚   β”œβ”€β”€ Document storage:       ~10 PB (compressed)                            β”‚
β”‚   └── Metadata/logs:          ~5 PB                                          β”‚
β”‚                                                                              β”‚
β”‚   COMPUTE                                                                    β”‚
β”‚   β”œβ”€β”€ Index servers:          ~10,000 machines                               β”‚
β”‚   β”œβ”€β”€ Crawlers:               ~1,000 machines                                β”‚
β”‚   β”œβ”€β”€ Query servers:          ~5,000 machines                                β”‚
β”‚   └── Ranking/ML:             ~2,000 machines                                β”‚
β”‚                                                                              β”‚
β”‚   NETWORK                                                                    β”‚
β”‚   β”œβ”€β”€ Internal bandwidth:     100+ Tbps                                      β”‚
β”‚   └── External bandwidth:     10+ Tbps                                       β”‚
β”‚                                                                              β”‚
β”‚   DATA CENTERS                                                               β”‚
β”‚   └── Minimum:                5-7 regions globally                           β”‚
β”‚                                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Phase 3: High-Level Design (10 minutes)

You: "Now let me sketch out the high-level architecture. A search engine has three major components: crawling, indexing, and serving."

System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       SEARCH ENGINE ARCHITECTURE                            β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚                         CRAWLING PIPELINE                              β”‚ β”‚
β”‚  β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚ β”‚
β”‚  β”‚   β”‚  URL    │────▢│ Crawler │────▢│ Content │────▢│  Page   β”‚          β”‚ β”‚
β”‚  β”‚   β”‚ Frontierβ”‚     β”‚ Workers β”‚     β”‚ Parser  β”‚     β”‚  Store  β”‚          β”‚ β”‚
β”‚  β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                      β”‚                                      β”‚
β”‚                                      β–Ό                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚                         INDEXING PIPELINE                              β”‚ β”‚
β”‚  β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚ β”‚
β”‚  β”‚   β”‚Document │────▢│Tokenizer│────▢│  Index  │────▢│Inverted β”‚          β”‚ β”‚
β”‚  β”‚   β”‚ Parser  β”‚     β”‚/Analyzerβ”‚     β”‚ Builder β”‚     β”‚  Index  β”‚          β”‚ β”‚
β”‚  β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚ β”‚
β”‚  β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                          β”‚ β”‚
β”‚  β”‚   β”‚PageRank │◀─────────────────────────────────────────────            β”‚ β”‚
β”‚  β”‚   β”‚ Compute β”‚                                                          β”‚ β”‚
β”‚  β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                          β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                      β”‚                                      β”‚
β”‚                                      β–Ό                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚                         SERVING LAYER                                  β”‚ β”‚
β”‚  β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚ β”‚
β”‚  β”‚   β”‚  Query  │────▢│Retrieval│────▢│ Ranking │────▢│ Results β”‚          β”‚ β”‚
β”‚  β”‚   β”‚ Parser  β”‚     β”‚ Engine  β”‚     β”‚ Engine  β”‚     β”‚ Server  β”‚          β”‚ β”‚
β”‚  β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Component Breakdown

1. Crawling Pipeline

Purpose: Continuously discover and download web pages from the internet.

Key components:

  • URL Frontier: Priority queue of URLs to crawl, ordered by importance and freshness
  • Crawler Workers: Distributed fetchers that download pages, respecting rate limits
  • Content Parser: Extract text, links, and metadata from HTML
  • Link Extractor: Discover new URLs to feed back to the frontier

2. Indexing Pipeline

Purpose: Process crawled pages and build searchable data structures.

Key components:

  • Document Parser: Clean HTML, extract text, handle encoding
  • Tokenizer/Analyzer: Split text into tokens, apply stemming, remove stopwords
  • Index Builder: Construct inverted index data structures
  • PageRank Compute: Calculate link-based authority scores
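A toy version of the tokenize/analyze step; the stopword set and suffix-stripping rules here are deliberately crude stand-ins for real analyzers (full stopword lists, Porter or Snowball stemmers):

```python
import re

# A handful of English stopwords; production analyzers use much larger,
# language-specific lists.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in"}


def analyze(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords,
    and apply a naive suffix-stripping 'stem'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    out = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        for suffix in ("ing", "es", "s"):        # naive stemming
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out


print(analyze("The quick foxes jumping"))   # ['quick', 'fox', 'jump']
```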

3. Serving Layer

Purpose: Handle user queries and return ranked results.

Key components:

  • Query Parser: Parse query, expand synonyms, handle operators
  • Retrieval Engine: Find candidate documents from inverted index
  • Ranking Engine: Score and sort candidates by relevance
  • Results Server: Format and return final results

Phase 4: Deep Dives (20 minutes)

Interviewer: "Great overview. Let's dive deeper into some challenging aspects. How does the inverted index actually work?"


Deep Dive 1: The Inverted Index (Week 1 - Partitioning & Sharding)

You: "The inverted index is the heart of any search engine. Let me explain how it works and how we scale it."

What is an Inverted Index?

INVERTED INDEX STRUCTURE

Traditional index (forward):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Document   β”‚ Words                                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Doc 1      β”‚ "the quick brown fox jumps"             β”‚
β”‚ Doc 2      β”‚ "the lazy brown dog sleeps"             β”‚
β”‚ Doc 3      β”‚ "quick fox runs fast"                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Inverted index (reverse):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Term       β”‚ Postings List (DocID:Positions)         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ "brown"    β”‚ [(Doc1, [3]), (Doc2, [3])]              β”‚
β”‚ "fox"      β”‚ [(Doc1, [4]), (Doc3, [2])]              β”‚
β”‚ "quick"    β”‚ [(Doc1, [2]), (Doc3, [1])]              β”‚
β”‚ "lazy"     β”‚ [(Doc2, [2])]                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

To find "quick fox":
  1. Look up "quick" β†’ {Doc1, Doc3}
  2. Look up "fox" β†’ {Doc1, Doc3}
  3. Intersect β†’ {Doc1, Doc3}
  4. Score and rank

Sharding the Index

You: "With 50 billion documents today, scaling toward hundreds of billions, and roughly 10 billion unique terms, we need to shard the index across thousands of machines."

INDEX SHARDING STRATEGIES

Option 1: Document-based sharding
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Each shard contains ALL terms for a SUBSET of documents                   β”‚
β”‚                                                                             β”‚
β”‚   Shard 1: Docs 0-1M         Shard 2: Docs 1M-2M       Shard N: ...         β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚
β”‚   β”‚ "apple" β†’ [...]β”‚         β”‚ "apple" β†’ [...]β”‚                             β”‚
β”‚   β”‚ "banana"β†’ [...]β”‚         β”‚ "banana"β†’ [...]β”‚                             β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                             β”‚
β”‚                                                                             β”‚
β”‚   Query: Search ALL shards in parallel, merge results                       β”‚
β”‚   Pro: Each shard is self-contained                                         β”‚
β”‚   Con: Every query hits every shard                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Option 2: Term-based sharding
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Each shard contains ALL documents for a SUBSET of terms                   β”‚
β”‚                                                                             β”‚
β”‚   Shard A-F: Terms A-F       Shard G-L: Terms G-L                           β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚
β”‚   β”‚ "apple" β†’ all  β”‚         β”‚ "guitar"β†’ all  β”‚                             β”‚
β”‚   β”‚ "banana"β†’ all  β”‚         β”‚ "house" β†’ all  β”‚                             β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                             β”‚
β”‚                                                                             β”‚
β”‚   Query: Only hit shards containing query terms                             β”‚
β”‚   Pro: Fewer shards per query                                               β”‚
β”‚   Con: Cross-shard coordination for scoring                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Google uses: Document-based sharding (with tiered index)
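Under document-based sharding the query path is scatter-gather: fan the query out to every shard, then merge the per-shard results. A minimal sketch, assuming a hypothetical shard interface `search(terms, top_k)` that returns `(doc_id, score)` pairs:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


class SearchAggregator:
    """Scatter-gather over document-based shards.

    The shard interface assumed here (search(terms, top_k) returning
    scored hits) is illustrative, not a real engine's API.
    """

    def __init__(self, shards):
        self.shards = shards

    def search(self, terms: List[str], top_k: int = 10) -> List[Tuple[int, float]]:
        # Fan the query out to every shard in parallel...
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = pool.map(lambda shard: shard.search(terms, top_k),
                                self.shards)
        # ...then merge the per-shard top-k lists into one global top-k.
        merged = [hit for part in partials for hit in part]
        merged.sort(key=lambda hit: hit[1], reverse=True)   # by score
        return merged[:top_k]
```

Note the key property of this layout: each shard only needs its own top-k, so the merge step handles at most `shards × top_k` candidates regardless of index size.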

Implementation

# inverted_index/index_shard.py

"""
Inverted Index Shard Implementation
Demonstrates the core data structures used in search engines.
"""

from dataclasses import dataclass, field
from typing import List, Dict, Optional
from collections import defaultdict
import bisect
import math


@dataclass
class Posting:
    """A single posting entry in the inverted index."""
    doc_id: int
    term_frequency: int
    positions: List[int] = field(default_factory=list)


@dataclass
class PostingsList:
    """List of all postings for a single term."""
    term: str
    doc_frequency: int
    postings: List[Posting] = field(default_factory=list)
    
    def intersect(self, other: 'PostingsList') -> 'PostingsList':
        """
        Intersect two posting lists (for AND queries).
        Uses merge algorithm - O(n + m).
        """
        result = PostingsList(
            term=f"{self.term} AND {other.term}",
            doc_frequency=0,
            postings=[]
        )
        
        i, j = 0, 0
        while i < len(self.postings) and j < len(other.postings):
            if self.postings[i].doc_id == other.postings[j].doc_id:
                combined = Posting(
                    doc_id=self.postings[i].doc_id,
                    term_frequency=self.postings[i].term_frequency + 
                                   other.postings[j].term_frequency,
                    positions=sorted(self.postings[i].positions + 
                                    other.postings[j].positions)
                )
                result.postings.append(combined)
                i += 1
                j += 1
            elif self.postings[i].doc_id < other.postings[j].doc_id:
                i += 1
            else:
                j += 1
        
        result.doc_frequency = len(result.postings)
        return result


class InvertedIndexShard:
    """
    A single shard of the inverted index.
    
    In production:
    - Each shard handles ~1-10M documents
    - Index is memory-mapped for fast access
    - Postings are sorted by doc_id for efficient intersection
    """
    
    def __init__(self, shard_id: int, doc_range: tuple):
        self.shard_id = shard_id
        self.doc_range = doc_range
        self.index: Dict[str, PostingsList] = {}
        self.doc_count = 0
        
    def add_document(self, doc_id: int, tokens: List[str]) -> None:
        """Index a single document."""
        term_positions: Dict[str, List[int]] = defaultdict(list)
        for position, token in enumerate(tokens):
            term_positions[token].append(position)
        
        for term, positions in term_positions.items():
            if term not in self.index:
                self.index[term] = PostingsList(term=term, doc_frequency=0)
            
            posting = Posting(
                doc_id=doc_id,
                term_frequency=len(positions),
                positions=positions
            )
            
            # insort(..., key=...) requires Python 3.10+; it keeps postings
            # sorted by doc_id so intersect() can use a linear merge.
            bisect.insort(
                self.index[term].postings,
                posting,
                key=lambda p: p.doc_id
            )
            self.index[term].doc_frequency += 1
        
        self.doc_count += 1
    
    def search(self, query_terms: List[str]) -> PostingsList:
        """Search for documents containing all query terms."""
        if not query_terms:
            return PostingsList(term="", doc_frequency=0)
        
        postings_lists = []
        for term in query_terms:
            pl = self.index.get(term)
            if pl is None:
                return PostingsList(term=term, doc_frequency=0)
            postings_lists.append(pl)
        
        # Sort by document frequency (shortest first)
        postings_lists.sort(key=lambda pl: pl.doc_frequency)
        
        # Intersect all posting lists
        result = postings_lists[0]
        for pl in postings_lists[1:]:
            result = result.intersect(pl)
            if result.doc_frequency == 0:
                break
        
        return result
    
    def compute_idf(self, doc_frequency: int) -> float:
        """Compute Inverse Document Frequency."""
        if doc_frequency == 0:
            return 0.0
        return math.log(self.doc_count / doc_frequency)
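The `compute_idf` helper gives the corpus-rarity half of a classic TF-IDF score; pairing it with a sublinear term-frequency weight yields a minimal document scorer. This is one common TF-IDF variant, shown as a sketch rather than what any particular engine uses:

```python
import math


def tfidf_score(term_frequency: int, doc_frequency: int, doc_count: int) -> float:
    """Score one (term, document) pair: high when the term is frequent
    in the document but rare across the corpus."""
    if term_frequency == 0 or doc_frequency == 0:
        return 0.0
    tf = 1 + math.log(term_frequency)            # sublinear TF damping
    idf = math.log(doc_count / doc_frequency)    # rare terms weigh more
    return tf * idf
```

A query's score for a document is then the sum of `tfidf_score` over the query terms; BM25 refines this idea with term-frequency saturation and document-length normalization.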

Deep Dive 2: PageRank Algorithm (Week 5 - Consistency & Coordination)

Interviewer: "Tell me about PageRank. How does Google determine page importance?"

You: "PageRank is the algorithm that made Google famous. It's essentially a voting system where links are votes."

The PageRank Concept

PAGERANK: THE RANDOM SURFER MODEL

Imagine a random web surfer:
- Start on a random page
- Follow a random link on that page
- Repeat forever

PageRank = probability that the random surfer lands on a page

Pages with more incoming links β†’ higher probability β†’ higher PageRank

The Mathematics

PAGERANK FORMULA

PR(A) = (1-d)/N + d Γ— Ξ£(PR(i)/L(i))

Where:
  PR(A) = PageRank of page A
  d     = Damping factor (typically 0.85)
  N     = Total number of pages
  PR(i) = PageRank of pages linking to A
  L(i)  = Number of outgoing links from page i

The damping factor represents the probability that the random
surfer gets bored and jumps to a random page instead of following links.
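The formula can be sanity-checked numerically with plain dictionaries on a tiny three-page graph (no dangling pages here, so no correction term is needed):

```python
# Numeric check of the PageRank formula on a three-page graph:
# A links to B and C, B links to C, C links to A.
d, N = 0.85, 3
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {page: 1 / N for page in links}          # start with equal rank

for _ in range(50):                           # power iteration
    pr = {
        page: (1 - d) / N
        + d * sum(pr[src] / len(links[src]) for src in links if page in links[src])
        for page in links
    }

assert abs(sum(pr.values()) - 1.0) < 1e-6     # ranks form a probability distribution
assert pr["C"] > pr["B"]                      # C has two in-links, B has one
```

C ends up with the highest rank because both A and B vote for it, and A beats B because it receives C's entire (large) rank rather than half of A's.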

Implementation

# ranking/pagerank.py

"""
PageRank Computation at Scale
Demonstrates how to compute PageRank for billions of pages.
"""

from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WebPage:
    """Represents a web page in the link graph."""
    url: str
    outgoing_links: List[str]
    incoming_links: Optional[List[str]] = None  # populated by build_reverse_links()
    pagerank: float = 1.0


class PageRankComputer:
    """
    Computes PageRank scores for a web graph.
    
    At Google scale:
    - Graph has 100+ billion nodes
    - Computation takes days on thousands of machines
    - Uses MapReduce for distributed iteration
    """
    
    def __init__(self, damping_factor: float = 0.85, max_iterations: int = 100,
                 convergence_threshold: float = 1e-6):
        self.damping_factor = damping_factor
        self.max_iterations = max_iterations
        self.convergence_threshold = convergence_threshold
        self.pages: Dict[str, WebPage] = {}
        self.num_pages = 0
        
    def add_page(self, url: str, outgoing_links: List[str]) -> None:
        """Add a page to the graph."""
        self.pages[url] = WebPage(url=url, outgoing_links=outgoing_links)
        
        for link in outgoing_links:
            if link not in self.pages:
                self.pages[link] = WebPage(url=link, outgoing_links=[])
        
        self.num_pages = len(self.pages)
    
    def build_reverse_links(self) -> None:
        """Build incoming link lists for each page."""
        for page in self.pages.values():
            page.incoming_links = []
        
        for url, page in self.pages.items():
            for link in page.outgoing_links:
                if link in self.pages:
                    self.pages[link].incoming_links.append(url)
    
    def compute(self) -> Dict[str, float]:
        """
        Compute PageRank using power iteration.
        
        Algorithm:
        1. Initialize all pages with equal rank (1/N)
        2. For each iteration:
           - Sum contributions from incoming links
           - Apply damping factor
        3. Repeat until convergence
        """
        if not self.pages:
            return {}
        
        self.build_reverse_links()
        
        # Initialize PageRank values
        initial_rank = 1.0 / self.num_pages
        for page in self.pages.values():
            page.pagerank = initial_rank
        
        # Iterative computation
        for iteration in range(self.max_iterations):
            new_ranks: Dict[str, float] = {}
            
            # Handle dangling pages
            dangling_sum = sum(
                page.pagerank for page in self.pages.values()
                if len(page.outgoing_links) == 0
            )
            dangling_contribution = dangling_sum / self.num_pages
            
            # Compute new rank for each page
            for url, page in self.pages.items():
                new_rank = (1 - self.damping_factor) / self.num_pages
                new_rank += self.damping_factor * dangling_contribution
                
                for source_url in page.incoming_links:
                    source_page = self.pages[source_url]
                    num_outlinks = len(source_page.outgoing_links)
                    if num_outlinks > 0:
                        contribution = source_page.pagerank / num_outlinks
                        new_rank += self.damping_factor * contribution
                
                new_ranks[url] = new_rank
            
            # Check for convergence
            max_diff = max(
                abs(new_ranks[url] - self.pages[url].pagerank)
                for url in self.pages
            )
            
            for url, rank in new_ranks.items():
                self.pages[url].pagerank = rank
            
            if max_diff < self.convergence_threshold:
                print(f"Converged after {iteration + 1} iterations")
                break
        
        return {url: page.pagerank for url, page in self.pages.items()}
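To make the algorithm concrete, the same power iteration can be run standalone on a toy three-page graph. This is an illustrative sketch (the `pagerank` helper below is not part of the class above), but it follows the same steps: uniform initialization, dangling-mass redistribution, damping, and a convergence check.

```python
def pagerank(links, damping=0.85, iterations=100, tol=1e-9):
    """Power-iteration PageRank over a {url: [outlinks]} adjacency dict."""
    pages = set(links) | {l for outs in links.values() for l in outs}
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}  # Step 1: equal rank 1/N

    for _ in range(iterations):
        # Rank held by dangling pages is spread evenly over all pages
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new = {}
        for p in pages:
            r = (1 - damping) / n + damping * dangling / n
            for q, outs in links.items():
                if p in outs:
                    r += damping * ranks[q] / len(outs)
            new[p] = r
        converged = max(abs(new[p] - ranks[p]) for p in pages) < tol
        ranks = new
        if converged:
            break
    return ranks

# A and C both link to B; B links back to A; C gets no inbound links
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["B"]})
```

Total rank stays at 1.0 across iterations (the dangling term conserves mass), and B ends up ranked highest since both other pages endorse it.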

Deep Dive 3: Query Processing Pipeline (Week 2 - Timeouts & Latency)

Interviewer: "Walk me through what happens when a user types a query."

You: "Query processing is where all the latency pressure is. We have 500ms to do everything."

Query Processing Flow

QUERY PROCESSING TIMELINE (Target: <500ms)

User types: "best pizza near me"
            β”‚
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 1: QUERY UNDERSTANDING (20ms budget)                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚Spell Check   β”‚  β”‚ Tokenize     β”‚  β”‚ Synonyms     β”‚  β”‚ Intent       β”‚   β”‚
│  │"piza"→"pizza"│  │ Split words  │  │ best→top     │  │ Local search │   │
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 2: RETRIEVAL (100ms budget)                                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Query sent to all index shards in parallel                               β”‚
β”‚  Intersect posting lists, return top 1000 candidates per shard            β”‚
β”‚  Total candidates: ~10,000 documents                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 3: SCORING (200ms budget)                                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  For each candidate, compute scores:                                      β”‚
β”‚  - Text Relevance (BM25): Weight 0.3                                      β”‚
β”‚  - PageRank: Weight 0.2                                                   β”‚
β”‚  - Freshness: Weight 0.1                                                  β”‚
β”‚  - Location Match: Weight 0.2                                             β”‚
β”‚  - Click History: Weight 0.2                                              β”‚
β”‚  Sort by final score, take top 100                                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 4: SNIPPET GENERATION (50ms budget)                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  For top 10 results:                                                      β”‚
β”‚  - Fetch document content (or cached snippet)                             β”‚
β”‚  - Find best passage containing query terms                               β”‚
β”‚  - Highlight matching terms                                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 5: RESPONSE (30ms budget)                                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Total time: ~400ms (target <500ms)                                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
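The Stage 3 scoring in the timeline amounts to a linear combination of normalized signals. A minimal sketch (weights taken from the box above; the signal names and dict-based API are illustrative):

```python
# Signal weights mirroring Stage 3 of the timeline above
WEIGHTS = {
    "bm25": 0.3,       # text relevance
    "pagerank": 0.2,   # link authority
    "freshness": 0.1,  # recency
    "location": 0.2,   # local intent match
    "clicks": 0.2,     # historical click quality
}

def final_score(signals: dict) -> float:
    """Linear combination of normalized per-signal scores (each in [0, 1])."""
    return sum(weight * signals.get(name, 0.0)
               for name, weight in WEIGHTS.items())

# A fresh, locally relevant page with a decent text match
score = final_score({"bm25": 0.8, "pagerank": 0.5, "freshness": 1.0,
                     "location": 0.9, "clicks": 0.6})
```

Because the weights sum to 1.0, a perfect page scores exactly 1.0, which keeps scores comparable across queries.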

BM25 Scoring Implementation

# query/scorer.py

"""
BM25 Ranking Algorithm
An improvement over TF-IDF that handles term frequency saturation
and document length normalization.
"""

import math
from typing import Dict, List, Optional


class BM25Scorer:
    """
    BM25 scoring for search relevance.
    
    BM25 is the industry standard for text relevance scoring.
    It improves on TF-IDF by:
    - Saturating term frequency (diminishing returns)
    - Normalizing for document length
    """
    
    K1 = 1.2  # Term frequency saturation parameter
    B = 0.75  # Length normalization parameter
    
    def __init__(self, avg_doc_length: float, total_docs: int):
        self.avg_doc_length = avg_doc_length
        self.total_docs = total_docs
    
    def score(self, doc: Dict, query_terms: List[str]) -> float:
        """
        Compute BM25 score for a document.
        
        Args:
            doc: Document with term_freqs, doc_length, doc_freqs
            query_terms: List of query terms
            
        Returns:
            BM25 score (higher = more relevant)
        """
        score = 0.0
        doc_length = doc.get("doc_length", self.avg_doc_length)
        term_freqs = doc.get("term_freqs", {})
        
        for term in query_terms:
            tf = term_freqs.get(term, 0)
            if tf == 0:
                continue
            
            # IDF component
            df = doc.get("doc_freqs", {}).get(term, 1)
            idf = math.log((self.total_docs - df + 0.5) / (df + 0.5) + 1)
            
            # TF component with saturation
            numerator = tf * (self.K1 + 1)
            denominator = tf + self.K1 * (
                1 - self.B + self.B * (doc_length / self.avg_doc_length)
            )
            
            score += idf * (numerator / denominator)
        
        return score
    
    def compute_freshness(self, last_updated) -> float:
        """Compute freshness score with exponential decay."""
        from datetime import datetime
        
        if last_updated is None:
            return 0.5
        
        age_days = (datetime.utcnow() - last_updated).days
        half_life = 30  # 30-day half-life
        return math.exp(-0.693 * age_days / half_life)
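A quick worked example shows the saturation property in action. This uses the same K1/B constants and formula as the scorer above, but the corpus statistics are made-up numbers for illustration:

```python
import math

K1, B = 1.2, 0.75
TOTAL_DOCS, AVG_LEN = 1_000_000, 500  # assumed corpus statistics

def bm25_term(tf: int, df: int, doc_len: int) -> float:
    """One term's BM25 contribution (same formula as BM25Scorer.score)."""
    idf = math.log((TOTAL_DOCS - df + 0.5) / (df + 0.5) + 1)
    numerator = tf * (K1 + 1)
    denominator = tf + K1 * (1 - B + B * doc_len / AVG_LEN)
    return idf * numerator / denominator

# Saturation: going from 5 to 10 occurrences adds far less
# than the first 5 occurrences did
s1 = bm25_term(tf=5, df=1000, doc_len=500)
s2 = bm25_term(tf=10, df=1000, doc_len=500)
```

This diminishing-returns curve is what prevents keyword stuffing from dominating the score, unlike raw TF-IDF.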

Deep Dive 4: Web Crawling at Scale (Week 3 - Async Processing)

Interviewer: "How do you crawl billions of pages efficiently?"

You: "Crawling is fundamentally an async, distributed problem."

Crawling Architecture

WEB CRAWLING PIPELINE

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          URL FRONTIER                                       β”‚
β”‚   Priority Queue of URLs to crawl, ordered by:                              β”‚
β”‚   - Importance (PageRank of linking pages)                                  β”‚
β”‚   - Freshness (time since last crawl)                                       β”‚
β”‚   - Domain priority (news sites crawled more often)                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                    β”‚
                                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         CRAWLER WORKERS                                     β”‚
β”‚   1000s of distributed workers fetching pages in parallel                   β”‚
β”‚                                                                             β”‚
β”‚   POLITENESS CONSTRAINTS:                                                   β”‚
β”‚   - Max 1 request per second per domain                                     β”‚
β”‚   - Respect robots.txt                                                      β”‚
β”‚   - Respect Crawl-Delay directive                                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                    β”‚
                                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         CONTENT PROCESSOR                                   β”‚
β”‚   Parse HTML β†’ Extract Links & Text β†’ Detect Language β†’ Store Content       β”‚
β”‚   New URLs fed back to URL Frontier                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Crawler Implementation

# crawler/web_crawler.py

"""
Distributed Web Crawler
Demonstrates key crawling concepts:
- Politeness (rate limiting per domain)
- Priority-based URL frontier
- Parallel fetching with async IO
"""

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set
from datetime import datetime
from urllib.parse import urlparse, urljoin
import asyncio
import heapq


@dataclass
class CrawlURL:
    """A URL in the crawl frontier."""
    url: str
    priority: float
    last_crawl: Optional[datetime] = None
    depth: int = 0
    
    def __lt__(self, other):
        return self.priority > other.priority


class URLFrontier:
    """
    Priority queue of URLs to crawl.
    
    Key features:
    - Priority-based ordering
    - Deduplication
    - Per-domain queuing for politeness
    """
    
    def __init__(self, max_size: int = 10_000_000):
        self.max_size = max_size
        self.queue: List[CrawlURL] = []
        self.seen_urls: Set[str] = set()
        self.domain_last_access: Dict[str, datetime] = {}
        
    def add(self, url: str, priority: float = 0.5, depth: int = 0) -> bool:
        """Add URL to frontier if not seen."""
        # Normalize: drop fragment, then trailing slash; lowercasing the
        # whole URL is a simplification (real canonicalizers lowercase
        # only the scheme and host, since paths are case-sensitive)
        url = url.split("#")[0].rstrip("/").lower()
        
        if url in self.seen_urls:
            return False
        
        if len(self.seen_urls) >= self.max_size:
            return False
        
        crawl_url = CrawlURL(url=url, priority=priority, depth=depth)
        heapq.heappush(self.queue, crawl_url)
        self.seen_urls.add(url)
        return True
    
    def get_next(self) -> Optional[CrawlURL]:
        """Get next URL respecting politeness constraints."""
        if not self.queue:
            return None
        
        now = datetime.utcnow()
        deferred: List[CrawlURL] = []
        result: Optional[CrawlURL] = None
        
        while self.queue:
            crawl_url = heapq.heappop(self.queue)
            domain = urlparse(crawl_url.url).netloc
            
            # Politeness: this domain was hit <1s ago, so set the URL
            # aside. (Re-pushing it immediately and popping it again
            # would spin forever when only one domain is pending.)
            last_access = self.domain_last_access.get(domain)
            if last_access and (now - last_access).total_seconds() < 1.0:
                deferred.append(crawl_url)
                continue
            
            self.domain_last_access[domain] = now
            result = crawl_url
            break
        
        # Return deferred URLs to the queue for a later attempt
        for item in deferred:
            heapq.heappush(self.queue, item)
        
        return result


class WebCrawler:
    """
    Distributed web crawler with async IO.
    """
    
    def __init__(self, num_workers: int = 100, max_pages: int = 1_000_000):
        self.num_workers = num_workers
        self.max_pages = max_pages
        self.frontier = URLFrontier()
        self.pages_crawled = 0
    
    def add_seeds(self, seed_urls: List[str]) -> None:
        """Add seed URLs to start crawling."""
        for url in seed_urls:
            self.frontier.add(url, priority=1.0, depth=0)
    
    async def crawl(self) -> None:
        """Start crawling with multiple workers."""
        workers = [
            asyncio.create_task(self._worker(i))
            for i in range(self.num_workers)
        ]
        await asyncio.gather(*workers)
    
    async def _worker(self, worker_id: int) -> None:
        """Single crawler worker."""
        import aiohttp
        
        async with aiohttp.ClientSession(
            headers={"User-Agent": "GoogleBot/2.1"}
        ) as session:
            while self.pages_crawled < self.max_pages:
                crawl_url = self.frontier.get_next()
                if crawl_url is None:
                    await asyncio.sleep(0.1)
                    continue
                
                try:
                    async with session.get(
                        crawl_url.url,
                        timeout=aiohttp.ClientTimeout(total=30)
                    ) as response:
                        if response.status == 200:
                            content = await response.text()
                            links = self._extract_links(crawl_url.url, content)
                            
                            for link in links:
                                self.frontier.add(
                                    link, priority=0.5,
                                    depth=crawl_url.depth + 1
                                )
                            
                            self.pages_crawled += 1
                except Exception:
                    # Fetch failures are routine at web scale; a production
                    # crawler would log the error and schedule a retry
                    # with backoff rather than silently dropping the URL
                    continue
    
    def _extract_links(self, base_url: str, html: str) -> List[str]:
        """Extract links from HTML content."""
        import re
        links = []
        href_pattern = re.compile(r'href=["\']([^"\']+)["\']')
        
        for match in href_pattern.finditer(html):
            href = match.group(1)
            if href.startswith(("javascript:", "mailto:", "#")):
                continue
            
            absolute_url = urljoin(base_url, href)
            if absolute_url.startswith(("http://", "https://")):
                links.append(absolute_url)
        
        return links
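The link-extraction logic above works standalone as well. A usage example (mirroring `_extract_links`; the sample HTML is illustrative):

```python
import re
from urllib.parse import urljoin

def extract_links(base_url: str, html: str) -> list:
    """Naive href scan, as in WebCrawler._extract_links above."""
    links = []
    for match in re.finditer(r'href=["\']([^"\']+)["\']', html):
        href = match.group(1)
        # Skip non-navigable links
        if href.startswith(("javascript:", "mailto:", "#")):
            continue
        absolute_url = urljoin(base_url, href)  # resolve relative paths
        if absolute_url.startswith(("http://", "https://")):
            links.append(absolute_url)
    return links

html = ('<a href="/about">About</a> '
        '<a href="#top">Top</a> '
        '<a href="mailto:x@y.z">Mail</a>')
links = extract_links("https://example.com/index.html", html)
```

The relative `/about` link is resolved against the base URL, while the fragment and mailto links are filtered out. A real crawler would use a proper HTML parser rather than a regex, which this sketch trades away for brevity.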

Phase 5: Scaling and Edge Cases (5 minutes)

Interviewer: "How would you scale this to handle 10x traffic?"

Scaling Strategy

SCALING FROM 10K TO 100K QPS

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       SCALING STRATEGY                                      β”‚
β”‚                                                                             β”‚
β”‚   QUERY PATH                                                                β”‚
β”‚   ──────────                                                                β”‚
β”‚   Current: 10K QPS  β†’  Target: 100K QPS                                     β”‚
β”‚                                                                             β”‚
β”‚   Strategy:                                                                 β”‚
β”‚   1. Add more query servers (10x)                                           β”‚
β”‚   2. Increase index shard replicas (3x β†’ 10x)                               β”‚
β”‚   3. Add query result cache (hit rate ~30%)                                 β”‚
β”‚   4. Deploy in more regions (latency optimization)                          β”‚
β”‚                                                                             β”‚
β”‚   INDEX PATH                                                                β”‚
β”‚   ──────────                                                                β”‚
β”‚   Current: 50B documents  β†’  Target: 500B documents                         β”‚
β”‚                                                                             β”‚
β”‚   Strategy:                                                                 β”‚
β”‚   1. Add more index shards (10x)                                            β”‚
β”‚   2. Tiered storage (hot/warm/cold)                                         β”‚
β”‚   3. Better compression (reduce storage cost)                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
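The result cache in step 3 has a direct capacity consequence worth checking with back-of-envelope arithmetic (the per-server throughput below is an assumed figure, not from the source):

```python
# Capacity math for the 10x plan above
target_qps = 100_000
cache_hit_pct = 30                # "query result cache (hit rate ~30%)"
backend_qps = target_qps * (100 - cache_hit_pct) // 100

per_server_qps = 500              # assumed throughput per query server
query_servers = backend_qps // per_server_qps
```

With a 30% hit rate, only 70,000 QPS ever reach the index tier, so the cache alone absorbs nearly a third of the 10x growth before any new hardware is provisioned.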

Edge Cases

Edge Case 1: Query Spikes

Scenario: Breaking news causes 10x traffic spike
Problem: Index shards for hot terms get overwhelmed
Solution:
  - Query result caching with short TTL
  - Replicate hot index shards dynamically
  - Graceful degradation (return slightly stale results)
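The short-TTL result cache from the solution list can be sketched in a few lines. This is a minimal in-process version (a production deployment would use a distributed cache instead):

```python
import time

class QueryCache:
    """Minimal TTL cache for query results (sketch for Edge Case 1)."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (expires_at, results)

    def get(self, query):
        entry = self.store.get(query)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        self.store.pop(query, None)  # drop expired or missing entries
        return None

    def put(self, query, results):
        self.store[query] = (time.monotonic() + self.ttl, results)

cache = QueryCache(ttl_seconds=60)
cache.put("best pizza", ["result1", "result2"])
```

A short TTL (seconds to minutes) is the key design choice: during a spike, serving results that are up to a minute stale is far cheaper than recomputing the same hot query thousands of times per second.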

Edge Case 2: Index Freshness

Scenario: Celebrity dies, but search shows them as "alive"
Problem: Cached PageRank and index data is stale
Solution:
  - Breaking news detection pipeline
  - Fast-path indexing for high-priority content
  - Real-time signals from trusted sources

Edge Case 3: Spam Attacks

Scenario: SEO farms create millions of spam pages
Problem: Index quality degrades
Solution:
  - Link graph analysis (detect link farms)
  - Content fingerprinting (detect duplicates)
  - ML spam classifiers

Edge Case 4: Crawl Traps

Scenario: Website generates infinite URLs
Problem: Crawler gets stuck, wastes resources
Solution:
  - Detect URL patterns generating infinite content
  - Limit crawl depth per domain
  - Track unique content ratio

Phase 6: Monitoring and Operations

Monitoring Dashboard

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       SEARCH ENGINE MONITORING                               β”‚
β”‚                                                                              β”‚
β”‚   QUERY PERFORMANCE                                                          β”‚
β”‚   ─────────────────                                                          β”‚
β”‚   QPS:                      87,432 /sec          [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 87%            β”‚
β”‚   Latency p50:              145 ms               [β–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘] 29%            β”‚
β”‚   Latency p99:              412 ms               [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘] 82%            β”‚
β”‚   Error rate:               0.02%                [β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] OK             β”‚
β”‚                                                                              β”‚
β”‚   INDEX HEALTH                                                               β”‚
β”‚   ────────────                                                               β”‚
β”‚   Indexed docs:             48.2B                [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] 96%            β”‚
β”‚   Index freshness:          2.3 hours avg        [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘] OK             β”‚
β”‚   Shard replicas healthy:   99.97%               [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] OK             β”‚
β”‚                                                                              β”‚
β”‚   QUALITY SIGNALS                                                            β”‚
β”‚   ───────────────                                                            β”‚
β”‚   Click-through rate:       34.2%                [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] ↑2%            β”‚
β”‚   Spam detection rate:      99.3%                [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ] OK             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Alerting Strategy

CRITICAL (Page immediately):
  - Query latency p99 > 1s for > 2 min
  - Error rate > 1%
  - Index shard unavailable

WARNING (Slack notification):
  - Query latency p99 > 500ms for > 5 min
  - Index freshness > 12 hours
  - Click-through rate drops > 10%

Interview Conclusion

Interviewer: "Excellent work. You've demonstrated strong understanding of distributed systems, information retrieval, and the specific challenges of building a search engine. Any questions?"

You: "Thank you! I'd love to hear more about how your team handles the cold start problem for new content β€” how quickly can fresh content appear in search results?"


Summary: Concepts Applied

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              CONCEPTS FROM 10-WEEK COURSE IN GOOGLE SEARCH                   β”‚
β”‚                                                                              β”‚
β”‚   WEEK 1: DATA AT SCALE                                                      β”‚
β”‚   β”œβ”€β”€ Partitioning: Index sharded by document ID                             β”‚
β”‚   β”œβ”€β”€ Replication: Each shard replicated 3-10x                               β”‚
β”‚   └── Hot Keys: Popular queries cached, hot shards replicated                β”‚
β”‚                                                                              β”‚
β”‚   WEEK 2: FAILURE-FIRST DESIGN                                               β”‚
β”‚   β”œβ”€β”€ Timeout budgets: Query stages have strict time limits                  β”‚
β”‚   β”œβ”€β”€ Graceful degradation: Return partial results on timeout                β”‚
β”‚   └── Circuit breakers: Stop sending to unhealthy shards                     β”‚
β”‚                                                                              β”‚
β”‚   WEEK 3: MESSAGING & ASYNC                                                  β”‚
β”‚   β”œβ”€β”€ Async processing: Crawler uses async IO                                β”‚
β”‚   β”œβ”€β”€ Rate limiting: Per-domain crawl limits                                 β”‚
β”‚   └── Backpressure: URL frontier limits prevent exhaustion                   β”‚
β”‚                                                                              β”‚
β”‚   WEEK 5: CONSISTENCY & COORDINATION                                         β”‚
β”‚   β”œβ”€β”€ Eventual consistency: Index updates over minutes/hours                 β”‚
β”‚   └── Distributed computation: PageRank via MapReduce                        β”‚
β”‚                                                                              β”‚
β”‚   WEEK 10: PRODUCTION READINESS                                              β”‚
β”‚   β”œβ”€β”€ SLOs: Query latency p99 < 500ms, 99.99% availability                   β”‚
β”‚   └── Observability: QPS, latency, error rates, quality signals              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                              β”‚
β”‚              WHY GOOGLE SEARCH IS AN ENGINEERING MARVEL                      β”‚
β”‚                                                                              β”‚
β”‚   SCALE                                                                      β”‚
β”‚   ─────                                                                      β”‚
β”‚   β€’ 8.5+ billion queries per day answered in milliseconds                    β”‚
β”‚   β€’ 400+ billion documents indexed and searchable                            β”‚
β”‚   β€’ 130+ trillion URLs known to exist on the web                             β”‚
β”‚                                                                              β”‚
β”‚   LATENCY                                                                    β”‚
β”‚   ───────                                                                    β”‚
β”‚   β€’ Sub-500ms response time for complex queries                              β”‚
β”‚   β€’ Results from 10,000+ index shards merged in real-time                    β”‚
β”‚                                                                              β”‚
β”‚   QUALITY                                                                    β”‚
β”‚   ───────                                                                    β”‚
β”‚   β€’ PageRank revolutionized web search quality                               β”‚
β”‚   β€’ 200+ ranking signals for relevance                                       β”‚
β”‚   β€’ 99%+ spam detection rate                                                 β”‚
β”‚                                                                              β”‚
β”‚   "Organizing the world's information and making it universally              β”‚
β”‚    accessible and useful" β€” achieved at unprecedented scale.                 β”‚
β”‚                                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜


Further Reading

Books:

  • "Introduction to Information Retrieval" by Manning, Raghavan, SchΓΌtze
  • "The Google Story" by David Vise

Papers:

  • "The PageRank Citation Ranking" - Page & Brin, 1998
  • "MapReduce: Simplified Data Processing" - Dean & Ghemawat, 2004
  • "The Google File System" - Ghemawat et al., 2003

Self-Assessment Checklist

After studying this case study, you should be able to:

  • Explain how an inverted index works
  • Describe the PageRank algorithm
  • Design a distributed crawling system
  • Explain how to shard an index across thousands of machines
  • Describe the query processing pipeline with timeout budgets
  • Calculate back-of-envelope estimates for a search engine
  • Explain BM25 scoring
  • Describe strategies for index freshness
  • Design monitoring and alerting for search quality

End of Bonus Problem 5: Google Search

D. HITS Algorithm (Hubs and Authorities)

  β€’ Authority: Page with good content (many pages link TO it)

  • Hub: Page that links to good content (links TO many authorities)

Mutually reinforcing:

  • Good hubs link to good authorities
  • Good authorities are linked from good hubs

Algorithm:

  1. Initialize all pages with authority=1, hub=1
  2. Authority update: auth(p) = Ξ£ hub(q) for all q linking to p
  3. Hub update: hub(p) = Ξ£ auth(q) for all q that p links to
  4. Normalize scores
  5. Repeat until convergence

Use case:

  • Query-specific ranking (PageRank is query-independent)
  • Find topic experts and resource pages

E. Real-Time Search Signals

Query Click Data

CLICK DATA FOR RANKING

Google uses billions of search clicks to improve ranking:

Signals extracted:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                             β”‚
β”‚   CLICK-THROUGH RATE (CTR)                                                  β”‚
β”‚   ─────────────────────────                                                 β”‚
β”‚   Higher CTR = More relevant for that query                                 β”‚
β”‚   Adjusted for position (top results naturally get more clicks)             β”‚
β”‚                                                                             β”‚
β”‚   DWELL TIME                                                                β”‚
β”‚   ──────────                                                                β”‚
β”‚   Time spent on page after clicking                                         β”‚
β”‚   Long dwell = User found what they wanted                                  β”‚
β”‚   Short dwell + back button = Poor result ("pogo-sticking")                 β”‚
β”‚                                                                             β”‚
β”‚   QUERY REFORMULATION                                                       β”‚
β”‚   ─────────────────────                                                     β”‚
β”‚   User searches again with different terms?                                 β”‚
β”‚   Original results were poor                                                β”‚
β”‚                                                                             β”‚
β”‚   LAST CLICK                                                                β”‚
β”‚   ──────────                                                                β”‚
β”‚   Final result the user clicked (didn't come back)                          β”‚
β”‚   Strong signal of satisfaction                                             β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Privacy considerations:
- Data is aggregated (no individual tracking)
- Minimum threshold of queries before using signal
- Used for ranking, not targeting
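The position adjustment mentioned in the box can be sketched as follows. The position-bias priors here are illustrative assumptions; real systems estimate them from click models and randomized interleaving experiments:

```python
# Sketch: position-adjusted CTR. Results at rank 1 naturally get more clicks,
# so raw CTR is divided by the expected click propensity at that rank.
POSITION_BIAS = {1: 1.00, 2: 0.55, 3: 0.35, 4: 0.25, 5: 0.20}  # assumed priors

def adjusted_ctr(clicks, impressions, position):
    """Divide raw CTR by the expected click propensity at that rank."""
    raw_ctr = clicks / impressions
    return raw_ctr / POSITION_BIAS.get(position, 0.10)

# After adjustment, a 10% raw CTR at rank 3 beats a 20% raw CTR at rank 1
rank3 = adjusted_ctr(100, 1_000, 3)
rank1 = adjusted_ctr(200, 1_000, 1)
```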

Breaking News Detection

BREAKING NEWS PIPELINE

Goal: Index critical news within minutes

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    BREAKING NEWS ARCHITECTURE                              β”‚
β”‚                                                                            β”‚
β”‚   SIGNAL SOURCES                                                           β”‚
β”‚   ──────────────                                                           β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
β”‚   β”‚  Twitter   β”‚ β”‚   Google   β”‚ β”‚    News    β”‚ β”‚  Google    β”‚              β”‚
β”‚   β”‚  Trends    β”‚ β”‚   Trends   β”‚ β”‚   Feeds    β”‚ β”‚   Alerts   β”‚              β”‚
β”‚   β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”‚         β”‚              β”‚              β”‚              β”‚                     β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     β”‚
β”‚                                β”‚                                           β”‚
β”‚                                β–Ό                                           β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚   β”‚              BREAKING NEWS DETECTOR                         β”‚          β”‚
β”‚   β”‚                                                             β”‚          β”‚
β”‚   β”‚  β€’ Spike detection in query volume                          β”‚          β”‚
β”‚   β”‚  β€’ New entity mentions (people, places, events)             β”‚          β”‚
β”‚   β”‚  β€’ Sentiment shift detection                                β”‚          β”‚
β”‚   β”‚  β€’ Cross-source corroboration                               β”‚          β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚                                  β”‚                                         β”‚
β”‚                                  β–Ό                                         β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚   β”‚              PRIORITY CRAWL QUEUE                           β”‚          β”‚
β”‚   β”‚                                                             β”‚          β”‚
β”‚   β”‚  β€’ Jump queue for news sources                              β”‚          β”‚
β”‚   β”‚  β€’ Crawl trusted sources first (AP, Reuters, BBC)           β”‚          β”‚
β”‚   β”‚  β€’ Multiple crawls as story develops                        β”‚          β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚                                  β”‚                                         β”‚
β”‚                                  β–Ό                                         β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚   β”‚              REAL-TIME INDEX (TIER 0)                       β”‚          β”‚
β”‚   β”‚                                                             β”‚          β”‚
β”‚   β”‚  β€’ Bypasses normal index build pipeline                     β”‚          β”‚
β”‚   β”‚  β€’ Direct injection into serving index                      β”‚          β”‚
β”‚   β”‚  β€’ Searchable within 2-5 minutes                            β”‚          β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚                                                                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
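The spike detection step at the top of this pipeline can be sketched with a rolling z-score. The window size and the 3-sigma threshold below are illustrative choices, not Google's actual values:

```python
# Sketch of query-volume spike detection: flag a term when its current
# per-minute count sits far above its recent baseline.
import statistics

def is_spike(history, current, z_threshold=3.0):
    """history: recent per-minute query counts for a term."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0   # guard against zero variance
    return (current - mean) / stdev > z_threshold

baseline = [100, 110, 95, 105, 102, 98, 104, 99]   # normal traffic
```

A flagged term would then feed the priority crawl queue; multiple independent sources spiking on the same entity is the cross-source corroboration signal.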

Location-Aware Search

LOCATION-AWARE SEARCH

User query: "pizza"

Without location: Generic results about pizza

With location (user in NYC):
  1. Detect user location (IP, GPS, profile)
  2. Expand query: "pizza" β†’ "pizza near [lat, lng]"
  3. Boost local business results
  4. Show map pack with nearby options

Location signals:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                             β”‚
β”‚   EXPLICIT                                                                  β”‚
β”‚   ────────                                                                  β”‚
β”‚   β€’ Query contains location: "pizza in brooklyn"                            β”‚
β”‚   β€’ User profile location setting                                           β”‚
β”‚   β€’ GPS from mobile device                                                  β”‚
β”‚                                                                             β”‚
β”‚   IMPLICIT                                                                  β”‚
β”‚   ────────                                                                  β”‚
β”‚   β€’ IP geolocation (~city level accuracy)                                   β”‚
β”‚   β€’ WiFi network location                                                   β”‚
β”‚   β€’ Search history patterns                                                 β”‚
β”‚   β€’ Time zone from browser                                                  β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
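The "boost local business results" step can be sketched with a distance-decayed score. The haversine formula is standard; the boost weight and 5 km decay radius are illustrative assumptions:

```python
# Sketch: boost results that are physically close to the user.
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # Earth's mean radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def local_score(base_score, user_loc, biz_loc, decay_km=5.0):
    """Exponentially decay the boost as the business gets farther away."""
    dist = haversine_km(*user_loc, *biz_loc)
    return base_score * (1.0 + math.exp(-dist / decay_km))

nyc_user = (40.7580, -73.9855)   # Times Square
nearby = (40.7614, -73.9776)     # under 1 km away
far = (40.6413, -73.7781)        # JFK airport, ~20 km away
```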

F. Search Personalization

PERSONALIZATION SIGNALS

Google personalizes results based on:

1. SEARCH HISTORY
   - Previous queries indicate interests
   - "java" for programmer β†’ programming results
   - "java" for traveler β†’ Indonesia results

2. LOCATION HISTORY
   - Frequent locations affect local results
   - Home vs work vs travel

3. DEVICE TYPE
   - Mobile: Emphasize mobile-friendly pages
   - Desktop: May show more detailed results

4. LANGUAGE PREFERENCES
   - Browser language
   - Past language selections

5. GOOGLE ACCOUNT DATA
   - Gmail (travel confirmations, purchases)
   - Calendar (upcoming events)
   - YouTube (interests)

Privacy controls:
- Users can disable personalization
- Incognito mode = no personalization
- Data can be deleted from account
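One way these signals could feed ranking is to blend the query-relevance score with a user-interest match, as in the "java" example above. The interest profile, topic labels, and blend weight below are all hypothetical:

```python
# Sketch: boost documents whose topics overlap the user's interest profile.
def personalized_score(base_score, doc_topics, user_interests, weight=0.3):
    """Blend base relevance with an interest-overlap boost (capped at 1.0)."""
    overlap = sum(user_interests.get(t, 0.0) for t in doc_topics)
    return (1 - weight) * base_score + weight * min(overlap, 1.0)

programmer = {"programming": 0.9, "coffee": 0.1}   # hypothetical profile
java_lang = personalized_score(0.5, ["programming"], programmer)
java_island = personalized_score(0.5, ["travel", "indonesia"], programmer)
# The programming result now outranks the travel result for this user
```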

G. Search Quality Evaluation

Human Raters and E-E-A-T

SEARCH QUALITY EVALUATION

Google works with 10,000+ contracted human "Search Quality Raters" who:
- Evaluate search results for quality
- Don't directly change rankings
- Produce ratings that train ML models and validate algorithm changes

E-E-A-T Framework (Experience, Expertise, Authoritativeness, Trustworthiness):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                             β”‚
β”‚   EXPERIENCE                                                                β”‚
β”‚   ──────────                                                                β”‚
β”‚   Does the content creator have first-hand experience?                      β”‚
β”‚   Example: Product review from someone who actually used it                 β”‚
β”‚                                                                             β”‚
β”‚   EXPERTISE                                                                 β”‚
β”‚   ─────────                                                                 β”‚
β”‚   Does the creator have relevant knowledge/skill?                           β”‚
β”‚   Example: Medical advice from a doctor                                     β”‚
β”‚                                                                             β”‚
β”‚   AUTHORITATIVENESS                                                         β”‚
β”‚   ─────────────────                                                         β”‚
β”‚   Is the creator/site recognized as a go-to source?                         β”‚
β”‚   Example: Government site for tax information                              β”‚
β”‚                                                                             β”‚
β”‚   TRUSTWORTHINESS                                                           β”‚
β”‚   ────────────────                                                          β”‚
β”‚   Is the content accurate, honest, safe?                                    β”‚
β”‚   Example: Secure site, accurate facts, transparent authorship              β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

YMYL (Your Money or Your Life) pages:
- Health, finance, safety, legal topics
- Held to higher E-E-A-T standards
- Misinformation can cause real harm

A/B Testing Search Changes

SEARCH EXPERIMENT FRAMEWORK

Every ranking change is A/B tested:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                             β”‚
β”‚   EXPERIMENT SETUP                                                          β”‚
β”‚   ────────────────                                                          β”‚
β”‚   β€’ Control group: Current algorithm                                        β”‚
β”‚   β€’ Treatment group: New algorithm                                          β”‚
β”‚   β€’ Random 1-5% of traffic to treatment                                     β”‚
β”‚                                                                             β”‚
β”‚   METRICS MEASURED                                                          β”‚
β”‚   ────────────────                                                          β”‚
β”‚   β€’ Click-through rate (CTR)                                                β”‚
β”‚   β€’ Time to first click                                                     β”‚
β”‚   β€’ Pogo-stick rate (quick backs)                                           β”‚
β”‚   β€’ Query abandonment rate                                                  β”‚
β”‚   β€’ Long clicks (dwell time > 30s)                                          β”‚
β”‚   β€’ Query reformulation rate                                                β”‚
β”‚                                                                             β”‚
β”‚   LAUNCH CRITERIA                                                           β”‚
β”‚   ───────────────                                                           β”‚
β”‚   β€’ Statistically significant improvement                                   β”‚
β”‚   β€’ No regression on key metrics                                            β”‚
β”‚   β€’ Human rater evaluation positive                                         β”‚
β”‚   β€’ No increase in policy violations                                        β”‚
β”‚                                                                             β”‚
β”‚   Google runs 10,000+ experiments per year                                  β”‚
β”‚   Only ~2,000 changes actually launch                                       β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
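The "statistically significant improvement" criterion typically comes down to a hypothesis test on the experiment metrics. Below is a two-proportion z-test for CTR; the test is standard, but the traffic numbers are made up:

```python
# Sketch: is the treatment group's CTR significantly better than control's?
import math

def ctr_z_score(clicks_a, n_a, clicks_b, n_b):
    """Z statistic for the difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control: 30,000 clicks / 1M queries; treatment: 30,900 clicks / 1M queries
z = ctr_z_score(30_000, 1_000_000, 30_900, 1_000_000)
significant = abs(z) > 1.96   # 95% confidence threshold
```

In practice a launch decision would also require no regression on the guardrail metrics (abandonment, pogo-sticking) and a positive human rater evaluation, per the box above.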

Neural Ranking Models

EVOLUTION OF RANKING

Traditional (2000s):
  TF-IDF + PageRank + Hand-crafted features
  β†’ Linear combination of ~50 signals

Learning to Rank (2010s):
  Gradient boosted trees (LambdaMART)
  β†’ Nonlinear combination of ~200 signals

Neural Ranking (2020s):
  BERT, T5, and transformer models
  β†’ Semantic understanding of queries and documents
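The earliest of these eras can be sketched as a toy linear scorer. The three signals and their weights are illustrative; production systems of that period combined dozens of hand-tuned features, and later eras replaced the linear blend with gradient-boosted trees and neural networks:

```python
# Sketch: "linear combination of signals" ranking, circa the 2000s.
def score(doc, weights):
    """Weighted sum of hand-crafted ranking signals."""
    return sum(weights[s] * doc[s] for s in weights)

weights = {"tfidf": 0.5, "pagerank": 0.3, "freshness": 0.2}  # assumed tuning
doc_good = {"tfidf": 0.9, "pagerank": 0.8, "freshness": 0.5}
doc_weak = {"tfidf": 0.4, "pagerank": 0.2, "freshness": 0.9}
ranked = sorted([doc_good, doc_weak],
                key=lambda d: score(d, weights), reverse=True)
```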

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                             β”‚
β”‚   BERT IN SEARCH                                                            β”‚
β”‚   ──────────────                                                            β”‚
β”‚                                                                             β”‚
β”‚   Query: "can you get medicine for someone pharmacy"                        β”‚
β”‚                                                                             β”‚
β”‚   Before BERT:                                                              β”‚
β”‚   β€’ Matches keywords: medicine, pharmacy                                    β”‚
β”‚   β€’ Misses intent: picking up prescription for another person               β”‚
β”‚                                                                             β”‚
β”‚   With BERT:                                                                β”‚
β”‚   β€’ Understands "for someone" = on behalf of another person                 β”‚
β”‚   β€’ Returns results about pharmacy pickup policies                          β”‚
β”‚                                                                             β”‚
β”‚   Key insight: BERT understands prepositions and context                    β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Generative AI Search (AI Overviews)

AI OVERVIEWS (2024-2025)

Google now generates AI summaries for many queries:

Query: "how to remove coffee stains from carpet"

Traditional results:
  1. [Link to WikiHow article]
  2. [Link to cleaning blog]
  3. [Link to product review]

AI Overview:
  "To remove coffee stains from carpet:
   1. Blot (don't rub) excess liquid immediately
   2. Mix 1 tablespoon dish soap + 1 tablespoon vinegar + 2 cups water
   3. Apply solution to stain, blot with clean cloth
   4. Rinse with cold water and blot dry
   
   [Sources: WikiHow, Good Housekeeping, The Spruce]"

Challenges:
β€’ Attribution and copyright
β€’ Accuracy verification
β€’ Impact on publisher traffic
β€’ Hallucination prevention
β€’ 30x more compute per query

This comprehensive appendix covers advanced topics in search engine design that demonstrate the depth and complexity of building a system like Google Search. These concepts build on the foundational Week 1-10 curriculum and show how theory translates to production systems at massive scale.


End of Bonus Problem 5: Google Search

Document Statistics:

  • Total concepts covered: 25+
  • Code implementations: 5
  • Architecture diagrams: 15+
  • Real-world scale numbers: 50+