Bonus Problem 5: Google Search
The World's Most Complex Information Retrieval System
Answering 8.5 Billion Questions a Day, in Milliseconds
Imagine this challenge: Someone types three words into a search box.
Within 200 milliseconds, you must search through 400 billion documents (thousands of times the number of books ever published), find the most relevant results, rank them by quality and relevance, check for spam, personalize the results, and display them in a beautiful interface.
Now do that 100,000 times per second, every second, 24/7.
This is Google Search: the most sophisticated information retrieval system ever built.
THE GOOGLE SEARCH SCALE (2025)

QUERY VOLUME
  Daily searches:        8.5 - 9.1 billion
  Searches per second:   ~100,000 at peak
  Annual searches:       5+ trillion
  Market share:          91.5% globally

INDEX SIZE
  Indexed documents:     ~400 billion
  Known URLs:            130+ trillion
  Index storage:         100+ petabytes
  New URLs discovered:   billions per day

INFRASTRUCTURE
  Data centers:          30+ worldwide
  Servers (estimated):   2.5+ million
  Power consumption:     ~18.3 TWh/year
  Response time target:  <200 ms (p99)

BUSINESS IMPACT
  Search ad revenue (2024):  $175+ billion
  Revenue per search:        ~$0.03
  Users worldwide:           5+ billion
This is the system we'll design today, and along the way we'll see how search engines achieve the seemingly impossible: sub-second retrieval from hundreds of billions of documents.
The Interview Begins
You walk into the interview room. The interviewer smiles and gestures to the whiteboard.
Interviewer: "Thanks for coming in. Today we're going to work through a system design problem together. I want to see how you think about large-scale distributed systems. Feel free to ask questions β this should be collaborative."
They write on the whiteboard:
Design a Web Search Engine

Build a search engine that can:
- Index billions of web pages
- Return relevant results in <500ms
- Handle 10,000+ queries per second
- Keep the index fresh (new content discoverable within hours)
- Rank results by relevance and quality

Consider:
- How do you discover new pages?
- How do you store and query the index efficiently?
- How do you rank billions of documents?
- How do you handle 100K QPS with sub-second latency?
Interviewer: "Take a few minutes to think about this, then walk me through your approach. We have about 45 minutes."
Phase 1: Requirements Clarification (5 minutes)
Before diving in, you take a breath and start asking questions.
Your Questions
You: "Before I start designing, I'd like to clarify a few requirements. First, what's the scale of the web we need to index? Are we talking millions or billions of pages?"
Interviewer: "Let's say we need to index around 50 billion pages initially, with the capability to scale to hundreds of billions."
You: "And what's our query volume? Peak QPS?"
Interviewer: "Target 10,000 QPS initially, scaling to 100,000 QPS."
You: "What about freshness requirements? How quickly should new content appear in search results?"
Interviewer: "Critical news should appear within minutes. Regular content within hours. Most content within a few days is acceptable."
You: "For ranking, should we implement something like PageRank, or simpler relevance scoring?"
Interviewer: "Start with basic relevance like TF-IDF, then layer in link-based authority scoring like PageRank."
You: "Perfect. Let me summarize the requirements."
Functional Requirements
1. WEB CRAWLING
- Discover and download web pages continuously
- Respect robots.txt and crawl rate limits
- Handle various content types (HTML, PDF, etc.)
- Track page freshness and update frequency
2. INDEXING
- Parse and extract text content from pages
- Build searchable index from page content
- Store page metadata (title, URL, timestamp)
- Support incremental index updates
3. QUERY PROCESSING
- Parse and understand user queries
- Retrieve matching documents efficiently
- Rank results by relevance and quality
- Support features: autocomplete, spell check, filters
4. RANKING
- Text relevance scoring (TF-IDF, BM25)
- Link-based authority (PageRank)
- Freshness signals
- Spam detection and filtering
Non-Functional Requirements
1. SCALE
- 50+ billion indexed pages
- 100+ petabytes of index data
- 100,000 QPS at peak
2. LATENCY
- Query response: <500ms p99
- Index freshness: <24 hours for most content
- Breaking news: <15 minutes
3. AVAILABILITY
- 99.99% uptime
- Graceful degradation under load
- Multi-region deployment
4. CONSISTENCY
- Eventually consistent index
- Stale reads acceptable (within hours)
- Strong consistency for user data only
Phase 2: Back of the Envelope Estimation (5 minutes)
You: "Let me work through the numbers to understand the scale we're dealing with."
Index Size Estimation
DOCUMENT STORAGE
Pages to index: 50 billion
Average page size (compressed): 50 KB
Raw page storage: 50B × 50 KB = 2.5 PB
INVERTED INDEX
Unique terms: ~10 billion
Average postings per term: ~1 million
Posting entry size: 12 bytes (docID + position + metadata)
Inverted index size: 10B × 1M × 12 bytes = 120 PB
With compression (5x): ~25 PB
With replication (3x): ~75 PB total
FORWARD INDEX (document → terms)
Per-document metadata: 1 KB
Forward index: 50B × 1 KB = 50 TB
With replication: 150 TB
Query Traffic Estimation
QUERY VOLUME
Peak QPS: 100,000
Queries per day: ~5 billion
Average query length: 3.4 words
QUERY PROCESSING
Documents to score per query: ~10,000 candidates
Scoring time per doc: ~10 microseconds
Total scoring time: ~100ms
Results to return: 10-20 per page
BANDWIDTH
Query request size: ~500 bytes
Response size: ~50 KB
Peak bandwidth: 100K × 50 KB = 5 GB/s
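These figures are simple arithmetic. The short sketch below reproduces them; every constant is one of the assumptions stated above, and decimal units (1 KB = 1,000 bytes) are used to match the rounding.

# estimation/back_of_envelope.py
"""
Reproduces the back-of-envelope numbers above.
All constants are the stated assumptions, not measured values.
"""
PAGES = 50e9                  # 50 billion pages
PAGE_SIZE = 50e3              # 50 KB compressed, per page
UNIQUE_TERMS = 10e9           # 10 billion unique terms
POSTINGS_PER_TERM = 1e6       # ~1 million postings per term
POSTING_BYTES = 12            # docID + position + metadata
PEAK_QPS = 100_000
RESPONSE_BYTES = 50e3         # ~50 KB per response

PB, TB, GB = 1e15, 1e12, 1e9  # decimal units to match the rounding above

raw_pages = PAGES * PAGE_SIZE                                 # ~2.5 PB
inverted = UNIQUE_TERMS * POSTINGS_PER_TERM * POSTING_BYTES   # ~120 PB
compressed = inverted / 5                                     # ~25 PB (5x compression)
replicated = compressed * 3                                   # ~75 PB (3x replication)
forward = PAGES * 1e3                                         # ~50 TB (1 KB per doc)
bandwidth = PEAK_QPS * RESPONSE_BYTES                         # ~5 GB/s

print(f"Raw pages:      {raw_pages / PB:.1f} PB")
print(f"Inverted index: {inverted / PB:.0f} PB raw, "
      f"{compressed / PB:.0f} PB compressed, {replicated / PB:.0f} PB replicated")
print(f"Forward index:  {forward / TB:.0f} TB")
print(f"Peak bandwidth: {bandwidth / GB:.1f} GB/s")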
Infrastructure Estimation
INFRASTRUCTURE SUMMARY

STORAGE
├── Index storage: ~100 PB
├── Document storage: ~10 PB (compressed)
└── Metadata/logs: ~5 PB

COMPUTE
├── Index servers: ~10,000 machines
├── Crawlers: ~1,000 machines
├── Query servers: ~5,000 machines
└── Ranking/ML: ~2,000 machines

NETWORK
├── Internal bandwidth: 100+ Tbps
└── External bandwidth: 10+ Tbps

DATA CENTERS
└── Minimum: 5-7 regions globally
Phase 3: High-Level Design (10 minutes)
You: "Now let me sketch out the high-level architecture. A search engine has three major components: crawling, indexing, and serving."
System Architecture
SEARCH ENGINE ARCHITECTURE

CRAWLING PIPELINE
  URL Frontier ──▶ Crawler Workers ──▶ Content Parser ──▶ Page Store
        │
        ▼
INDEXING PIPELINE
  Document Parser ──▶ Tokenizer/Analyzer ──▶ Index Builder ──▶ Inverted Index
  (PageRank Compute runs alongside and feeds authority scores into the index)
        │
        ▼
SERVING LAYER
  Query Parser ──▶ Retrieval Engine ──▶ Ranking Engine ──▶ Results Server
Component Breakdown
1. Crawling Pipeline
Purpose: Continuously discover and download web pages from the internet.
Key components:
- URL Frontier: Priority queue of URLs to crawl, ordered by importance and freshness
- Crawler Workers: Distributed fetchers that download pages, respecting rate limits
- Content Parser: Extract text, links, and metadata from HTML
- Link Extractor: Discover new URLs to feed back to the frontier
2. Indexing Pipeline
Purpose: Process crawled pages and build searchable data structures.
Key components:
- Document Parser: Clean HTML, extract text, handle encoding
- Tokenizer/Analyzer: Split text into tokens, apply stemming, remove stopwords
- Index Builder: Construct inverted index data structures
- PageRank Compute: Calculate link-based authority scores
3. Serving Layer
Purpose: Handle user queries and return ranked results.
Key components:
- Query Parser: Parse query, expand synonyms, handle operators
- Retrieval Engine: Find candidate documents from inverted index
- Ranking Engine: Score and sort candidates by relevance
- Results Server: Format and return final results
Phase 4: Deep Dives (20 minutes)
Interviewer: "Great overview. Let's dive deeper into some challenging aspects. How does the inverted index actually work?"
Deep Dive 1: The Inverted Index (Week 1 - Partitioning & Sharding)
You: "The inverted index is the heart of any search engine. Let me explain how it works and how we scale it."
What is an Inverted Index?
INVERTED INDEX STRUCTURE
Traditional index (forward, document → words):
  Doc 1: "the quick brown fox jumps"
  Doc 2: "the lazy brown dog sleeps"
  Doc 3: "quick fox runs fast"

Inverted index (reverse, term → postings list of DocID:positions):
  "brown" → [(Doc1, [3]), (Doc2, [3])]
  "fox"   → [(Doc1, [4]), (Doc3, [2])]
  "quick" → [(Doc1, [2]), (Doc3, [1])]
  "lazy"  → [(Doc2, [2])]
To find "quick fox":
1. Look up "quick" β {Doc1, Doc3}
2. Look up "fox" β {Doc1, Doc3}
3. Intersect β {Doc1, Doc3}
4. Score and rank
Sharding the Index
You: "With 50 billion documents, growing toward hundreds of billions, and roughly 10 billion unique terms, we need to shard the index across thousands of machines."
INDEX SHARDING STRATEGIES
Option 1: Document-based sharding
  Each shard contains ALL terms for a SUBSET of documents.

  Shard 1: Docs 0-1M           Shard 2: Docs 1M-2M          Shard N: ...
    "apple"  → [...]             "apple"  → [...]
    "banana" → [...]             "banana" → [...]

  Query: search ALL shards in parallel, merge results
  Pro: each shard is self-contained
  Con: every query hits every shard
Option 2: Term-based sharding
  Each shard contains ALL documents for a SUBSET of terms.

  Shard A-F: Terms A-F          Shard G-L: Terms G-L
    "apple"  → all docs            "guitar" → all docs
    "banana" → all docs            "house"  → all docs

  Query: only hit the shards containing the query terms
  Pro: fewer shards per query
  Con: cross-shard coordination for scoring
Google uses: Document-based sharding (with tiered index)
Implementation
# inverted_index/index_shard.py
"""
Inverted Index Shard Implementation
Demonstrates the core data structures used in search engines.
"""
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from collections import defaultdict
import bisect
import math
@dataclass
class Posting:
"""A single posting entry in the inverted index."""
doc_id: int
term_frequency: int
positions: List[int] = field(default_factory=list)
@dataclass
class PostingsList:
"""List of all postings for a single term."""
term: str
doc_frequency: int
postings: List[Posting] = field(default_factory=list)
def intersect(self, other: 'PostingsList') -> 'PostingsList':
"""
Intersect two posting lists (for AND queries).
Uses merge algorithm - O(n + m).
"""
result = PostingsList(
term=f"{self.term} AND {other.term}",
doc_frequency=0,
postings=[]
)
i, j = 0, 0
while i < len(self.postings) and j < len(other.postings):
if self.postings[i].doc_id == other.postings[j].doc_id:
combined = Posting(
doc_id=self.postings[i].doc_id,
term_frequency=self.postings[i].term_frequency +
other.postings[j].term_frequency,
positions=sorted(self.postings[i].positions +
other.postings[j].positions)
)
result.postings.append(combined)
i += 1
j += 1
elif self.postings[i].doc_id < other.postings[j].doc_id:
i += 1
else:
j += 1
result.doc_frequency = len(result.postings)
return result
class InvertedIndexShard:
"""
A single shard of the inverted index.
In production:
- Each shard handles ~1-10M documents
- Index is memory-mapped for fast access
- Postings are sorted by doc_id for efficient intersection
"""
def __init__(self, shard_id: int, doc_range: tuple):
self.shard_id = shard_id
self.doc_range = doc_range
self.index: Dict[str, PostingsList] = {}
self.doc_count = 0
def add_document(self, doc_id: int, tokens: List[str]) -> None:
"""Index a single document."""
term_positions: Dict[str, List[int]] = defaultdict(list)
for position, token in enumerate(tokens):
term_positions[token].append(position)
for term, positions in term_positions.items():
if term not in self.index:
self.index[term] = PostingsList(term=term, doc_frequency=0)
posting = Posting(
doc_id=doc_id,
term_frequency=len(positions),
positions=positions
)
bisect.insort(
self.index[term].postings,
posting,
key=lambda p: p.doc_id
)
self.index[term].doc_frequency += 1
self.doc_count += 1
def search(self, query_terms: List[str]) -> PostingsList:
"""Search for documents containing all query terms."""
if not query_terms:
return PostingsList(term="", doc_frequency=0)
postings_lists = []
for term in query_terms:
pl = self.index.get(term)
if pl is None:
return PostingsList(term=term, doc_frequency=0)
postings_lists.append(pl)
# Sort by document frequency (shortest first)
postings_lists.sort(key=lambda pl: pl.doc_frequency)
# Intersect all posting lists
result = postings_lists[0]
for pl in postings_lists[1:]:
result = result.intersect(pl)
if result.doc_frequency == 0:
break
return result
def compute_idf(self, doc_frequency: int) -> float:
"""Compute Inverse Document Frequency."""
if doc_frequency == 0:
return 0.0
return math.log(self.doc_count / doc_frequency)
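To see how document-based sharding comes together at query time, here is a minimal scatter-gather sketch built on the InvertedIndexShard above. The ShardCoordinator name, the modulo routing, and the crude term-frequency score are illustrative choices, not part of the design itself; a production coordinator would fan out over RPC with per-shard timeouts and merge partial results.

# inverted_index/shard_coordinator.py
"""
Illustrative scatter-gather over document-sharded index shards.
A sketch only: a real coordinator fans out over RPC with timeouts.
"""
from typing import List, Tuple

from index_shard import InvertedIndexShard  # the shard class defined above


class ShardCoordinator:
    """Routes documents to shards and fans queries out to every shard."""

    def __init__(self, shards: List[InvertedIndexShard]):
        self.shards = shards

    def route_document(self, doc_id: int, tokens: List[str]) -> None:
        # Document-based sharding: each document lives on exactly one shard
        self.shards[doc_id % len(self.shards)].add_document(doc_id, tokens)

    def search(self, query_terms: List[str], top_k: int = 10) -> List[Tuple[int, int]]:
        candidates: List[Tuple[int, int]] = []
        for shard in self.shards:  # in production: parallel RPCs, not a loop
            result = shard.search(query_terms)
            for posting in result.postings:
                # Crude score: combined term frequency; BM25 happens later
                candidates.append((posting.doc_id, posting.term_frequency))
        # Merge step: global sort of the per-shard candidates
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        return candidates[:top_k]


if __name__ == "__main__":
    coordinator = ShardCoordinator([InvertedIndexShard(i, (0, 0)) for i in range(2)])
    coordinator.route_document(1, ["the", "quick", "brown", "fox", "jumps"])
    coordinator.route_document(2, ["the", "lazy", "brown", "dog", "sleeps"])
    coordinator.route_document(3, ["quick", "fox", "runs", "fast"])
    print(coordinator.search(["quick", "fox"]))  # both matches come from one shard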
Deep Dive 2: PageRank Algorithm (Week 5 - Consistency & Coordination)
Interviewer: "Tell me about PageRank. How does Google determine page importance?"
You: "PageRank is the algorithm that made Google famous. It's essentially a voting system where links are votes."
The PageRank Concept
PAGERANK: THE RANDOM SURFER MODEL
Imagine a random web surfer:
- Start on a random page
- Follow a random link on that page
- Repeat forever
PageRank = probability that the random surfer lands on a page
Pages with more incoming links → higher probability → higher PageRank
The Mathematics
PAGERANK FORMULA
PR(A) = (1-d)/N + d × Σ( PR(i) / L(i) )
Where:
PR(A) = PageRank of page A
d = Damping factor (typically 0.85)
N = Total number of pages
PR(i) = PageRank of pages linking to A
L(i) = Number of outgoing links from page i
The damping factor represents the probability that the random
surfer gets bored and jumps to a random page instead of following links.
Implementation
# ranking/pagerank.py
"""
PageRank Computation at Scale
Demonstrates how to compute PageRank for billions of pages.
"""
from dataclasses import dataclass
from typing import Dict, List, Optional
@dataclass
class WebPage:
"""Represents a web page in the link graph."""
url: str
outgoing_links: List[str]
    incoming_links: Optional[List[str]] = None
pagerank: float = 1.0
class PageRankComputer:
"""
Computes PageRank scores for a web graph.
At Google scale:
- Graph has 100+ billion nodes
- Computation takes days on thousands of machines
- Uses MapReduce for distributed iteration
"""
def __init__(self, damping_factor: float = 0.85, max_iterations: int = 100,
convergence_threshold: float = 1e-6):
self.damping_factor = damping_factor
self.max_iterations = max_iterations
self.convergence_threshold = convergence_threshold
self.pages: Dict[str, WebPage] = {}
self.num_pages = 0
def add_page(self, url: str, outgoing_links: List[str]) -> None:
"""Add a page to the graph."""
self.pages[url] = WebPage(url=url, outgoing_links=outgoing_links)
for link in outgoing_links:
if link not in self.pages:
self.pages[link] = WebPage(url=link, outgoing_links=[])
self.num_pages = len(self.pages)
def build_reverse_links(self) -> None:
"""Build incoming link lists for each page."""
for page in self.pages.values():
page.incoming_links = []
for url, page in self.pages.items():
for link in page.outgoing_links:
if link in self.pages:
self.pages[link].incoming_links.append(url)
def compute(self) -> Dict[str, float]:
"""
Compute PageRank using power iteration.
Algorithm:
1. Initialize all pages with equal rank (1/N)
2. For each iteration:
- Sum contributions from incoming links
- Apply damping factor
3. Repeat until convergence
"""
if not self.pages:
return {}
self.build_reverse_links()
# Initialize PageRank values
initial_rank = 1.0 / self.num_pages
for page in self.pages.values():
page.pagerank = initial_rank
# Iterative computation
for iteration in range(self.max_iterations):
new_ranks: Dict[str, float] = {}
# Handle dangling pages
dangling_sum = sum(
page.pagerank for page in self.pages.values()
if len(page.outgoing_links) == 0
)
dangling_contribution = dangling_sum / self.num_pages
# Compute new rank for each page
for url, page in self.pages.items():
new_rank = (1 - self.damping_factor) / self.num_pages
new_rank += self.damping_factor * dangling_contribution
for source_url in page.incoming_links:
source_page = self.pages[source_url]
num_outlinks = len(source_page.outgoing_links)
if num_outlinks > 0:
contribution = source_page.pagerank / num_outlinks
new_rank += self.damping_factor * contribution
new_ranks[url] = new_rank
# Check for convergence
max_diff = max(
abs(new_ranks[url] - self.pages[url].pagerank)
for url in self.pages
)
for url, rank in new_ranks.items():
self.pages[url].pagerank = rank
if max_diff < self.convergence_threshold:
print(f"Converged after {iteration + 1} iterations")
break
return {url: page.pagerank for url, page in self.pages.items()}
Deep Dive 3: Query Processing Pipeline (Week 2 - Timeouts & Latency)
Interviewer: "Walk me through what happens when a user types a query."
You: "Query processing is where all the latency pressure is. We have 500ms to do everything."
Query Processing Flow
QUERY PROCESSING TIMELINE (Target: <500ms)
User types: "best pizza near me"
STAGE 1: QUERY UNDERSTANDING (20ms budget)
  Spell check:  "piza" → "pizza"
  Tokenize:     split into words
  Synonyms:     best → top
  Intent:       local search

STAGE 2: RETRIEVAL (100ms budget)
  Query sent to all index shards in parallel.
  Intersect posting lists, return the top 1,000 candidates per shard.
  Total candidates: ~10,000 documents.

STAGE 3: SCORING (200ms budget)
  For each candidate, compute scores:
  - Text relevance (BM25):  weight 0.3
  - PageRank:               weight 0.2
  - Freshness:              weight 0.1
  - Location match:         weight 0.2
  - Click history:          weight 0.2
  Sort by final score, take the top 100.

STAGE 4: SNIPPET GENERATION (50ms budget)
  For the top 10 results:
  - Fetch document content (or a cached snippet)
  - Find the best passage containing the query terms
  - Highlight matching terms

STAGE 5: RESPONSE (30ms budget)
  Format and return the final results.

Total time: ~400ms (target <500ms)
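Stage 1 is mostly lookup tables and lightweight classifiers. Here is a minimal sketch of that stage; the spell-fix and synonym dictionaries are illustrative stand-ins for real models.

# query/understanding.py
"""
Minimal sketch of Stage 1 (query understanding).
The dictionaries below are illustrative stand-ins for real
spell-correction and synonym-expansion models.
"""
from typing import Dict, List

SPELL_FIXES: Dict[str, str] = {"piza": "pizza", "resturant": "restaurant"}
SYNONYMS: Dict[str, List[str]] = {"best": ["top"], "cheap": ["affordable"]}
LOCAL_INTENT_TRIGGERS = {"near", "nearby", "open"}


def understand_query(raw_query: str) -> Dict:
    """Normalize, spell-correct, expand, and classify a query."""
    tokens = raw_query.lower().split()
    corrected = [SPELL_FIXES.get(tok, tok) for tok in tokens]

    expanded = list(corrected)
    for tok in corrected:
        expanded.extend(SYNONYMS.get(tok, []))

    # Very rough intent detection: "near me" style queries go to local search
    is_local = bool(LOCAL_INTENT_TRIGGERS.intersection(corrected))

    return {
        "original": raw_query,
        "terms": corrected,
        "expanded_terms": expanded,
        "intent": "local" if is_local else "general",
    }


if __name__ == "__main__":
    print(understand_query("best piza near me"))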
BM25 Scoring Implementation
# query/scorer.py
"""
BM25 Ranking Algorithm
An improvement over TF-IDF that handles term frequency saturation
and document length normalization.
"""
import math
from typing import Dict, List, Optional
class BM25Scorer:
"""
BM25 scoring for search relevance.
BM25 is the industry standard for text relevance scoring.
It improves on TF-IDF by:
- Saturating term frequency (diminishing returns)
- Normalizing for document length
"""
K1 = 1.2 # Term frequency saturation parameter
B = 0.75 # Length normalization parameter
def __init__(self, avg_doc_length: float, total_docs: int):
self.avg_doc_length = avg_doc_length
self.total_docs = total_docs
def score(self, doc: Dict, query_terms: List[str]) -> float:
"""
Compute BM25 score for a document.
Args:
doc: Document with term_freqs, doc_length, doc_freqs
query_terms: List of query terms
Returns:
BM25 score (higher = more relevant)
"""
score = 0.0
doc_length = doc.get("doc_length", self.avg_doc_length)
term_freqs = doc.get("term_freqs", {})
for term in query_terms:
tf = term_freqs.get(term, 0)
if tf == 0:
continue
# IDF component
df = doc.get("doc_freqs", {}).get(term, 1)
idf = math.log((self.total_docs - df + 0.5) / (df + 0.5) + 1)
# TF component with saturation
numerator = tf * (self.K1 + 1)
denominator = tf + self.K1 * (
1 - self.B + self.B * (doc_length / self.avg_doc_length)
)
score += idf * (numerator / denominator)
return score
def compute_freshness(self, last_updated) -> float:
"""Compute freshness score with exponential decay."""
from datetime import datetime
if last_updated is None:
return 0.5
age_days = (datetime.utcnow() - last_updated).days
half_life = 30 # 30-day half-life
return math.exp(-0.693 * age_days / half_life)
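Stage 3 then blends BM25 with the other signals using the weights from the timeline above. A minimal sketch of that combination, assuming each signal has already been normalized to [0, 1]:

# query/final_ranker.py
"""
Sketch of the Stage 3 signal combination described above.
Weights mirror the timeline (BM25 0.3, PageRank 0.2, freshness 0.1,
location 0.2, clicks 0.2); the candidate dict layout is illustrative.
"""
from typing import Dict, List

WEIGHTS = {
    "bm25": 0.3,
    "pagerank": 0.2,
    "freshness": 0.1,
    "location": 0.2,
    "clicks": 0.2,
}


def final_score(signals: Dict[str, float]) -> float:
    """Weighted linear combination of normalized signals (each in [0, 1])."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def rank(candidates: List[Dict], top_k: int = 100) -> List[Dict]:
    """Sort the ~10,000 candidates by combined score, keep the top 100."""
    return sorted(candidates, key=lambda c: final_score(c["signals"]), reverse=True)[:top_k]


if __name__ == "__main__":
    docs = [
        {"doc_id": 1, "signals": {"bm25": 0.9, "pagerank": 0.4, "freshness": 0.2,
                                  "location": 0.8, "clicks": 0.6}},
        {"doc_id": 2, "signals": {"bm25": 0.7, "pagerank": 0.9, "freshness": 0.9,
                                  "location": 0.1, "clicks": 0.3}},
    ]
    print([d["doc_id"] for d in rank(docs, top_k=2)])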
Deep Dive 4: Web Crawling at Scale (Week 3 - Async Processing)
Interviewer: "How do you crawl billions of pages efficiently?"
You: "Crawling is fundamentally an async, distributed problem."
Crawling Architecture
WEB CRAWLING PIPELINE
URL FRONTIER
  Priority queue of URLs to crawl, ordered by:
  - Importance (PageRank of linking pages)
  - Freshness (time since last crawl)
  - Domain priority (news sites crawled more often)
        │
        ▼
CRAWLER WORKERS
  Thousands of distributed workers fetching pages in parallel.
  Politeness constraints:
  - Max 1 request per second per domain
  - Respect robots.txt
  - Respect the Crawl-Delay directive
        │
        ▼
CONTENT PROCESSOR
  Parse HTML → Extract links & text → Detect language → Store content
  New URLs are fed back into the URL Frontier.
Crawler Implementation
# crawler/web_crawler.py
"""
Distributed Web Crawler
Demonstrates key crawling concepts:
- Politeness (rate limiting per domain)
- Priority-based URL frontier
- Parallel fetching with async IO
"""
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set
from datetime import datetime
from urllib.parse import urlparse, urljoin
import asyncio
import heapq
@dataclass
class CrawlURL:
"""A URL in the crawl frontier."""
url: str
priority: float
last_crawl: Optional[datetime] = None
depth: int = 0
    def __lt__(self, other):
        # heapq is a min-heap; invert the comparison so higher priority pops first
        return self.priority > other.priority
class URLFrontier:
"""
Priority queue of URLs to crawl.
Key features:
- Priority-based ordering
- Deduplication
- Per-domain queuing for politeness
"""
def __init__(self, max_size: int = 10_000_000):
self.max_size = max_size
self.queue: List[CrawlURL] = []
self.seen_urls: Set[str] = set()
self.domain_last_access: Dict[str, datetime] = {}
def add(self, url: str, priority: float = 0.5, depth: int = 0) -> bool:
"""Add URL to frontier if not seen."""
url = url.lower().rstrip("/").split("#")[0]
if url in self.seen_urls:
return False
if len(self.seen_urls) >= self.max_size:
return False
crawl_url = CrawlURL(url=url, priority=priority, depth=depth)
heapq.heappush(self.queue, crawl_url)
self.seen_urls.add(url)
return True
    def get_next(self) -> Optional[CrawlURL]:
        """Get the next URL that respects politeness constraints."""
        now = datetime.utcnow()
        deferred: List[CrawlURL] = []
        chosen: Optional[CrawlURL] = None
        while self.queue:
            crawl_url = heapq.heappop(self.queue)
            domain = urlparse(crawl_url.url).netloc
            last_access = self.domain_last_access.get(domain)
            if last_access and (now - last_access).total_seconds() < 1.0:
                # Domain was hit < 1s ago: set it aside and try the next URL,
                # rather than re-pushing immediately (which would spin forever)
                deferred.append(crawl_url)
                continue
            self.domain_last_access[domain] = now
            chosen = crawl_url
            break
        # Put deferred URLs back so they are retried on a later call
        for url in deferred:
            heapq.heappush(self.queue, url)
        return chosen
class WebCrawler:
"""
Distributed web crawler with async IO.
"""
def __init__(self, num_workers: int = 100, max_pages: int = 1_000_000):
self.num_workers = num_workers
self.max_pages = max_pages
self.frontier = URLFrontier()
self.pages_crawled = 0
def add_seeds(self, seed_urls: List[str]) -> None:
"""Add seed URLs to start crawling."""
for url in seed_urls:
self.frontier.add(url, priority=1.0, depth=0)
async def crawl(self) -> None:
"""Start crawling with multiple workers."""
workers = [
asyncio.create_task(self._worker(i))
for i in range(self.num_workers)
]
await asyncio.gather(*workers)
async def _worker(self, worker_id: int) -> None:
"""Single crawler worker."""
import aiohttp
async with aiohttp.ClientSession(
headers={"User-Agent": "GoogleBot/2.1"}
) as session:
while self.pages_crawled < self.max_pages:
crawl_url = self.frontier.get_next()
if crawl_url is None:
await asyncio.sleep(0.1)
continue
try:
async with session.get(
crawl_url.url,
timeout=aiohttp.ClientTimeout(total=30)
) as response:
if response.status == 200:
content = await response.text()
links = self._extract_links(crawl_url.url, content)
for link in links:
self.frontier.add(
link, priority=0.5,
depth=crawl_url.depth + 1
)
self.pages_crawled += 1
except Exception:
pass
def _extract_links(self, base_url: str, html: str) -> List[str]:
"""Extract links from HTML content."""
import re
links = []
href_pattern = re.compile(r'href=["\']([^"\']+)["\']')
for match in href_pattern.finditer(html):
href = match.group(1)
if href.startswith(("javascript:", "mailto:", "#")):
continue
absolute_url = urljoin(base_url, href)
if absolute_url.startswith(("http://", "https://")):
links.append(absolute_url)
return links
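The politeness rules call for honoring robots.txt and Crawl-Delay, which the worker above does not yet check. Below is a minimal sketch using Python's standard urllib.robotparser; the per-domain caching policy and the user-agent string are illustrative assumptions.

# crawler/robots.py
"""
Sketch of robots.txt handling for the crawler above.
Uses only the standard library; the caching policy is illustrative.
"""
from typing import Dict, Optional
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleSearchBot/1.0"  # illustrative user agent


class RobotsCache:
    """Fetches and caches one robots.txt parser per domain."""

    def __init__(self):
        self._parsers: Dict[str, robotparser.RobotFileParser] = {}

    def _parser_for(self, url: str) -> robotparser.RobotFileParser:
        domain = urlparse(url).netloc
        if domain not in self._parsers:
            rp = robotparser.RobotFileParser()
            rp.set_url(f"https://{domain}/robots.txt")
            try:
                rp.read()          # network fetch of robots.txt
            except OSError:
                rp.parse([])       # unreachable: fall back to empty rules
            self._parsers[domain] = rp
        return self._parsers[domain]

    def can_fetch(self, url: str) -> bool:
        """True if robots.txt allows our user agent to fetch this URL."""
        return self._parser_for(url).can_fetch(USER_AGENT, url)

    def crawl_delay(self, url: str) -> Optional[float]:
        """Crawl-Delay directive in seconds, if the site specifies one."""
        delay = self._parser_for(url).crawl_delay(USER_AGENT)
        return float(delay) if delay is not None else None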
Phase 5: Scaling and Edge Cases (5 minutes)
Interviewer: "How would you scale this to handle 10x traffic?"
Scaling Strategy
SCALING FROM 10K TO 100K QPS
QUERY PATH
  Current: 10K QPS → Target: 100K QPS
  Strategy:
  1. Add more query servers (10x)
  2. Increase index shard replicas (3x → 10x)
  3. Add a query result cache (hit rate ~30%)
  4. Deploy in more regions (latency optimization)

INDEX PATH
  Current: 50B documents → Target: 500B documents
  Strategy:
  1. Add more index shards (10x)
  2. Tiered storage (hot/warm/cold)
  3. Better compression (reduce storage cost)
Edge Cases
Edge Case 1: Query Spikes
Scenario: Breaking news causes 10x traffic spike
Problem: Index shards for hot terms get overwhelmed
Solution:
- Query result caching with short TTL (sketched below)
- Replicate hot index shards dynamically
- Graceful degradation (return slightly stale results)
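A minimal sketch of that short-TTL result cache follows; it is single-process for illustration, whereas production would use a distributed cache.

# serving/query_cache.py
"""
Sketch of a query result cache with a short TTL, as suggested for
traffic spikes. Single-process illustration only.
"""
import time
from typing import Dict, List, Optional, Tuple


class QueryResultCache:
    """Maps a normalized query string to cached results for ttl seconds."""

    def __init__(self, ttl_seconds: float = 60.0, max_entries: int = 100_000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store: Dict[str, Tuple[float, List[dict]]] = {}

    def get(self, query: str) -> Optional[List[dict]]:
        entry = self._store.get(query)
        if entry is None:
            return None
        stored_at, results = entry
        if time.time() - stored_at > self.ttl:
            # Stale: drop it so spikes re-fetch at most once per TTL window
            del self._store[query]
            return None
        return results

    def put(self, query: str, results: List[dict]) -> None:
        if len(self._store) >= self.max_entries:
            # Crude eviction: drop the oldest entry
            oldest = min(self._store, key=lambda q: self._store[q][0])
            del self._store[oldest]
        self._store[query] = (time.time(), results)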
Edge Case 2: Index Freshness
Scenario: Celebrity dies, but search shows them as "alive"
Problem: Cached PageRank and index data is stale
Solution:
- Breaking news detection pipeline
- Fast-path indexing for high-priority content
- Real-time signals from trusted sources
Edge Case 3: Spam Attacks
Scenario: SEO farms create millions of spam pages
Problem: Index quality degrades
Solution:
- Link graph analysis (detect link farms)
- Content fingerprinting (detect duplicates)
- ML spam classifiers
Edge Case 4: Crawl Traps
Scenario: Website generates infinite URLs
Problem: Crawler gets stuck, wastes resources
Solution:
- Detect URL patterns generating infinite content
- Limit crawl depth per domain (a simple guard is sketched below)
- Track unique content ratio
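A minimal sketch of such guards, combining a depth limit with a per-domain URL budget and a crude URL-shape heuristic; all thresholds are illustrative.

# crawler/trap_guard.py
"""
Sketch of simple crawl-trap defenses: a per-domain depth limit and a
per-domain URL budget. Thresholds are illustrative.
"""
from collections import defaultdict
from typing import Dict
from urllib.parse import urlparse

MAX_DEPTH = 16                 # illustrative depth cutoff
MAX_URLS_PER_DOMAIN = 50_000   # illustrative per-domain budget


class TrapGuard:
    """Rejects URLs that look like they come from an infinite URL space."""

    def __init__(self):
        self.urls_per_domain: Dict[str, int] = defaultdict(int)

    def allow(self, url: str, depth: int) -> bool:
        if depth > MAX_DEPTH:
            return False
        parsed = urlparse(url)
        # Heuristic: very long query strings or very deep paths often signal
        # calendar pages, session IDs, or other generated URL spaces
        if len(parsed.query) > 200 or parsed.path.count("/") > 12:
            return False
        if self.urls_per_domain[parsed.netloc] >= MAX_URLS_PER_DOMAIN:
            return False
        self.urls_per_domain[parsed.netloc] += 1
        return True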
Phase 6: Monitoring and Operations
Monitoring Dashboard
SEARCH ENGINE MONITORING

QUERY PERFORMANCE
  QPS:           87,432 /sec     (87%)
  Latency p50:   145 ms          (29%)
  Latency p99:   412 ms          (82%)
  Error rate:    0.02%           (OK)

INDEX HEALTH
  Indexed docs:              48.2B           (96%)
  Index freshness:           2.3 hours avg   (OK)
  Shard replicas healthy:    99.97%          (OK)

QUALITY SIGNALS
  Click-through rate:    34.2%   (+2%)
  Spam detection rate:   99.3%   (OK)
Alerting Strategy
CRITICAL (Page immediately):
- Query latency p99 > 1s for > 2 min
- Error rate > 1%
- Index shard unavailable
WARNING (Slack notification):
- Query latency p99 > 500ms for > 5 min
- Index freshness > 12 hours
- Click-through rate drops > 10%
Interview Conclusion
Interviewer: "Excellent work. You've demonstrated strong understanding of distributed systems, information retrieval, and the specific challenges of building a search engine. Any questions?"
You: "Thank you! I'd love to hear more about how your team handles the cold start problem for new content β how quickly can fresh content appear in search results?"
Summary: Concepts Applied
CONCEPTS FROM THE 10-WEEK COURSE IN GOOGLE SEARCH

WEEK 1: DATA AT SCALE
├── Partitioning: index sharded by document ID
├── Replication: each shard replicated 3-10x
└── Hot keys: popular queries cached, hot shards replicated

WEEK 2: FAILURE-FIRST DESIGN
├── Timeout budgets: query stages have strict time limits
├── Graceful degradation: return partial results on timeout
└── Circuit breakers: stop sending to unhealthy shards

WEEK 3: MESSAGING & ASYNC
├── Async processing: crawler uses async IO
├── Rate limiting: per-domain crawl limits
└── Backpressure: URL frontier limits prevent exhaustion

WEEK 5: CONSISTENCY & COORDINATION
├── Eventual consistency: index updates over minutes/hours
└── Distributed computation: PageRank via MapReduce

WEEK 10: PRODUCTION READINESS
├── SLOs: query latency p99 < 500ms, 99.99% availability
└── Observability: QPS, latency, error rates, quality signals
WHY GOOGLE SEARCH IS AN ENGINEERING MARVEL

SCALE
  • 8.5+ billion queries per day answered in milliseconds
  • 400+ billion documents indexed and searchable
  • 130+ trillion URLs known to exist on the web

LATENCY
  • Sub-500ms response time for complex queries
  • Results from 10,000+ index shards merged in real time

QUALITY
  • PageRank revolutionized web search quality
  • 200+ ranking signals for relevance
  • 99%+ spam detection rate

"Organizing the world's information and making it universally
accessible and useful", achieved at unprecedented scale.
Sources
Official Documentation:
- Google: How Search Works - https://www.google.com/search/howsearchworks/
- Google Search Central - https://developers.google.com/search/docs
Statistics:
- DemandSage: Google Search Statistics 2025 - https://www.demandsage.com/google-search-statistics/
- Search Engine Land: Google 5 Trillion Searches/Year - https://searchengineland.com/google-5-trillion-searches-per-year-452928
- Zyppy: Google Index Size (400B docs) - https://zyppy.com/seo/google-index-size/
Architecture:
- Original Google Paper - http://infolab.stanford.edu/~backrub/google.html
- PageRank Algorithm - https://en.wikipedia.org/wiki/PageRank
- Google Data Centers - https://en.wikipedia.org/wiki/Google_data_centers
Further Reading
Books:
- "Introduction to Information Retrieval" by Manning, Raghavan, SchΓΌtze
- "The Google Story" by David Vise
Papers:
- "The PageRank Citation Ranking" - Page & Brin, 1998
- "MapReduce: Simplified Data Processing" - Dean & Ghemawat, 2004
- "The Google File System" - Ghemawat et al., 2003
Self-Assessment Checklist
After studying this case study, you should be able to:
- Explain how an inverted index works
- Describe the PageRank algorithm
- Design a distributed crawling system
- Explain how to shard an index across thousands of machines
- Describe the query processing pipeline with timeout budgets
- Calculate back-of-envelope estimates for a search engine
- Explain BM25 scoring
- Describe strategies for index freshness
- Design monitoring and alerting for search quality
Appendix: Advanced Topics in Search Engine Design

D. HITS Algorithm (Hubs and Authorities)

- Authority: Page with good content (many pages link TO it)
- Hub: Page that links to good content (links TO many authorities)
Mutually reinforcing:
- Good hubs link to good authorities
- Good authorities are linked from good hubs
Algorithm (sketched in code below):
- Initialize all pages with authority=1, hub=1
- Authority update: auth(p) = Σ hub(q) for all q linking to p
- Hub update: hub(p) = Σ auth(q) for all q that p links to
- Normalize scores
- Repeat until convergence
Use case:
- Query-specific ranking (PageRank is query-independent)
- Find topic experts and resource pages
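A compact sketch of these update rules; the hits function and the example graph are illustrative.

# ranking/hits.py
"""
Sketch of the HITS (Hyperlink-Induced Topic Search) update rules
described above. Run on a small, query-specific subgraph.
"""
import math
from typing import Dict, List, Tuple


def hits(links: Dict[str, List[str]], iterations: int = 50) -> Dict[str, Tuple[float, float]]:
    """Return {page: (authority, hub)} for a graph given as page -> outgoing links."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking to p
        new_auth = {p: 0.0 for p in pages}
        for src, dsts in links.items():
            for dst in dsts:
                new_auth[dst] += hub[src]
        # Hub update: sum of (updated) authority scores of pages p links to
        new_hub = {p: 0.0 for p in pages}
        for src, dsts in links.items():
            for dst in dsts:
                new_hub[src] += new_auth[dst]
        # Normalize so scores stay bounded
        auth_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in new_auth.items()}
        hub = {p: v / hub_norm for p, v in new_hub.items()}

    return {p: (auth[p], hub[p]) for p in pages}


if __name__ == "__main__":
    graph = {"hub1": ["siteA", "siteB"], "hub2": ["siteA"], "siteB": ["siteA"]}
    for page, (a, h) in hits(graph).items():
        print(f"{page}: authority={a:.2f} hub={h:.2f}")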
E. Real-Time Search Signals
Query Click Data
CLICK DATA FOR RANKING
Google uses billions of search clicks to improve ranking:
Signals extracted:
CLICK-THROUGH RATE (CTR)
  Higher CTR = more relevant for that query.
  Adjusted for position (top results naturally get more clicks).

DWELL TIME
  Time spent on the page after clicking.
  Long dwell = the user found what they wanted.
  Short dwell + back button = poor result ("pogo-sticking").

QUERY REFORMULATION
  The user searches again with different terms?
  The original results were probably poor.

LAST CLICK
  The final result the user clicked (and didn't come back from).
  Strong signal of satisfaction.
Privacy considerations:
- Data is aggregated (no individual tracking)
- Minimum threshold of queries before using signal
- Used for ranking, not targeting
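The position adjustment mentioned above is typically done by dividing observed CTR by an expected CTR for that slot. A minimal sketch; the position priors are made-up illustrative numbers, not real figures.

# signals/position_ctr.py
"""
Sketch of position-normalized CTR. The position priors below are
illustrative, not real search-engine numbers.
"""
from typing import Dict

# Expected CTR by result position (illustrative prior)
POSITION_PRIOR: Dict[int, float] = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}


def adjusted_ctr(clicks: int, impressions: int, position: int) -> float:
    """
    Observed CTR divided by the expected CTR for that position.
    > 1.0 means the result outperforms its slot; < 1.0 means it underperforms.
    """
    if impressions == 0:
        return 0.0
    observed = clicks / impressions
    expected = POSITION_PRIOR.get(position, 0.03)
    return observed / expected


if __name__ == "__main__":
    # A result in slot 3 that earns a slot-1-like CTR stands out
    print(round(adjusted_ctr(clicks=250, impressions=1000, position=3), 2))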
Breaking News Detection
BREAKING NEWS PIPELINE
Goal: Index critical news within minutes
BREAKING NEWS ARCHITECTURE

SIGNAL SOURCES
  Twitter Trends, Google Trends, News Feeds, Google Alerts
        │
        ▼
BREAKING NEWS DETECTOR
  • Spike detection in query volume
  • New entity mentions (people, places, events)
  • Sentiment shift detection
  • Cross-source corroboration
        │
        ▼
PRIORITY CRAWL QUEUE
  • Jump the queue for news sources
  • Crawl trusted sources first (AP, Reuters, BBC)
  • Multiple crawls as the story develops
        │
        ▼
REAL-TIME INDEX (TIER 0)
  • Bypasses the normal index build pipeline
  • Direct injection into the serving index
  • Searchable within 2-5 minutes
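Spike detection in query volume can be as simple as comparing the current rate against a rolling baseline. A minimal sketch; the z-score threshold and window size are illustrative choices.

# signals/spike_detector.py
"""
Sketch of query-volume spike detection using a rolling mean and
standard deviation. The z-score threshold is illustrative.
"""
import statistics
from collections import deque
from typing import Deque


class SpikeDetector:
    """Flags a query whose current volume is far above its recent baseline."""

    def __init__(self, window: int = 60, z_threshold: float = 4.0):
        self.counts: Deque[int] = deque(maxlen=window)  # e.g. per-minute counts
        self.z_threshold = z_threshold

    def observe(self, count: int) -> bool:
        """Record this interval's query count; return True if it is a spike."""
        is_spike = False
        if len(self.counts) >= 10:  # need some history before judging
            mean = statistics.mean(self.counts)
            stdev = statistics.pstdev(self.counts) or 1.0
            is_spike = (count - mean) / stdev > self.z_threshold
        self.counts.append(count)
        return is_spike


if __name__ == "__main__":
    detector = SpikeDetector()
    for minute_count in [100, 110, 95, 105, 98, 102, 97, 103, 99, 101, 104, 2500]:
        if detector.observe(minute_count):
            print(f"Spike detected at {minute_count} queries/minute")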
F. Geographic and Personalized Search
Geolocation in Search
LOCATION-AWARE SEARCH
User query: "pizza"
Without location: Generic results about pizza
With location (user in NYC):
1. Detect user location (IP, GPS, profile)
2. Expand query: "pizza" → "pizza near [lat, lng]"
3. Boost local business results
4. Show map pack with nearby options
Location signals:
EXPLICIT
  • Query contains a location: "pizza in brooklyn"
  • User profile location setting
  • GPS from a mobile device

IMPLICIT
  • IP geolocation (~city-level accuracy)
  • WiFi network location
  • Search history patterns
  • Time zone from the browser
Search Personalization
PERSONALIZATION SIGNALS
Google personalizes results based on:
1. SEARCH HISTORY
- Previous queries indicate interests
- "java" for programmer β programming results
- "java" for traveler β Indonesia results
2. LOCATION HISTORY
- Frequent locations affect local results
- Home vs work vs travel
3. DEVICE TYPE
- Mobile: Emphasize mobile-friendly pages
- Desktop: May show more detailed results
4. LANGUAGE PREFERENCES
- Browser language
- Past language selections
5. GOOGLE ACCOUNT DATA
- Gmail (travel confirmations, purchases)
- Calendar (upcoming events)
- YouTube (interests)
Privacy controls:
- Users can disable personalization
- Incognito mode = no personalization
- Data can be deleted from account
G. Search Quality Evaluation
Human Raters and E-E-A-T
SEARCH QUALITY EVALUATION
Google employs 10,000+ human "Search Quality Raters" who:
- Evaluate search results for quality
- Don't directly affect rankings
- Train ML models and validate algorithm changes
E-E-A-T Framework (Experience, Expertise, Authoritativeness, Trustworthiness):
EXPERIENCE
  Does the content creator have first-hand experience?
  Example: a product review from someone who actually used it.

EXPERTISE
  Does the creator have relevant knowledge or skill?
  Example: medical advice from a doctor.

AUTHORITATIVENESS
  Is the creator or site recognized as a go-to source?
  Example: a government site for tax information.

TRUSTWORTHINESS
  Is the content accurate, honest, and safe?
  Example: secure site, accurate facts, transparent authorship.
YMYL (Your Money or Your Life) pages:
- Health, finance, safety, legal topics
- Held to higher E-E-A-T standards
- Misinformation can cause real harm
A/B Testing Search Changes
SEARCH EXPERIMENT FRAMEWORK
Every ranking change is A/B tested:
EXPERIMENT SETUP
  • Control group: current algorithm
  • Treatment group: new algorithm
  • Random 1-5% of traffic routed to treatment

METRICS MEASURED
  • Click-through rate (CTR)
  • Time to first click
  • Pogo-stick rate (quick backs)
  • Query abandonment rate
  • Long clicks (dwell time > 30s)
  • Query reformulation rate

LAUNCH CRITERIA
  • Statistically significant improvement
  • No regression on key metrics
  • Positive human rater evaluation
  • No increase in policy violations

Google runs 10,000+ experiments per year;
only ~2,000 changes actually launch.
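For a rate metric like CTR, "statistically significant improvement" is commonly checked with a two-proportion z-test. A minimal sketch; the 95% cutoff is the usual statistical convention, not a Google-specific rule.

# experiments/ab_significance.py
"""
Sketch of a two-proportion z-test for comparing control vs treatment
CTR in a search experiment. The 1.96 cutoff corresponds to ~95% confidence.
"""
import math


def ctr_z_score(clicks_a: int, impressions_a: int,
                clicks_b: int, impressions_b: int) -> float:
    """Z-score for the difference between two click-through rates."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    # Pooled proportion under the null hypothesis (no difference)
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    return (p_b - p_a) / se if se > 0 else 0.0


if __name__ == "__main__":
    # Control: 34.2% CTR; treatment: 34.9% CTR, over 1M impressions each
    z = ctr_z_score(342_000, 1_000_000, 349_000, 1_000_000)
    print(f"z = {z:.2f}, significant at 95%: {abs(z) > 1.96}")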
H. The Future: AI in Search
Neural Ranking Models
EVOLUTION OF RANKING
Traditional (2000s):
  TF-IDF + PageRank + hand-crafted features
  → Linear combination of ~50 signals

Learning to Rank (2010s):
  Gradient boosted trees (LambdaMART)
  → Nonlinear combination of ~200 signals

Neural Ranking (2020s):
  BERT, T5, and transformer models
  → Semantic understanding of queries and documents

BERT IN SEARCH

Query: "can you get medicine for someone pharmacy"

Before BERT:
  • Matches keywords: medicine, pharmacy
  • Misses the intent: picking up a prescription for another person

With BERT:
  • Understands "for someone" = on behalf of another person
  • Returns results about pharmacy pickup policies

Key insight: BERT understands prepositions and context.
Generative AI Search (AI Overviews)
AI OVERVIEWS (2024-2025)
Google now generates AI summaries for many queries:
Query: "how to remove coffee stains from carpet"
Traditional results:
1. [Link to WikiHow article]
2. [Link to cleaning blog]
3. [Link to product review]
AI Overview:
"To remove coffee stains from carpet:
1. Blot (don't rub) excess liquid immediately
2. Mix 1 tablespoon dish soap + 1 tablespoon vinegar + 2 cups water
3. Apply solution to stain, blot with clean cloth
4. Rinse with cold water and blot dry
[Sources: WikiHow, Good Housekeeping, The Spruce]"
Challenges:
• Attribution and copyright
• Accuracy verification
• Impact on publisher traffic
• Hallucination prevention
• ~30x more compute per query
This comprehensive appendix covers advanced topics in search engine design that demonstrate the depth and complexity of building a system like Google Search. These concepts build on the foundational Week 1-10 curriculum and show how theory translates to production systems at massive scale.
End of Bonus Problem 5: Google Search
Document Statistics:
- Total concepts covered: 25+
- Code implementations: 5
- Architecture diagrams: 15+
- Real-world scale numbers: 50+