Design: Web Crawler
The Problem
A web crawler (or spider) is a system that systematically browses the World Wide Web, downloading pages for indexing, archiving, or analysis. Search engines like Google, Bing, and DuckDuckGo depend on crawlers to discover and update billions of pages. The engineering challenge is immense: the web has over 4.5 billion indexed pages, each page can link to hundreds of others, and the whole corpus is in constant flux — pages are created, updated, and deleted every second.
Building a crawler that handles this at scale requires careful design of its URL scheduling (the frontier), politeness mechanisms, deduplication layers, and distributed architecture.
Requirements
Functional Requirements
- Crawl 1 billion pages per month (~386 pages/second sustained).
- Discover new pages by following hyperlinks from seed URLs.
- Respect `robots.txt` directives and `Crawl-delay` headers.
- Detect and avoid duplicate URLs and duplicate content.
- Store downloaded content for downstream indexing/processing.
- Re-crawl pages based on change frequency and importance.
Non-Functional Requirements
- Politeness: Never overwhelm a host — max 1 request/second per domain.
- Scalability: Horizontally scalable to handle 10× traffic surges.
- Robustness: Handle malformed HTML, infinite URL spaces, spider traps, and network failures gracefully.
- Extensibility: Pluggable modules for content extraction, link filtering, and priority scoring.
- Efficiency: Minimize DNS lookups, re-downloads, and wasted bandwidth.
Back-of-the-Envelope Estimation
| Metric | Value |
|---|---|
| Pages / month | 1,000,000,000 |
| Pages / second | ~386 (1B / 30 days / 86400s) |
| Avg page size | ~100 KB (HTML only) |
| Monthly bandwidth | ~100 TB (1B × 100 KB) |
| Peak pages / second (5×) | ~1,930 |
| Unique URLs stored | ~10 billion (seen URLs) |
| URL storage (avg 100 bytes × 10B) | ~1 TB |
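The arithmetic behind these rows is worth writing out once, in the same JavaScript used throughout this post:

// Sanity-checking the estimates above
const pagesPerMonth = 1e9;
const pagesPerSec = pagesPerMonth / (30 * 86400);         // ≈ 385.8
const peakPagesPerSec = pagesPerSec * 5;                  // ≈ 1,929
const monthlyBandwidthTB = pagesPerMonth * 100e3 / 1e12;  // 100 KB avg → ~100 TB
const urlStorageTB = 10e9 * 100 / 1e12;                   // 10B URLs × 100 B → ~1 TB
console.log({ pagesPerSec, peakPagesPerSec, monthlyBandwidthTB, urlStorageTB });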
High-Level Architecture
The crawler follows a loop:
Seed URLs
│
▼
┌─────────────────┐ ┌──────────────┐ ┌──────────────┐
│ URL Frontier │────▶│ DNS Resolver │────▶│ Fetcher │
│ (Priority Queue │ │ (Cached) │ │ (HTTP Client)│
│ + Politeness Q) │ └──────────────┘ └──────┬───────┘
└────────▲─────────┘ │
│ ▼
│ ┌──────────────────┐
│ │ Content Parser │
│ │ (HTML → DOM tree) │
│ └──────┬───────────┘
│ │
│ ▼
│ ┌──────────────────┐
│ │ URL Extractor │
│ │ (href, canonical) │
│ └──────┬───────────┘
│ │
│ ┌────────────────┐ │
│ │ Content Store │◀─────┐ ▼
│ │ (S3 / HDFS) │ │ ┌───────────────┐
│ └────────────────┘ │ │ Dedup Layer │
│ │ │ (Bloom Filter │
└─────────────────────────────────┤ │ + SimHash) │
│ └───────┬───────┘
│ │
└──────────┘
new URLs → frontier
content → store
Each component is designed to be independently scalable. The URL Frontier is the heart of the system — it decides what to crawl and when. The Dedup Layer prevents wasted work. The Fetcher is the I/O-heavy component that benefits most from horizontal scaling.
URL Frontier Design
The URL Frontier is the most critical component. It must answer two questions simultaneously: "What is the most important URL to crawl next?" (priority) and "Can we crawl this host right now without being rude?" (politeness). These are in tension — the highest-priority URL might belong to a host we just hit 200ms ago.
Priority Queue
URLs are scored and inserted into one of N priority buckets (e.g., N = 256). Each bucket is a FIFO queue. Higher-priority buckets are drained first.
Priority scoring factors:
- PageRank / link authority: Pages linked by many high-quality pages get higher priority.
- Freshness: Pages that change frequently (news homepages, stock tickers) are re-crawled sooner.
- Depth: URLs discovered at shallower BFS depth are prioritized (they tend to be more important).
- Domain importance: `wikipedia.org` ranks higher than a random personal blog.
- Content type hint: `.html` pages are prioritized over `.pdf` or `.zip` files.
// Simplified priority assignment: the raw score lands in [0, 100],
// so it is scaled into the 256 buckets before clamping
function prioritize(url, metadata) {
  let score = 0;
  score += metadata.pageRank * 40;         // 0–40 points
  score += metadata.domainAuthority * 20;  // 0–20 points
  score += (10 - metadata.depth) * 3;      // 0–30: shallower = higher
  score += metadata.changeFrequency * 10;  // 0–10: more changes = higher
  return Math.min(255, Math.max(0, Math.floor(score * 2.55)));
}

// Insert into bucket
frontier.buckets[prioritize(url, meta)].enqueue(url);
Politeness Queue
Once a URL is selected from a priority bucket, it enters the politeness layer. This layer maintains one queue per domain. A fetcher thread can only pull from a domain queue if sufficient time has elapsed since the last request to that domain.
// Per-domain queue structure
domainQueues = {
  "example.com": Queue([url1, url4, url9]),
  "wikipedia.org": Queue([url2, url7]),
  "news.ycombinator.com": Queue([url3]),
  // ...
};

// Rate limit: respect robots.txt Crawl-delay or default 1 req/sec
// (all times in epoch milliseconds)
domainTimers = {
  "example.com":   { lastFetch: 1714000000000, delay: 1000 },
  "wikipedia.org": { lastFetch: 1714000001000, delay: 5000 }, // Crawl-delay: 5
  // ...
};

// Fetcher picks next available domain (linear scan shown for clarity;
// the heap-based selector described below makes this O(log n))
function pickNext() {
  for (let domain in domainQueues) {
    if (now() - domainTimers[domain].lastFetch >= domainTimers[domain].delay) {
      let url = domainQueues[domain].dequeue();
      domainTimers[domain].lastFetch = now();
      return url;
    }
  }
  return null; // all domains on cooldown
}
Data Structures
At 10 billion URLs, the frontier can't fit entirely in RAM. A practical approach:
- Front queues (priority): In-memory priority buckets holding the next ~1M URLs. Refilled from disk when drained.
- Back queues (politeness): In-memory per-domain queues, lazily loaded. Only active domains are in memory.
- Disk backing: A RocksDB or LevelDB instance stores the full frontier. URLs are batch-loaded into memory queues.
- Heap-based domain selector: A min-heap keyed by `lastFetch + delay` tells the fetcher which domain is next eligible (see the sketch below).
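A minimal sketch of that selector follows. The heap stores one entry per active domain, keyed by the moment the domain next becomes eligible; the sift operations are written inline so the block is self-contained.

// Min-heap domain selector: pop the domain that becomes crawlable soonest,
// instead of scanning every per-domain queue on each fetch
class DomainSelector {
  constructor() {
    this.heap = []; // entries: { domain, nextEligible } (epoch ms)
  }
  push(entry) {
    this.heap.push(entry);
    let i = this.heap.length - 1;
    while (i > 0) { // sift up
      const parent = (i - 1) >> 1;
      if (this.heap[parent].nextEligible <= this.heap[i].nextEligible) break;
      [this.heap[parent], this.heap[i]] = [this.heap[i], this.heap[parent]];
      i = parent;
    }
  }
  pop() {
    const top = this.heap[0];
    const last = this.heap.pop();
    if (this.heap.length > 0) {
      this.heap[0] = last;
      let i = 0; // sift down
      while (true) {
        const l = 2 * i + 1, r = 2 * i + 2;
        let min = i;
        if (l < this.heap.length && this.heap[l].nextEligible < this.heap[min].nextEligible) min = l;
        if (r < this.heap.length && this.heap[r].nextEligible < this.heap[min].nextEligible) min = r;
        if (min === i) break;
        [this.heap[min], this.heap[i]] = [this.heap[i], this.heap[min]];
        i = min;
      }
    }
    return top;
  }
  // Returns the next crawlable domain, or null if all are on cooldown.
  // After fetching, the caller re-pushes { domain, nextEligible: now + delay }.
  nextDomain(now) {
    if (this.heap.length === 0 || this.heap[0].nextEligible > now) return null;
    return this.pop();
  }
}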
┌────────────────────────────────────────────────────┐
│ URL FRONTIER │
│ │
│ ┌─── Front Queues (Priority) ───┐ │
│ │ Bucket 255 ▓▓▓▓▓ (highest) │ │
│ │ Bucket 254 ▓▓▓ │ │
│ │ Bucket 253 ▓▓▓▓ │ │
│ │ ... │ │
│ │ Bucket 0 ▓ (lowest) │ │
│ └────────────┬──────────────────┘ │
│ │ drain by priority │
│ ▼ │
│ ┌─── Back Queues (Politeness) ──┐ │
│ │ example.com [url1, url4] │ │
│ │ wikipedia.org [url2] │ │
│ │ github.com [url5, url8] │ │
│ └────────────┬──────────────────┘ │
│ │ pick domain with │
│ │ earliest eligibility │
│ ▼ │
│ ┌──────────┐ │
│ │ Fetcher │ │
│ └──────────┘ │
└────────────────────────────────────────────────────┘
BFS vs DFS Traversal
The web is a directed graph. How we traverse it matters enormously:
| Aspect | BFS (Breadth-First) | DFS (Depth-First) |
|---|---|---|
| Discovery pattern | Explores all links at depth d before d+1 | Dives deep into one path before backtracking |
| Page importance | Finds high-authority pages first (homepages, nav pages) | May get stuck on deep, low-value paths |
| Politeness | Naturally spreads requests across many domains | Hammers one domain before moving on |
| Spider trap risk | Low — breadth limits depth exposure | High — can get trapped in infinite URL spaces |
| Memory | Higher (stores entire frontier at current depth) | Lower (only current path in stack) |
BFS is strongly preferred for web crawling because it discovers important pages early, distributes load across domains naturally, and is resistant to spider traps. In practice, crawlers use a priority-weighted BFS where the frontier is a priority queue rather than a strict FIFO.
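As a sketch of how priority-weighted BFS differs from the strict FIFO version, the loop below always drains the highest non-empty bucket. `fetchPage`, `extractLinks`, `depthOf`, and `lookupMetadata` are assumed helpers; `prioritize` is the scoring function from the frontier section.

// Priority-weighted BFS: strict BFS would use one FIFO queue; here the 256
// buckets drain highest-priority-first, and depth still feeds into the score,
// preserving the breadth-first bias toward shallow, high-authority pages
async function crawlLoop(frontier) {
  while (true) {
    let url = null;
    for (let b = 255; b >= 0; b--) { // highest non-empty bucket
      if (!frontier.buckets[b].isEmpty()) { url = frontier.buckets[b].dequeue(); break; }
    }
    if (url === null) break;            // frontier exhausted
    const page = await fetchPage(url);  // via the politeness layer
    for (const link of extractLinks(page)) {
      const meta = { ...lookupMetadata(link), depth: depthOf(url) + 1 };
      frontier.buckets[prioritize(link, meta)].enqueue(link);
    }
  }
}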
Robots.txt Parsing
The Robots Exclusion Protocol (RFC 9309) is the web's social contract between crawlers and site owners. A robots.txt file at the domain root tells crawlers what they may and may not access.
File Format
# Example: https://example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?q=
Allow: /admin/public/
Crawl-delay: 2
User-agent: Googlebot
Disallow: /no-google/
Allow: /
User-agent: BadBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
Parsing Rules
- User-agent matching: Find the most specific `User-agent` group that matches your crawler's name. If none matches, fall back to `User-agent: *`.
- Allow vs Disallow: For a given path, the longest matching rule wins. `Allow: /admin/public/` overrides `Disallow: /admin/` for paths under `/admin/public/`.
- Crawl-delay: The number of seconds to wait between successive requests to this domain. Not all crawlers respect this (Google ignores it), but a polite crawler should honor it.
- Sitemap: Points to an XML sitemap for URL discovery. Crawlers should fetch and parse sitemaps to find pages that aren't linked from anywhere.
- Wildcards: `*` matches any sequence of characters; `$` matches the end of the URL. E.g., `Disallow: /*.pdf$` blocks all PDF files.
// Robots.txt parser (simplified)
class RobotsParser {
constructor(robotsTxt, userAgent) {
this.rules = [];
this.crawlDelay = 1; // default 1 second
this.sitemaps = [];
this.parse(robotsTxt, userAgent);
}
parse(text, ua) {
  // Rules are collected per group; a matching specific group takes
  // precedence over the wildcard (*) group, per RFC 9309
  const specific = { rules: [], delay: null, matched: false };
  const wildcard = { rules: [], delay: null, matched: false };
  let current = [];      // groups the record being parsed applies to
  let inHeader = false;  // consecutive User-agent lines share one record
  for (let line of text.split('\n')) {
    line = line.split('#')[0].trim(); // strip comments
    if (!line) continue;
    // Split on the FIRST colon only, so URLs in Sitemap lines stay intact
    const idx = line.indexOf(':');
    if (idx < 0) continue;
    const directive = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (directive === 'user-agent') {
      if (!inHeader) { current = []; inHeader = true; } // a new record starts
      if (value.toLowerCase() === ua.toLowerCase()) { current.push(specific); specific.matched = true; }
      else if (value === '*') { current.push(wildcard); wildcard.matched = true; }
    } else if (directive === 'sitemap') {
      this.sitemaps.push(value); // Sitemap applies regardless of group
    } else {
      inHeader = false;
      for (const group of current) {
        if (directive === 'disallow' && value) group.rules.push({ path: value, allowed: false });
        else if (directive === 'allow' && value) group.rules.push({ path: value, allowed: true });
        else if (directive === 'crawl-delay') group.delay = parseFloat(value) || 1;
      }
    }
  }
  const active = specific.matched ? specific : wildcard;
  this.rules = active.rules;
  if (active.delay !== null) this.crawlDelay = active.delay;
  // Sort by path length descending (longest match first)
  this.rules.sort((a, b) => b.path.length - a.path.length);
}
isAllowed(urlPath) {
for (let rule of this.rules) {
if (this.matchPath(urlPath, rule.path)) {
return rule.allowed;
}
}
return true; // no matching rule = allowed
}
matchPath(urlPath, pattern) {
// Convert robots.txt pattern to regex
let regex = pattern
.replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape special chars
.replace(/\*/g, '.*') // * → .*
.replace(/\\\$$/g, '$'); // trailing \$ → $
return new RegExp('^' + regex).test(urlPath);
}
}
Caching robots.txt
Fetching robots.txt before every page is wasteful. The crawler caches it per domain with a TTL (typically 24 hours or per the HTTP Cache-Control header). A background thread refreshes expiring entries. If robots.txt returns a 4xx, the entire domain is assumed allowed. If it returns a 5xx, the crawler should back off and retry later.
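A sketch of that cache, under the policies just described; `fetchRobotsTxt` is an assumed HTTP helper returning `{ status, body }`, and `RobotsParser` is the class above.

// Per-domain robots.txt cache: 24h default TTL, 4xx → allow-all,
// 5xx → serve stale if available and retry sooner
class RobotsCache {
  constructor(userAgent, ttlMs = 24 * 3600 * 1000) {
    this.userAgent = userAgent;
    this.ttlMs = ttlMs;
    this.entries = new Map(); // domain → { parser, expiry }
  }
  async get(domain) {
    const cached = this.entries.get(domain);
    if (cached && cached.expiry > Date.now()) return cached.parser;
    const res = await fetchRobotsTxt(domain); // assumed: { status, body }
    if (res.status >= 500) {
      if (cached) return cached.parser; // keep stale entry on server errors
      const parser = new RobotsParser('', this.userAgent); // no rules yet
      this.entries.set(domain, { parser, expiry: Date.now() + 3600 * 1000 });
      return parser;
    }
    // 2xx: parse the file; 4xx: empty file → everything allowed
    const parser = new RobotsParser(res.status < 400 ? res.body : '', this.userAgent);
    this.entries.set(domain, { parser, expiry: Date.now() + this.ttlMs });
    return parser;
  }
}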
Duplicate Detection
Duplicates waste bandwidth, storage, and processing time. There are two kinds:
URL Deduplication
The same page can be reached through many different URLs:
- `http://example.com/page` vs `https://example.com/page` (scheme)
- `example.com/page/` vs `example.com/page` (trailing slash)
- `example.com/page?utm_source=twitter` vs `example.com/page` (tracking params)
- `example.com/Page` vs `example.com/page` (case sensitivity)
URL canonicalization normalizes URLs before dedup:
function canonicalize(url) {
  let u = new URL(url);
  // 1–2. Lowercase scheme/host and drop default ports — the WHATWG URL
  // parser does both automatically; shown because other URL libraries don't
  u.protocol = u.protocol.toLowerCase();
  u.hostname = u.hostname.toLowerCase();
  // 3. Remove fragment
  u.hash = '';
  // 4. Remove tracking parameters
  let strip = ['utm_source','utm_medium','utm_campaign','utm_term','utm_content','fbclid','gclid'];
  strip.forEach(p => u.searchParams.delete(p));
  // 5. Sort remaining params
  u.searchParams.sort();
  // 6. Remove trailing slash (except root)
  if (u.pathname !== '/' && u.pathname.endsWith('/')) {
    u.pathname = u.pathname.slice(0, -1);
  }
  // 7. Decode unnecessary percent-encoding (guarded: malformed sequences
  // throw, and blanket decoding can change meaning, e.g. %2F → '/';
  // a production canonicalizer decodes only unreserved characters)
  try { u.pathname = decodeURIComponent(u.pathname); } catch (e) { /* leave encoded */ }
  return u.toString();
}
After canonicalization, we check membership using a Bloom filter:
// Bloom filter for URL dedup
// 10 billion URLs, 1% false positive rate → ~11.5 GB memory
// k = 7 hash functions, m = 96 billion bits
class URLBloomFilter {
constructor(expectedItems = 10_000_000_000, fpRate = 0.01) {
this.m = Math.ceil(-expectedItems * Math.log(fpRate) / (Math.log(2) ** 2));
this.k = Math.round((this.m / expectedItems) * Math.log(2));
this.bits = new BitArray(this.m); // ~11.5 GB
}
add(url) {
for (let i = 0; i < this.k; i++) {
this.bits.set(this.hash(url, i) % this.m);
}
}
mightContain(url) {
for (let i = 0; i < this.k; i++) {
if (!this.bits.get(this.hash(url, i) % this.m)) return false;
}
return true; // possibly seen (small false positive rate)
}
hash(url, seed) {
return murmurHash3(url, seed); // fast, well-distributed hash
}
}
Content Deduplication (SimHash)
Different URLs can serve identical or near-identical content (mirrors, syndicated articles, paginated duplicates). Byte-level hashing (MD5/SHA) catches exact duplicates but misses near-duplicates that differ by ads, timestamps, or minor layout changes.
SimHash (invented by Moses Charikar, used by Google) produces a fingerprint where similar documents produce similar hashes. Two documents are near-duplicates if their SimHash values differ by only a few bits (Hamming distance ≤ 3).
SimHash Algorithm
function simHash(document, hashBits = 64) {
// Step 1: Extract features (weighted shingles)
let tokens = tokenize(document);
let shingles = [];
for (let i = 0; i <= tokens.length - 3; i++) {
shingles.push(tokens.slice(i, i + 3).join(' ')); // 3-word shingles
}
// Step 2: Initialize a V vector of size hashBits, all zeros
let V = new Array(hashBits).fill(0);
// Step 3: For each shingle, compute its hash and update V
for (let shingle of shingles) {
let weight = getWeight(shingle); // TF-IDF or frequency
let h = hash64(shingle); // 64-bit hash
for (let i = 0; i < hashBits; i++) {
if ((h >> BigInt(i)) & 1n) {
V[i] += weight; // bit is 1 → add weight
} else {
V[i] -= weight; // bit is 0 → subtract weight
}
}
}
// Step 4: Convert V to fingerprint
let fingerprint = 0n;
for (let i = 0; i < hashBits; i++) {
if (V[i] > 0) {
fingerprint |= (1n << BigInt(i));
}
}
return fingerprint;
}
// Hamming distance between two SimHash fingerprints
function hammingDistance(a, b) {
let xor = a ^ b;
let count = 0;
while (xor) {
count += Number(xor & 1n);
xor >>= 1n;
}
return count;
}
// Near-duplicate if Hamming distance ≤ 3
function isNearDuplicate(fp1, fp2) {
return hammingDistance(fp1, fp2) <= 3;
}
Efficient SimHash Lookup
Checking a new fingerprint against billions of stored fingerprints for Hamming distance ≤ 3 is expensive. The table-based approach partitions the 64-bit hash into blocks and uses multiple lookup tables:
// Partition 64-bit SimHash into 4 blocks of 16 bits each
// Store in 4 tables, keyed by each block
// If two hashes differ by ≤3 bits total, at least ONE of the
// 4 blocks must be identical (pigeonhole principle)
tables = [HashMap(), HashMap(), HashMap(), HashMap()];
function store(fingerprint) {
for (let i = 0; i < 4; i++) {
let block = (fingerprint >> BigInt(i * 16)) & 0xFFFFn;
tables[i].getOrCreate(block).add(fingerprint);
}
}
function findNearDuplicates(fingerprint) {
let candidates = new Set();
for (let i = 0; i < 4; i++) {
let block = (fingerprint >> BigInt(i * 16)) & 0xFFFFn;
for (let stored of tables[i].get(block, [])) {
if (hammingDistance(fingerprint, stored) <= 3) {
candidates.add(stored);
}
}
}
return candidates;
}
MinHash for Jaccard Similarity
MinHash is an alternative for set-similarity detection. It estimates Jaccard similarity between two documents' shingle sets without comparing them directly.
function minHash(shingleSet, numHashes = 200) {
let signature = new Array(numHashes).fill(Infinity);
for (let shingle of shingleSet) {
for (let i = 0; i < numHashes; i++) {
let h = hashWithSeed(shingle, i);
signature[i] = Math.min(signature[i], h);
}
}
return signature;
}
// Estimated Jaccard similarity
function jaccardEstimate(sig1, sig2) {
let matches = 0;
for (let i = 0; i < sig1.length; i++) {
if (sig1[i] === sig2[i]) matches++;
}
return matches / sig1.length;
}
// Near-duplicate if Jaccard ≥ 0.8
function isNearDup(sig1, sig2) {
return jaccardEstimate(sig1, sig2) >= 0.8;
}
DNS Resolution Caching
DNS resolution is a hidden bottleneck. Each fetch requires resolving the hostname to an IP address. Standard OS DNS caches are too small for a crawler hitting millions of unique domains.
Crawler DNS Strategy
- Local DNS cache: An in-process LRU cache holding ~1M domain → IP mappings. TTL respects the DNS record's TTL (usually 5 minutes to 24 hours).
- Dedicated DNS resolver: Run a local `unbound` or `dnsmasq` instance with a large cache (1–2 GB RAM). This avoids saturating upstream DNS servers.
- Batch prefetching: When loading URLs from the frontier into the politeness queue, resolve DNS for all of them in parallel using async I/O.
- Fallback: If resolution fails, retry with a different DNS provider (Google DNS 8.8.8.8, Cloudflare 1.1.1.1). After 3 failures, mark the domain as unreachable and skip its URLs for 24 hours.
const dns = require('node:dns').promises;

class DNSCache {
  constructor(maxEntries = 1_000_000) {
    this.cache = new LRUCache(maxEntries);      // assumed LRU implementation
    this.negativeCache = new LRUCache(100_000); // failed lookups
  }
  async resolve(hostname) {
    // Check positive cache
    let cached = this.cache.get(hostname);
    if (cached && cached.expiry > Date.now()) return cached.ip;
    // Check negative cache (don't retry too soon)
    let neg = this.negativeCache.get(hostname);
    if (neg && neg.retryAfter > Date.now()) return null;
    try {
      // { ttl: true } makes resolve4 return [{ address, ttl }] records
      let [record] = await dns.resolve4(hostname, { ttl: true });
      this.cache.set(hostname, {
        ip: record.address,
        expiry: Date.now() + record.ttl * 1000
      });
      return record.address;
    } catch (err) {
      this.negativeCache.set(hostname, {
        retryAfter: Date.now() + 24 * 3600 * 1000 // 24h backoff
      });
      return null;
    }
  }
}
Content Parsing & Link Extraction
After fetching a page, the crawler must extract useful information and discover new URLs.
Parsing Pipeline
function processPage(url, html, headers) {
// 1. Charset detection & decoding
let charset = detectCharset(headers['content-type'], html);
let text = decode(html, charset);
// 2. Parse HTML into DOM
let dom = parseHTML(text); // using a lenient parser (like htmlparser2)
  // 3. Check for meta robots directives
  let followLinks = true;
  let metaRobots = dom.querySelector('meta[name="robots"]');
  if (metaRobots) {
    let content = (metaRobots.getAttribute('content') || '').toLowerCase();
    if (content.includes('noindex')) skipIndexing(url);
    if (content.includes('nofollow')) followLinks = false; // don't extract links
  }
  // 4. Extract canonical URL (the raw attribute may be relative)
  let canonical = dom.querySelector('link[rel="canonical"]');
  if (canonical) {
    let canonicalUrl = resolveUrl(url, canonical.getAttribute('href'));
    if (canonicalUrl !== url) {
      // This page declares itself a duplicate of the canonical URL
      markAsDuplicate(url, canonicalUrl);
    }
  }
  // 5. Extract links (skipped entirely when meta robots says nofollow)
  let links = [];
  if (followLinks) {
    for (let anchor of dom.querySelectorAll('a[href]')) {
      let href = anchor.getAttribute('href');
      let rel = anchor.getAttribute('rel') || '';
      // Skip nofollow links
      if (rel.includes('nofollow')) continue;
      // Resolve relative URLs
      let absoluteUrl = resolveUrl(url, href);
      // Filter: only http/https, skip mailto:, javascript:, etc.
      if (absoluteUrl.startsWith('http://') || absoluteUrl.startsWith('https://')) {
        links.push(canonicalize(absoluteUrl));
      }
    }
  }
// 6. Extract metadata
let metadata = {
title: dom.querySelector('title')?.textContent,
description: dom.querySelector('meta[name="description"]')?.content,
language: dom.documentElement.getAttribute('lang'),
lastModified: headers['last-modified'],
};
return { links, metadata, text: extractText(dom) };
}
Handling Different Content Types
| Content-Type | Action |
|---|---|
| `text/html` | Full parsing, link extraction, indexing |
| `application/pdf` | PDF text extraction (Apache Tika), index but no link extraction |
| `application/xml` | Check if sitemap — parse for URLs if so |
| `image/*`, `video/*` | Skip or store metadata only (alt text from referring page) |
| `application/javascript` | Skip — no crawlable content |
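This table translates directly into a dispatch on the normalized `Content-Type` header. In the sketch below, `extractPdfText`, `looksLikeSitemap`, `parseSitemapUrls`, and `storeMetadataOnly` are illustrative handler names:

// Dispatch on Content-Type; drop the "; charset=..." suffix before matching
function dispatchByContentType(url, contentType, body, headers) {
  const type = (contentType || '').split(';')[0].trim().toLowerCase();
  if (type === 'text/html') return processPage(url, body, headers);
  if (type === 'application/pdf') return extractPdfText(url, body); // e.g. via Tika
  if (type === 'application/xml' || type === 'text/xml') {
    return looksLikeSitemap(body) ? parseSitemapUrls(body) : null;
  }
  if (type.startsWith('image/') || type.startsWith('video/')) {
    return storeMetadataOnly(url);
  }
  return null; // application/javascript and everything else: skip
}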
Storage Architecture
URL Database
Tracks every URL the crawler has seen, with metadata:
CREATE TABLE urls (
  url_hash BIGINT PRIMARY KEY, -- 64-bit hash of canonical URL
  canonical_url TEXT NOT NULL,
  domain TEXT NOT NULL,
  last_crawled TIMESTAMP,
  next_crawl TIMESTAMP,        -- scheduled recrawl time
  http_status SMALLINT,
  content_hash BIGINT,         -- SimHash fingerprint
  priority SMALLINT DEFAULT 0,
  crawl_count INT DEFAULT 0,
  change_freq INTERVAL         -- estimated change frequency
);

-- PostgreSQL defines indexes outside CREATE TABLE
CREATE INDEX idx_next_crawl ON urls (next_crawl);
CREATE INDEX idx_domain ON urls (domain);
With 10 billion rows, this table is sharded by url_hash across multiple database nodes. Each shard holds ~500M–1B rows.
Content Store
Raw HTML and extracted text are stored in an object store (S3, HDFS, or Google Cloud Storage):
// Content storage layout
s3://crawler-content/
├── raw/ # Raw HTML responses
│ ├── 2026/04/15/
│ │ ├── {url_hash_1}.html.gz # gzipped HTML
│ │ ├── {url_hash_2}.html.gz
│ │ └── ...
├── parsed/ # Extracted text + metadata (JSON)
│ ├── 2026/04/15/
│ │ ├── {url_hash_1}.json.gz
│ │ └── ...
└── index/ # Manifest files for batch processing
├── 2026-04-15-manifest.jsonl
└── ...
At 100 KB average page size × 1B pages/month = ~100 TB/month. With gzip compression (~5:1 ratio), actual storage is ~20 TB/month. Lifecycle policies automatically tier older data to cheaper storage (S3 Glacier) after 90 days.
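The 90-day Glacier transition can be expressed as a standard S3 lifecycle rule. A sketch using the AWS SDK for JavaScript v3, with illustrative bucket and rule names:

// Tier raw crawl data to Glacier after 90 days (ESM module, top-level await)
import { S3Client, PutBucketLifecycleConfigurationCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: 'crawler-content',
  LifecycleConfiguration: {
    Rules: [{
      ID: 'tier-raw-to-glacier',
      Status: 'Enabled',
      Filter: { Prefix: 'raw/' }, // only raw HTML; parsed data stays hot
      Transitions: [{ Days: 90, StorageClass: 'GLACIER' }],
    }],
  },
}));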
Distributed Crawling
A single machine can fetch perhaps 50–100 pages/second. At 386 pages/second sustained, we need at least 4–8 crawler nodes. At peak (1,930 pages/sec), we need 20–40 nodes.
Partitioning Strategy
Two main approaches:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| URL hash partition | node = hash(url) % N | Even distribution, simple | Same domain spread across nodes — harder to enforce per-domain politeness |
| Domain-based partition | node = hash(domain) % N | All URLs for a domain on one node — easy politeness enforcement | Skewed load (wikipedia.org has vastly more URLs than a personal blog) |
Domain-based partitioning is preferred because it localizes politeness state (robots.txt cache, crawl-delay timers, per-domain queues) to a single node, eliminating distributed coordination.
Coordination Architecture
┌───────────────────────────────────────────────┐
│ Coordinator Service │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Assigner │ │ Monitor │ │ Rebalancer│ │
│ └───────────┘ └───────────┘ └───────────┘ │
└───────────────────────┬───────────────────────┘
│ domain assignments
┌─────────────┼─────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Crawler │ │ Crawler │ │ Crawler │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ │ │ │ │ │
│ domains: │ │ domains: │ │ domains: │
│ A-F hash │ │ G-M hash │ │ N-Z hash │
│ range │ │ range │ │ range │
└────────────┘ └────────────┘ └────────────┘
│ │ │
▼ ▼ ▼
┌────────────────────────────────────────┐
│ Shared URL Database (sharded) │
└────────────────────────────────────────┘
┌────────────────────────────────────────┐
│ Content Store (S3 / HDFS) │
└────────────────────────────────────────┘
- Assigner: Maps domain hash ranges to crawler nodes using consistent hashing (see the ring sketch after this list).
- Monitor: Tracks node health via heartbeats. Detects failed nodes within 30 seconds.
- Rebalancer: When a node fails or is added, reassigns domain ranges with minimal disruption.
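A minimal consistent-hash ring for the Assigner might look like the following; `hash32` is an assumed 32-bit string hash (e.g., FNV-1a), and virtual nodes smooth out the skew between large and small domain ranges.

// Consistent-hash ring for domain → node assignment. Unlike hash(domain) % N,
// adding or removing a node only remaps the keys on that node's arcs.
class ConsistentHashRing {
  constructor(nodes, vnodes = 100) {
    this.ring = []; // sorted [{ point, node }]
    for (const node of nodes) {
      for (let v = 0; v < vnodes; v++) {
        this.ring.push({ point: hash32(`${node}#${v}`), node });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }
  // The first ring point clockwise from hash(domain) owns the domain
  getNode(domain) {
    const h = hash32(domain);
    let lo = 0, hi = this.ring.length - 1;
    if (h > this.ring[hi].point) return this.ring[0].node; // wrap around
    while (lo < hi) { // binary search for first point >= h
      const mid = (lo + hi) >> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo].node;
  }
}

const ring = new ConsistentHashRing(['node1', 'node2', 'node3']);
ring.getNode('wikipedia.org'); // → owning crawler node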
Cross-Node URL Exchange
When Crawler Node 1 finds a link to github.com (owned by Node 3), it must route the URL to the correct node:
// After extracting links from a page (ring = the consistent-hash ring above)
for (let link of extractedLinks) {
  let targetNode = ring.getNode(getDomain(link)); // ring lookup, not hash % N
  if (targetNode === myNodeId) {
    localFrontier.add(link);               // local — add directly
  } else {
    outboundBuffers[targetNode].add(link); // remote — buffer
  }
}

// Flush buffers periodically (every 1 second or 1,000 URLs)
for (let node in outboundBuffers) {
  if (outboundBuffers[node].size > 1000 || timeSinceFlush(node) > 1000) {
    sendBatch(node, outboundBuffers[node].drain());
  }
}
Recrawl Strategy
Not all pages change at the same rate. An effective recrawl policy prioritizes pages that are likely to have changed since the last visit.
Change Frequency Estimation
- Historical change rate: If a page changed in 3 of the last 5 crawls, it changes ~60% of the time. Recrawl more often.
- HTTP headers: `Last-Modified` and `ETag` allow conditional GETs (`If-Modified-Since`), saving bandwidth when the page hasn't changed.
- Page type heuristic: News homepages change every few minutes. Product pages change weekly. Archive pages change yearly.
- Sitemap `<changefreq>`: Site owners declare expected change frequency (hourly, daily, weekly, etc.).
// Adaptive recrawl scheduling (intervals in milliseconds)
const HOUR = 3600 * 1000, DAY = 24 * HOUR;

function scheduleRecrawl(url, crawlHistory) {
  let changes = crawlHistory.filter(c => c.changed).length;
  let total = crawlHistory.length;
  let changeRate = total > 0 ? changes / total : 0.5; // default 50%
  // Exponential backoff: pages that rarely change get crawled less often
  let baseInterval;
  if (changeRate > 0.8) baseInterval = 1 * HOUR;       // very dynamic
  else if (changeRate > 0.5) baseInterval = 6 * HOUR;
  else if (changeRate > 0.2) baseInterval = 1 * DAY;
  else if (changeRate > 0.05) baseInterval = 7 * DAY;
  else baseInterval = 30 * DAY;                        // nearly static
  // Adjust by page importance (PageRank)
  let importance = getPageRank(url);
  let interval = baseInterval / (1 + importance); // important pages crawled sooner
  return Date.now() + interval;
}
Recrawls should send `If-Modified-Since` and `If-None-Match` headers. If the server returns `304 Not Modified`, you save bandwidth and know the content hasn't changed, without downloading the full page.
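A sketch of such a conditional fetch using the standard Fetch API, where `etag` and `lastModified` come from the URL database row for this page:

// Conditional GET: a 304 response carries headers only, no body
async function conditionalFetch(url, etag, lastModified) {
  const headers = {};
  if (etag) headers['If-None-Match'] = etag;
  if (lastModified) headers['If-Modified-Since'] = lastModified;
  const res = await fetch(url, { headers });
  if (res.status === 304) {
    return { changed: false, body: null };
  }
  return {
    changed: true,
    body: await res.text(),
    etag: res.headers.get('etag'),                  // store for the next crawl
    lastModified: res.headers.get('last-modified'),
  };
}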
Handling Spider Traps
Spider traps are pages (intentional or accidental) that generate an infinite number of unique URLs, causing the crawler to waste resources endlessly.
Common Trap Types
- Infinite calendar: `/calendar?date=2026-04-15` links to `?date=2026-04-16`, then `?date=2026-04-17`, forever.
- Session IDs in URLs: `/page?sid=abc123` — each visit generates a new session ID, creating a new "unique" URL.
- Soft 404s: Server returns 200 OK for any URL path (`/anything/you/type`), making the URL space infinite.
- Deliberately hostile: `<a href="/trap/<random_string>">` — a page that generates links to random sub-paths.
- Parameterized pagination: `/search?page=1`, `?page=2`, ... `?page=999999999`.
Defense Mechanisms
- Max depth limit: Don't follow links beyond depth 15–20 from seed URLs. Deep pages are rarely valuable.
- Per-domain URL budget: Cap the number of URLs crawled per domain per cycle (e.g., 100,000). Prevents any single domain from consuming the entire crawl budget.
- URL pattern detection: If we see many URLs matching the same pattern with only a changing parameter (e.g., `/calendar?date=*`), sample a few and stop.
- URL length limit: URLs longer than 2,048 characters are suspicious. Skip them.
- Blacklisting: Maintain a manual blacklist of known trap domains, updated by operators.
// Spider trap detection
class TrapDetector {
constructor() {
this.domainStats = new Map(); // domain → { count, patterns, hashes }
}
check(url, domain, contentHash) {
let stats = this.domainStats.get(domain) || { count: 0, patterns: {}, recentHashes: [] };
stats.count++;
// Check 1: Per-domain URL budget
if (stats.count > 100_000) {
return { isTrap: true, reason: 'URL budget exceeded' };
}
// Check 2: URL pattern clustering
let pattern = url.replace(/[0-9]+/g, '{N}')
.replace(/[a-f0-9]{8,}/gi, '{HASH}');
stats.patterns[pattern] = (stats.patterns[pattern] || 0) + 1;
if (stats.patterns[pattern] > 1000) {
return { isTrap: true, reason: `Pattern "${pattern}" seen 1000+ times` };
}
// Check 3: Content similarity
stats.recentHashes.push(contentHash);
if (stats.recentHashes.length > 10) {
stats.recentHashes.shift();
let allSame = stats.recentHashes.every(h =>
hammingDistance(h, stats.recentHashes[0]) <= 3
);
if (allSame) {
return { isTrap: true, reason: 'Last 10 pages are near-identical' };
}
}
this.domainStats.set(domain, stats);
return { isTrap: false };
}
}
JavaScript Rendering
Modern single-page applications (React, Angular, Vue) render content client-side. A traditional crawler that only reads raw HTML will see an empty <div id="root"></div> with no links or content.
Rendering Approaches
| Approach | Description | Latency | Cost |
|---|---|---|---|
| Static HTML only | Parse raw HTML as-is | ~10ms | Very low |
| Headless browser | Chromium (Puppeteer) executes JS, waits for DOM | 2–10s | Very high (CPU + RAM) |
| Lightweight JS engine | V8 isolate without full browser (no layout/paint) | 500ms–2s | Medium |
| Hybrid | Try static first; if minimal content detected, re-queue for JS rendering | Varies | Optimal |
The hybrid approach is used by Google's crawler (Googlebot). It first processes the raw HTML (fast, cheap). If the page contains framework markers (<div id="root">, <app-root>, noscript tags with content hints), it's re-queued for a separate rendering service that runs headless Chrome at scale.
// Hybrid rendering decision
function needsJSRendering(html) {
// Check for SPA framework signatures
let spaMarkers = [
'<div id="root"></div>',
'<div id="app"></div>',
'<app-root>',
'window.__NEXT_DATA__',
'window.__NUXT__',
];
let hasMarker = spaMarkers.some(m => html.includes(m));
// Check if meaningful content exists
let textContent = stripTags(html).trim();
let hasContent = textContent.length > 500; // at least 500 chars of text
// Needs JS rendering if SPA markers present AND no meaningful static content
return hasMarker && !hasContent;
}
// Rendering service (Puppeteer pool)
async function renderWithJS(url) {
let browser = await browserPool.acquire();
try {
let page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', req => {
// Block images, fonts, stylesheets — only need HTML + JS
if (['image','stylesheet','font','media'].includes(req.resourceType())) {
req.abort();
} else {
req.continue();
}
});
await page.goto(url, { waitUntil: 'networkidle2', timeout: 15000 });
let html = await page.content();
await page.close();
return html;
} finally {
browserPool.release(browser);
}
}
Complete Pipeline Summary
┌──────────────────────────────────────────────────────────────────────┐
│ WEB CRAWLER PIPELINE │
│ │
│ ┌──────┐ ┌───────────┐ ┌────────────┐ ┌──────────────┐ │
│ │ Seed │──▶│ URL │──▶│ robots.txt │──▶│ DNS │ │
│ │ URLs │ │ Frontier │ │ Checker │ │ Resolver │ │
│ └──────┘ │ │ └────────────┘ │ (cached) │ │
│ │ Priority │ └──────┬───────┘ │
│ │ + Polite │ │ │
│ └─────▲─────┘ ▼ │
│ │ ┌──────────────┐ │
│ │ │ Fetcher │ │
│ │ │ (HTTP/1.1 │ │
│ │ │ HTTP/2) │ │
│ │ └──────┬───────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ JS Render? │ │
│ │ │ (if needed) │ │
│ │ └──────┬───────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────┐ ┌──────────────┐ │
│ │ │ Content │◀────│ Content │ │
│ │ │ Store │ │ Parser │ │
│ │ │ (S3) │ └──────┬───────┘ │
│ │ └─────────┘ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ URL │ │
│ │ │ Extractor │ │
│ │ └──────┬───────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ Dedup │ │
│ │ │ (Bloom + │ │
│ │ new URLs │ SimHash) │ │
│ └─────────────────────────┤ │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Key Operational Metrics
- Pages crawled / second — primary throughput metric.
- URL frontier size — if growing unboundedly, something is wrong (possibly a spider trap).
- Duplicate rate — percentage of URLs/content filtered by dedup. Should be 30–60%.
- Error rate — 4xx/5xx responses, DNS failures, timeouts. Alert if > 5%.
- Politeness compliance — verify no domain receives > 1 req/sec (or its crawl-delay).
- Freshness — average age of content in the index. Lower is better for news; acceptable for archives.
- Rendering queue depth — how many pages are waiting for JS rendering. Monitor for backlog.
Suggested Technology Stack
| Component | Technology |
|---|---|
| URL Frontier | RocksDB (disk) + in-memory priority heaps |
| URL Dedup | Bloom filter (in-memory, 12 GB) |
| Content Dedup | SimHash stored in Redis / Cassandra |
| DNS Cache | Local unbound + in-process LRU |
| Fetcher | Async HTTP client (Netty / aiohttp / Go net/http) |
| HTML Parser | htmlparser2 (JS) / jsoup (Java) / lxml (Python) |
| JS Rendering | Puppeteer / Playwright pool (headless Chromium) |
| Content Store | Amazon S3 / HDFS |
| URL Database | Sharded PostgreSQL or Cassandra |
| Cross-Node Messaging | Kafka / gRPC streams |
| Coordination | ZooKeeper / etcd for leader election + config |
In the next post, we'll design a Payment System — handling money with correctness guarantees, idempotency, and distributed transaction management.