Design: Web Crawler
The Problem
A web crawler (or spider) is a system that systematically browses the World Wide Web, downloading pages for indexing, archiving, or analysis. Search engines like Google, Bing, and DuckDuckGo depend on crawlers to discover and update billions of pages. The engineering challenge is immense: the web has over 4.5 billion indexed pages, each page can link to hundreds of others, and the whole corpus is in constant flux — pages are created, updated, and deleted every second.
Building a crawler that handles this at scale requires careful design of its URL scheduling (the frontier), politeness mechanisms, deduplication layers, and distributed architecture.
Requirements
Functional Requirements
- Crawl 1 billion pages per month (~386 pages/second sustained).
- Discover new pages by following hyperlinks from seed URLs.
- Respect `robots.txt` directives and `Crawl-delay` headers.
- Detect and avoid duplicate URLs and duplicate content.
- Store downloaded content for downstream indexing/processing.
- Re-crawl pages based on change frequency and importance.
Non-Functional Requirements
- Politeness: Never overwhelm a host — max 1 request/second per domain.
- Scalability: Horizontally scalable to handle 10× traffic surges.
- Robustness: Handle malformed HTML, infinite URL spaces, spider traps, and network failures gracefully.
- Extensibility: Pluggable modules for content extraction, link filtering, and priority scoring.
- Efficiency: Minimize DNS lookups, re-downloads, and wasted bandwidth.
Back-of-the-Envelope Estimation
| Metric | Value |
|---|---|
| Pages / month | 1,000,000,000 |
| Pages / second | ~386 (1B / 30 days / 86400s) |
| Avg page size | ~100 KB (HTML only) |
| Monthly bandwidth | ~100 TB (1B × 100 KB) |
| Peak pages / second (5×) | ~1,930 |
| Unique URLs stored | ~10 billion (seen URLs) |
| URL storage (avg 100 bytes × 10B) | ~1 TB |
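The arithmetic behind these rows is worth writing out once, in the same JavaScript used throughout this post:

// Sanity-checking the estimates above
const pagesPerMonth = 1e9;
const pagesPerSec = pagesPerMonth / (30 * 86400);         // ≈ 385.8
const peakPagesPerSec = pagesPerSec * 5;                  // ≈ 1,929
const monthlyBandwidthTB = pagesPerMonth * 100e3 / 1e12;  // 100 KB avg → ~100 TB
const urlStorageTB = 10e9 * 100 / 1e12;                   // 10B URLs × 100 B → ~1 TB
console.log({ pagesPerSec, peakPagesPerSec, monthlyBandwidthTB, urlStorageTB });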
High-Level Architecture
The crawler follows a loop:
Seed URLs
│
▼
┌─────────────────┐ ┌──────────────┐ ┌──────────────┐
│ URL Frontier │────▶│ DNS Resolver │────▶│ Fetcher │
│ (Priority Queue │ │ (Cached) │ │ (HTTP Client)│
│ + Politeness Q) │ └──────────────┘ └──────┬───────┘
└────────▲─────────┘ │
│ ▼
│ ┌──────────────────┐
│ │ Content Parser │
│ │ (HTML → DOM tree) │
│ └──────┬───────────┘
│ │
│ ▼
│ ┌──────────────────┐
│ │ URL Extractor │
│ │ (href, canonical) │
│ └──────┬───────────┘
│ │
│ ┌────────────────┐ │
│ │ Content Store │◀─────┐ ▼
│ │ (S3 / HDFS) │ │ ┌───────────────┐
│ └────────────────┘ │ │ Dedup Layer │
│ │ │ (Bloom Filter │
└─────────────────────────────────┤ │ + SimHash) │
│ └───────┬───────┘
│ │
└──────────┘
new URLs → frontier
content → store
Each component is designed to be independently scalable. The URL Frontier is the heart of the system — it decides what to crawl and when. The Dedup Layer prevents wasted work. The Fetcher is the I/O-heavy component that benefits most from horizontal scaling.
URL Frontier Design
The URL Frontier is the most critical component. It must answer two questions simultaneously: "What is the most important URL to crawl next?" (priority) and "Can we crawl this host right now without being rude?" (politeness). These are in tension — the highest-priority URL might belong to a host we just hit 200ms ago.
Priority Queue
URLs are scored and inserted into one of N priority buckets (e.g., N = 256). Each bucket is a FIFO queue. Higher-priority buckets are drained first.
Priority scoring factors:
- PageRank / link authority: Pages linked by many high-quality pages get higher priority.
- Freshness: Pages that change frequently (news homepages, stock tickers) are re-crawled sooner.
- Depth: URLs discovered at shallower BFS depth are prioritized (they tend to be more important).
- Domain importance: `wikipedia.org` ranks higher than a random personal blog.
- Content type hint: `.html` pages are prioritized over `.pdf` or `.zip` files.
// Simplified priority assignment: the raw score lands in [0, 100],
// so it is scaled into the 256 buckets before clamping
function prioritize(url, metadata) {
  let score = 0;
  score += metadata.pageRank * 40;         // 0–40 points
  score += metadata.domainAuthority * 20;  // 0–20 points
  score += (10 - metadata.depth) * 3;      // 0–30: shallower = higher
  score += metadata.changeFrequency * 10;  // 0–10: more changes = higher
  return Math.min(255, Math.max(0, Math.floor(score * 2.55)));
}

// Insert into bucket
frontier.buckets[prioritize(url, meta)].enqueue(url);
Politeness Queue
Once a URL is selected from a priority bucket, it enters the politeness layer. This layer maintains one queue per domain. A fetcher thread can only pull from a domain queue if sufficient time has elapsed since the last request to that domain.
// Per-domain queue structure
domainQueues = {
  "example.com": Queue([url1, url4, url9]),
  "wikipedia.org": Queue([url2, url7]),
  "news.ycombinator.com": Queue([url3]),
  // ...
};

// Rate limit: respect robots.txt Crawl-delay or default 1 req/sec
// (all times in epoch milliseconds)
domainTimers = {
  "example.com":   { lastFetch: 1714000000000, delay: 1000 },
  "wikipedia.org": { lastFetch: 1714000001000, delay: 5000 }, // Crawl-delay: 5
  // ...
};

// Fetcher picks next available domain (linear scan shown for clarity;
// the heap-based selector described below makes this O(log n))
function pickNext() {
  for (let domain in domainQueues) {
    if (now() - domainTimers[domain].lastFetch >= domainTimers[domain].delay) {
      let url = domainQueues[domain].dequeue();
      domainTimers[domain].lastFetch = now();
      return url;
    }
  }
  return null; // all domains on cooldown
}
Data Structures
At 10 billion URLs, the frontier can't fit entirely in RAM. A practical approach:
- Front queues (priority): In-memory priority buckets holding the next ~1M URLs. Refilled from disk when drained.
- Back queues (politeness): In-memory per-domain queues, lazily loaded. Only active domains are in memory.
- Disk backing: A RocksDB or LevelDB instance stores the full frontier. URLs are batch-loaded into memory queues.
- Heap-based domain selector: A min-heap keyed by `lastFetch + delay` tells the fetcher which domain is next eligible (see the sketch below).
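A minimal sketch of that selector follows. The heap stores one entry per active domain, keyed by the moment the domain next becomes eligible; the sift operations are written inline so the block is self-contained.

// Min-heap domain selector: pop the domain that becomes crawlable soonest,
// instead of scanning every per-domain queue on each fetch
class DomainSelector {
  constructor() {
    this.heap = []; // entries: { domain, nextEligible } (epoch ms)
  }
  push(entry) {
    this.heap.push(entry);
    let i = this.heap.length - 1;
    while (i > 0) { // sift up
      const parent = (i - 1) >> 1;
      if (this.heap[parent].nextEligible <= this.heap[i].nextEligible) break;
      [this.heap[parent], this.heap[i]] = [this.heap[i], this.heap[parent]];
      i = parent;
    }
  }
  pop() {
    const top = this.heap[0];
    const last = this.heap.pop();
    if (this.heap.length > 0) {
      this.heap[0] = last;
      let i = 0; // sift down
      while (true) {
        const l = 2 * i + 1, r = 2 * i + 2;
        let min = i;
        if (l < this.heap.length && this.heap[l].nextEligible < this.heap[min].nextEligible) min = l;
        if (r < this.heap.length && this.heap[r].nextEligible < this.heap[min].nextEligible) min = r;
        if (min === i) break;
        [this.heap[min], this.heap[i]] = [this.heap[i], this.heap[min]];
        i = min;
      }
    }
    return top;
  }
  // Returns the next crawlable domain, or null if all are on cooldown.
  // After fetching, the caller re-pushes { domain, nextEligible: now + delay }.
  nextDomain(now) {
    if (this.heap.length === 0 || this.heap[0].nextEligible > now) return null;
    return this.pop();
  }
}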
┌────────────────────────────────────────────────────┐
│ URL FRONTIER │
│ │
│ ┌─── Front Queues (Priority) ───┐ │
│ │ Bucket 255 ▓▓▓▓▓ (highest) │ │
│ │ Bucket 254 ▓▓▓ │ │
│ │ Bucket 253 ▓▓▓▓ │ │
│ │ ... │ │
│ │ Bucket 0 ▓ (lowest) │ │
│ └────────────┬──────────────────┘ │
│ │ drain by priority │
│ ▼ │
│ ┌─── Back Queues (Politeness) ──┐ │
│ │ example.com [url1, url4] │ │
│ │ wikipedia.org [url2] │ │
│ │ github.com [url5, url8] │ │
│ └────────────┬──────────────────┘ │
│ │ pick domain with │
│ │ earliest eligibility │
│ ▼ │
│ ┌──────────┐ │
│ │ Fetcher │ │
│ └──────────┘ │
└────────────────────────────────────────────────────┘
BFS vs DFS Traversal
The web is a directed graph. How we traverse it matters enormously:
| Aspect | BFS (Breadth-First) | DFS (Depth-First) |
|---|---|---|
| Discovery pattern | Explores all links at depth d before d+1 | Dives deep into one path before backtracking |
| Page importance | Finds high-authority pages first (homepages, nav pages) | May get stuck on deep, low-value paths |
| Politeness | Naturally spreads requests across many domains | Hammers one domain before moving on |
| Spider trap risk | Low — breadth limits depth exposure | High — can get trapped in infinite URL spaces |
| Memory | Higher (stores entire frontier at current depth) | Lower (only current path in stack) |
BFS is strongly preferred for web crawling because it discovers important pages early, distributes load across domains naturally, and is resistant to spider traps. In practice, crawlers use a priority-weighted BFS where the frontier is a priority queue rather than a strict FIFO.
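As a sketch of how priority-weighted BFS differs from the strict FIFO version, the loop below always drains the highest non-empty bucket. `fetchPage`, `extractLinks`, `depthOf`, and `lookupMetadata` are assumed helpers; `prioritize` is the scoring function from the frontier section.

// Priority-weighted BFS: strict BFS would use one FIFO queue; here the 256
// buckets drain highest-priority-first, and depth still feeds into the score,
// preserving the breadth-first bias toward shallow, high-authority pages
async function crawlLoop(frontier) {
  while (true) {
    let url = null;
    for (let b = 255; b >= 0; b--) { // highest non-empty bucket
      if (!frontier.buckets[b].isEmpty()) { url = frontier.buckets[b].dequeue(); break; }
    }
    if (url === null) break;            // frontier exhausted
    const page = await fetchPage(url);  // via the politeness layer
    for (const link of extractLinks(page)) {
      const meta = { ...lookupMetadata(link), depth: depthOf(url) + 1 };
      frontier.buckets[prioritize(link, meta)].enqueue(link);
    }
  }
}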
Robots.txt Parsing
The Robots Exclusion Protocol (RFC 9309) is the web's social contract between crawlers and site owners. A robots.txt file at the domain root tells crawlers what they may and may not access.
File Format
# Example: https://example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?q=
Allow: /admin/public/
Crawl-delay: 2
User-agent: Googlebot
Disallow: /no-google/
Allow: /
User-agent: BadBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
Parsing Rules
- User-agent matching: Find the most specific `User-agent` group that matches your crawler's name. If none matches, fall back to `User-agent: *`.
- Allow vs Disallow: For a given path, the longest matching rule wins. `Allow: /admin/public/` overrides `Disallow: /admin/` for paths under `/admin/public/`.
- Crawl-delay: The number of seconds to wait between successive requests to this domain. Not all crawlers respect this (Google ignores it), but a polite crawler should honor it.
- Sitemap: Points to an XML sitemap for URL discovery. Crawlers should fetch and parse sitemaps to find pages that aren't linked from anywhere.
- Wildcards: `*` matches any sequence of characters; `$` matches the end of the URL. E.g., `Disallow: /*.pdf$` blocks all PDF files.
// Robots.txt parser (simplified)
class RobotsParser {
constructor(robotsTxt, userAgent) {
this.rules = [];
this.crawlDelay = 1; // default 1 second
this.sitemaps = [];
this.parse(robotsTxt, userAgent);
}
parse(text, ua) {
  // Rules are collected per group; a matching specific group takes
  // precedence over the wildcard (*) group, per RFC 9309
  const specific = { rules: [], delay: null, matched: false };
  const wildcard = { rules: [], delay: null, matched: false };
  let current = [];      // groups the record being parsed applies to
  let inHeader = false;  // consecutive User-agent lines share one record
  for (let line of text.split('\n')) {
    line = line.split('#')[0].trim(); // strip comments
    if (!line) continue;
    // Split on the FIRST colon only, so URLs in Sitemap lines stay intact
    const idx = line.indexOf(':');
    if (idx < 0) continue;
    const directive = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (directive === 'user-agent') {
      if (!inHeader) { current = []; inHeader = true; } // a new record starts
      if (value.toLowerCase() === ua.toLowerCase()) { current.push(specific); specific.matched = true; }
      else if (value === '*') { current.push(wildcard); wildcard.matched = true; }
    } else if (directive === 'sitemap') {
      this.sitemaps.push(value); // Sitemap applies regardless of group
    } else {
      inHeader = false;
      for (const group of current) {
        if (directive === 'disallow' && value) group.rules.push({ path: value, allowed: false });
        else if (directive === 'allow' && value) group.rules.push({ path: value, allowed: true });
        else if (directive === 'crawl-delay') group.delay = parseFloat(value) || 1;
      }
    }
  }
  const active = specific.matched ? specific : wildcard;
  this.rules = active.rules;
  if (active.delay !== null) this.crawlDelay = active.delay;
  // Sort by path length descending (longest match first)
  this.rules.sort((a, b) => b.path.length - a.path.length);
}
isAllowed(urlPath) {
for (let rule of this.rules) {
if (this.matchPath(urlPath, rule.path)) {
return rule.allowed;
}
}
return true; // no matching rule = allowed
}
matchPath(urlPath, pattern) {
// Convert robots.txt pattern to regex
let regex = pattern
.replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape special chars
.replace(/\*/g, '.*') // * → .*
.replace(/\\\$$/g, '$'); // trailing \$ → $
return new RegExp('^' + regex).test(urlPath);
}
}
Caching robots.txt
Fetching robots.txt before every page is wasteful. The crawler caches it per domain with a TTL (typically 24 hours or per the HTTP Cache-Control header). A background thread refreshes expiring entries. If robots.txt returns a 4xx, the entire domain is assumed allowed. If it returns a 5xx, the crawler should back off and retry later.
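A sketch of that cache, under the policies just described; `fetchRobotsTxt` is an assumed HTTP helper returning `{ status, body }`, and `RobotsParser` is the class above.

// Per-domain robots.txt cache: 24h default TTL, 4xx → allow-all,
// 5xx → serve stale if available and retry sooner
class RobotsCache {
  constructor(userAgent, ttlMs = 24 * 3600 * 1000) {
    this.userAgent = userAgent;
    this.ttlMs = ttlMs;
    this.entries = new Map(); // domain → { parser, expiry }
  }
  async get(domain) {
    const cached = this.entries.get(domain);
    if (cached && cached.expiry > Date.now()) return cached.parser;
    const res = await fetchRobotsTxt(domain); // assumed: { status, body }
    if (res.status >= 500) {
      if (cached) return cached.parser; // keep stale entry on server errors
      const parser = new RobotsParser('', this.userAgent); // no rules yet
      this.entries.set(domain, { parser, expiry: Date.now() + 3600 * 1000 });
      return parser;
    }
    // 2xx: parse the file; 4xx: empty file → everything allowed
    const parser = new RobotsParser(res.status < 400 ? res.body : '', this.userAgent);
    this.entries.set(domain, { parser, expiry: Date.now() + this.ttlMs });
    return parser;
  }
}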
Duplicate Detection
Duplicates waste bandwidth, storage, and processing time. There are two kinds:
URL Deduplication
The same page can be reached through many different URLs:
- `http://example.com/page` vs `https://example.com/page` (scheme)
- `example.com/page/` vs `example.com/page` (trailing slash)
- `example.com/page?utm_source=twitter` vs `example.com/page` (tracking params)
- `example.com/Page` vs `example.com/page` (case sensitivity)
URL canonicalization normalizes URLs before dedup:
function canonicalize(url) {
  let u = new URL(url);
  // 1–2. Lowercase scheme/host and drop default ports — the WHATWG URL
  // parser does both automatically; shown because other URL libraries don't
  u.protocol = u.protocol.toLowerCase();
  u.hostname = u.hostname.toLowerCase();
  // 3. Remove fragment
  u.hash = '';
  // 4. Remove tracking parameters
  let strip = ['utm_source','utm_medium','utm_campaign','utm_term','utm_content','fbclid','gclid'];
  strip.forEach(p => u.searchParams.delete(p));
  // 5. Sort remaining params
  u.searchParams.sort();
  // 6. Remove trailing slash (except root)
  if (u.pathname !== '/' && u.pathname.endsWith('/')) {
    u.pathname = u.pathname.slice(0, -1);
  }
  // 7. Decode unnecessary percent-encoding (guarded: malformed sequences
  // throw, and blanket decoding can change meaning, e.g. %2F → '/';
  // a production canonicalizer decodes only unreserved characters)
  try { u.pathname = decodeURIComponent(u.pathname); } catch (e) { /* leave encoded */ }
  return u.toString();
}
After canonicalization, we check membership using a Bloom filter:
// Bloom filter for URL dedup
// 10 billion URLs, 1% false positive rate → ~11.5 GB memory
// k = 7 hash functions, m = 96 billion bits
class URLBloomFilter {
constructor(expectedItems = 10_000_000_000, fpRate = 0.01) {
this.m = Math.ceil(-expectedItems * Math.log(fpRate) / (Math.log(2) ** 2));
this.k = Math.round((this.m / expectedItems) * Math.log(2));
this.bits = new BitArray(this.m); // ~11.5 GB
}
add(url) {
for (let i = 0; i < this.k; i++) {
this.bits.set(this.hash(url, i) % this.m);
}
}
mightContain(url) {
for (let i = 0; i < this.k; i++) {
if (!this.bits.get(this.hash(url, i) % this.m)) return false;
}
return true; // possibly seen (small false positive rate)
}
hash(url, seed) {
return murmurHash3(url, seed); // fast, well-distributed hash
}
}
Content Deduplication (SimHash)
Different URLs can serve identical or near-identical content (mirrors, syndicated articles, paginated duplicates). Byte-level hashing (MD5/SHA) catches exact duplicates but misses near-duplicates that differ by ads, timestamps, or minor layout changes.
SimHash (invented by Moses Charikar, used by Google) produces a fingerprint where similar documents produce similar hashes. Two documents are near-duplicates if their SimHash values differ by only a few bits (Hamming distance ≤ 3).
SimHash Algorithm
function simHash(document, hashBits = 64) {
// Step 1: Extract features (weighted shingles)
let tokens = tokenize(document);
let shingles = [];
for (let i = 0; i <= tokens.length - 3; i++) {
shingles.push(tokens.slice(i, i + 3).join(' ')); // 3-word shingles
}
// Step 2: Initialize a V vector of size hashBits, all zeros
let V = new Array(hashBits).fill(0);
// Step 3: For each shingle, compute its hash and update V
for (let shingle of shingles) {
let weight = getWeight(shingle); // TF-IDF or frequency
let h = hash64(shingle); // 64-bit hash
for (let i = 0; i < hashBits; i++) {
if ((h >> BigInt(i)) & 1n) {
V[i] += weight; // bit is 1 → add weight
} else {
V[i] -= weight; // bit is 0 → subtract weight
}
}
}
// Step 4: Convert V to fingerprint
let fingerprint = 0n;
for (let i = 0; i < hashBits; i++) {
if (V[i] > 0) {
fingerprint |= (1n << BigInt(i));
}
}
return fingerprint;
}
// Hamming distance between two SimHash fingerprints
function hammingDistance(a, b) {
let xor = a ^ b;
let count = 0;
while (xor) {
count += Number(xor & 1n);
xor >>= 1n;
}
return count;
}
// Near-duplicate if Hamming distance ≤ 3
function isNearDuplicate(fp1, fp2) {
return hammingDistance(fp1, fp2) <= 3;
}
Efficient SimHash Lookup
Checking a new fingerprint against billions of stored fingerprints for Hamming distance ≤ 3 is expensive. The table-based approach partitions the 64-bit hash into blocks and uses multiple lookup tables:
// Partition 64-bit SimHash into 4 blocks of 16 bits each
// Store in 4 tables, keyed by each block
// If two hashes differ by ≤3 bits total, at least ONE of the
// 4 blocks must be identical (pigeonhole principle)
tables = [HashMap(), HashMap(), HashMap(), HashMap()];
function store(fingerprint) {
for (let i = 0; i < 4; i++) {
let block = (fingerprint >> BigInt(i * 16)) & 0xFFFFn;
tables[i].getOrCreate(block).add(fingerprint);
}
}
function findNearDuplicates(fingerprint) {
let candidates = new Set();
for (let i = 0; i < 4; i++) {
let block = (fingerprint >> BigInt(i * 16)) & 0xFFFFn;
for (let stored of tables[i].get(block, [])) {
if (hammingDistance(fingerprint, stored) <= 3) {
candidates.add(stored);
}
}
}
return candidates;
}
MinHash for Jaccard Similarity
MinHash is an alternative for set-similarity detection. It estimates Jaccard similarity between two documents' shingle sets without comparing them directly.
function minHash(shingleSet, numHashes = 200) {
let signature = new Array(numHashes).fill(Infinity);
for (let shingle of shingleSet) {
for (let i = 0; i < numHashes; i++) {
let h = hashWithSeed(shingle, i);
signature[i] = Math.min(signature[i], h);
}
}
return signature;
}
// Estimated Jaccard similarity
function jaccardEstimate(sig1, sig2) {
let matches = 0;
for (let i = 0; i < sig1.length; i++) {
if (sig1[i] === sig2[i]) matches++;
}
return matches / sig1.length;
}
// Near-duplicate if Jaccard ≥ 0.8
function isNearDup(sig1, sig2) {
return jaccardEstimate(sig1, sig2) >= 0.8;
}
DNS Resolution Caching
DNS resolution is a hidden bottleneck. Each fetch requires resolving the hostname to an IP address. Standard OS DNS caches are too small for a crawler hitting millions of unique domains.
Crawler DNS Strategy
- Local DNS cache: An in-process LRU cache holding ~1M domain → IP mappings. TTL respects the DNS record's TTL (usually 5 minutes to 24 hours).
- Dedicated DNS resolver: Run a local `unbound` or `dnsmasq` instance with a large cache (1–2 GB RAM). This avoids saturating upstream DNS servers.
- Batch prefetching: When loading URLs from the frontier into the politeness queue, resolve DNS for all of them in parallel using async I/O.
- Fallback: If resolution fails, retry with a different DNS provider (Google DNS 8.8.8.8, Cloudflare 1.1.1.1). After 3 failures, mark the domain as unreachable and skip its URLs for 24 hours.
const dns = require('node:dns').promises;

class DNSCache {
  constructor(maxEntries = 1_000_000) {
    this.cache = new LRUCache(maxEntries);      // assumed LRU implementation
    this.negativeCache = new LRUCache(100_000); // failed lookups
  }
  async resolve(hostname) {
    // Check positive cache
    let cached = this.cache.get(hostname);
    if (cached && cached.expiry > Date.now()) return cached.ip;
    // Check negative cache (don't retry too soon)
    let neg = this.negativeCache.get(hostname);
    if (neg && neg.retryAfter > Date.now()) return null;
    try {
      // { ttl: true } makes resolve4 return [{ address, ttl }] records
      let [record] = await dns.resolve4(hostname, { ttl: true });
      this.cache.set(hostname, {
        ip: record.address,
        expiry: Date.now() + record.ttl * 1000
      });
      return record.address;
    } catch (err) {
      this.negativeCache.set(hostname, {
        retryAfter: Date.now() + 24 * 3600 * 1000 // 24h backoff
      });
      return null;
    }
  }
}
Content Parsing & Link Extraction
After fetching a page, the crawler must extract useful information and discover new URLs.
Parsing Pipeline
function processPage(url, html, headers) {
// 1. Charset detection & decoding
let charset = detectCharset(headers['content-type'], html);
let text = decode(html, charset);
// 2. Parse HTML into DOM
let dom = parseHTML(text); // using a lenient parser (like htmlparser2)
  // 3. Check for meta robots directives
  let followLinks = true;
  let metaRobots = dom.querySelector('meta[name="robots"]');
  if (metaRobots) {
    let content = (metaRobots.getAttribute('content') || '').toLowerCase();
    if (content.includes('noindex')) skipIndexing(url);
    if (content.includes('nofollow')) followLinks = false; // don't extract links
  }
  // 4. Extract canonical URL (the raw attribute may be relative)
  let canonical = dom.querySelector('link[rel="canonical"]');
  if (canonical) {
    let canonicalUrl = resolveUrl(url, canonical.getAttribute('href'));
    if (canonicalUrl !== url) {
      // This page declares itself a duplicate of the canonical URL
      markAsDuplicate(url, canonicalUrl);
    }
  }
  // 5. Extract links (skipped entirely when meta robots says nofollow)
  let links = [];
  if (followLinks) {
    for (let anchor of dom.querySelectorAll('a[href]')) {
      let href = anchor.getAttribute('href');
      let rel = anchor.getAttribute('rel') || '';
      // Skip nofollow links
      if (rel.includes('nofollow')) continue;
      // Resolve relative URLs
      let absoluteUrl = resolveUrl(url, href);
      // Filter: only http/https, skip mailto:, javascript:, etc.
      if (absoluteUrl.startsWith('http://') || absoluteUrl.startsWith('https://')) {
        links.push(canonicalize(absoluteUrl));
      }
    }
  }
// 6. Extract metadata
let metadata = {
title: dom.querySelector('title')?.textContent,
description: dom.querySelector('meta[name="description"]')?.content,
language: dom.documentElement.getAttribute('lang'),
lastModified: headers['last-modified'],
};
return { links, metadata, text: extractText(dom) };
}
Handling Different Content Types
| Content-Type | Action |
|---|---|
| `text/html` | Full parsing, link extraction, indexing |
| `application/pdf` | PDF text extraction (Apache Tika), index but no link extraction |
| `application/xml` | Check if sitemap — parse for URLs if so |
| `image/*`, `video/*` | Skip or store metadata only (alt text from referring page) |
| `application/javascript` | Skip — no crawlable content |
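This table translates directly into a dispatch on the normalized `Content-Type` header. In the sketch below, `extractPdfText`, `looksLikeSitemap`, `parseSitemapUrls`, and `storeMetadataOnly` are illustrative handler names:

// Dispatch on Content-Type; drop the "; charset=..." suffix before matching
function dispatchByContentType(url, contentType, body, headers) {
  const type = (contentType || '').split(';')[0].trim().toLowerCase();
  if (type === 'text/html') return processPage(url, body, headers);
  if (type === 'application/pdf') return extractPdfText(url, body); // e.g. via Tika
  if (type === 'application/xml' || type === 'text/xml') {
    return looksLikeSitemap(body) ? parseSitemapUrls(body) : null;
  }
  if (type.startsWith('image/') || type.startsWith('video/')) {
    return storeMetadataOnly(url);
  }
  return null; // application/javascript and everything else: skip
}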
Storage Architecture
URL Database
Tracks every URL the crawler has seen, with metadata:
CREATE TABLE urls (
  url_hash BIGINT PRIMARY KEY, -- 64-bit hash of canonical URL
  canonical_url TEXT NOT NULL,
  domain TEXT NOT NULL,
  last_crawled TIMESTAMP,
  next_crawl TIMESTAMP,        -- scheduled recrawl time
  http_status SMALLINT,
  content_hash BIGINT,         -- SimHash fingerprint
  priority SMALLINT DEFAULT 0,
  crawl_count INT DEFAULT 0,
  change_freq INTERVAL         -- estimated change frequency
);

-- PostgreSQL defines indexes outside CREATE TABLE
CREATE INDEX idx_next_crawl ON urls (next_crawl);
CREATE INDEX idx_domain ON urls (domain);
With 10 billion rows, this table is sharded by url_hash across multiple database nodes. Each shard holds ~500M–1B rows.
Content Store
Raw HTML and extracted text are stored in an object store (S3, HDFS, or Google Cloud Storage):
// Content storage layout
s3://crawler-content/
├── raw/ # Raw HTML responses
│ ├── 2026/04/15/
│ │ ├── {url_hash_1}.html.gz # gzipped HTML
│ │ ├── {url_hash_2}.html.gz
│ │ └── ...
├── parsed/ # Extracted text + metadata (JSON)
│ ├── 2026/04/15/
│ │ ├── {url_hash_1}.json.gz
│ │ └── ...
└── index/ # Manifest files for batch processing
├── 2026-04-15-manifest.jsonl
└── ...
At 100 KB average page size × 1B pages/month = ~100 TB/month. With gzip compression (~5:1 ratio), actual storage is ~20 TB/month. Lifecycle policies automatically tier older data to cheaper storage (S3 Glacier) after 90 days.
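The 90-day Glacier transition can be expressed as a standard S3 lifecycle rule. A sketch using the AWS SDK for JavaScript v3, with illustrative bucket and rule names:

// Tier raw crawl data to Glacier after 90 days (ESM module, top-level await)
import { S3Client, PutBucketLifecycleConfigurationCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: 'crawler-content',
  LifecycleConfiguration: {
    Rules: [{
      ID: 'tier-raw-to-glacier',
      Status: 'Enabled',
      Filter: { Prefix: 'raw/' }, // only raw HTML; parsed data stays hot
      Transitions: [{ Days: 90, StorageClass: 'GLACIER' }],
    }],
  },
}));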
Distributed Crawling
A single machine can fetch perhaps 50–100 pages/second. At 386 pages/second sustained, we need at least 4–8 crawler nodes. At peak (1,930 pages/sec), we need 20–40 nodes.
Partitioning Strategy
Two main approaches:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| URL hash partition | node = hash(url) % N | Even distribution, simple | Same domain spread across nodes — harder to enforce per-domain politeness |
| Domain-based partition | node = hash(domain) % N | All URLs for a domain on one node — easy politeness enforcement | Skewed load (wikipedia.org has vastly more URLs than a personal blog) |
Domain-based partitioning is preferred because it localizes politeness state (robots.txt cache, crawl-delay timers, per-domain queues) to a single node, eliminating distributed coordination.
Coordination Architecture
┌───────────────────────────────────────────────┐
│ Coordinator Service │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Assigner │ │ Monitor │ │ Rebalancer│ │
│ └───────────┘ └───────────┘ └───────────┘ │
└───────────────────────┬───────────────────────┘
│ domain assignments
┌─────────────┼─────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Crawler │ │ Crawler │ │ Crawler │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ │ │ │ │ │
│ domains: │ │ domains: │ │ domains: │
│ A-F hash │ │ G-M hash │ │ N-Z hash │
│ range │ │ range │ │ range │
└────────────┘ └────────────┘ └────────────┘
│ │ │
▼ ▼ ▼
┌────────────────────────────────────────┐
│ Shared URL Database (sharded) │
└────────────────────────────────────────┘
┌────────────────────────────────────────┐
│ Content Store (S3 / HDFS) │
└────────────────────────────────────────┘
- Assigner: Maps domain hash ranges to crawler nodes using consistent hashing (see the ring sketch after this list).
- Monitor: Tracks node health via heartbeats. Detects failed nodes within 30 seconds.
- Rebalancer: When a node fails or is added, reassigns domain ranges with minimal disruption.
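A minimal consistent-hash ring for the Assigner might look like the following; `hash32` is an assumed 32-bit string hash (e.g., FNV-1a), and virtual nodes smooth out the skew between large and small domain ranges.

// Consistent-hash ring for domain → node assignment. Unlike hash(domain) % N,
// adding or removing a node only remaps the keys on that node's arcs.
class ConsistentHashRing {
  constructor(nodes, vnodes = 100) {
    this.ring = []; // sorted [{ point, node }]
    for (const node of nodes) {
      for (let v = 0; v < vnodes; v++) {
        this.ring.push({ point: hash32(`${node}#${v}`), node });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }
  // The first ring point clockwise from hash(domain) owns the domain
  getNode(domain) {
    const h = hash32(domain);
    let lo = 0, hi = this.ring.length - 1;
    if (h > this.ring[hi].point) return this.ring[0].node; // wrap around
    while (lo < hi) { // binary search for first point >= h
      const mid = (lo + hi) >> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo].node;
  }
}

const ring = new ConsistentHashRing(['node1', 'node2', 'node3']);
ring.getNode('wikipedia.org'); // → owning crawler node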
Cross-Node URL Exchange
When Crawler Node 1 finds a link to github.com (owned by Node 3), it must route the URL to the correct node:
// After extracting links from a page (ring = the consistent-hash ring above)
for (let link of extractedLinks) {
  let targetNode = ring.getNode(getDomain(link)); // ring lookup, not hash % N
  if (targetNode === myNodeId) {
    localFrontier.add(link);               // local — add directly
  } else {
    outboundBuffers[targetNode].add(link); // remote — buffer
  }
}

// Flush buffers periodically (every 1 second or 1,000 URLs)
for (let node in outboundBuffers) {
  if (outboundBuffers[node].size > 1000 || timeSinceFlush(node) > 1000) {
    sendBatch(node, outboundBuffers[node].drain());
  }
}
Recrawl Strategy
Not all pages change at the same rate. An effective recrawl policy prioritizes pages that are likely to have changed since the last visit.
Change Frequency Estimation
- Historical change rate: If a page changed in 3 of the last 5 crawls, it changes ~60% of the time. Recrawl more often.
- HTTP headers: `Last-Modified` and `ETag` allow conditional GETs (`If-Modified-Since`), saving bandwidth when the page hasn't changed.
- Page type heuristic: News homepages change every few minutes. Product pages change weekly. Archive pages change yearly.
- Sitemap `<changefreq>`: Site owners declare expected change frequency (hourly, daily, weekly, etc.).
// Adaptive recrawl scheduling (intervals in milliseconds)
const HOUR = 3600 * 1000, DAY = 24 * HOUR;

function scheduleRecrawl(url, crawlHistory) {
  let changes = crawlHistory.filter(c => c.changed).length;
  let total = crawlHistory.length;
  let changeRate = total > 0 ? changes / total : 0.5; // default 50%
  // Exponential backoff: pages that rarely change get crawled less often
  let baseInterval;
  if (changeRate > 0.8) baseInterval = 1 * HOUR;       // very dynamic
  else if (changeRate > 0.5) baseInterval = 6 * HOUR;
  else if (changeRate > 0.2) baseInterval = 1 * DAY;
  else if (changeRate > 0.05) baseInterval = 7 * DAY;
  else baseInterval = 30 * DAY;                        // nearly static
  // Adjust by page importance (PageRank)
  let importance = getPageRank(url);
  let interval = baseInterval / (1 + importance); // important pages crawled sooner
  return Date.now() + interval;
}
Recrawls should send `If-Modified-Since` and `If-None-Match` headers. If the server returns `304 Not Modified`, you save bandwidth and know the content hasn't changed, without downloading the full page.
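A sketch of such a conditional fetch using the standard Fetch API, where `etag` and `lastModified` come from the URL database row for this page:

// Conditional GET: a 304 response carries headers only, no body
async function conditionalFetch(url, etag, lastModified) {
  const headers = {};
  if (etag) headers['If-None-Match'] = etag;
  if (lastModified) headers['If-Modified-Since'] = lastModified;
  const res = await fetch(url, { headers });
  if (res.status === 304) {
    return { changed: false, body: null };
  }
  return {
    changed: true,
    body: await res.text(),
    etag: res.headers.get('etag'),                  // store for the next crawl
    lastModified: res.headers.get('last-modified'),
  };
}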
Handling Spider Traps
Spider traps are pages (intentional or accidental) that generate an infinite number of unique URLs, causing the crawler to waste resources endlessly.
Common Trap Types
- Infinite calendar: `/calendar?date=2026-04-15` links to `?date=2026-04-16`, then `?date=2026-04-17`, forever.
- Session IDs in URLs: `/page?sid=abc123` — each visit generates a new session ID, creating a new "unique" URL.
- Soft 404s: Server returns 200 OK for any URL path (`/anything/you/type`), making the URL space infinite.
- Deliberately hostile: `<a href="/trap/<random_string>">` — a page that generates links to random sub-paths.
- Parameterized pagination: `/search?page=1`, `?page=2`, ... `?page=999999999`.
Defense Mechanisms
- Max depth limit: Don't follow links beyond depth 15–20 from seed URLs. Deep pages are rarely valuable.
- Per-domain URL budget: Cap the number of URLs crawled per domain per cycle (e.g., 100,000). Prevents any single domain from consuming the entire crawl budget.
- URL pattern detection: If we see many URLs matching the same pattern with only a changing parameter (e.g., `/calendar?date=*`), sample a few and stop.
- URL length limit: URLs longer than 2,048 characters are suspicious. Skip them.
- Blacklisting: Maintain a manual blacklist of known trap domains, updated by operators.
// Spider trap detection
class TrapDetector {
constructor() {
this.domainStats = new Map(); // domain → { count, patterns, hashes }
}
check(url, domain, contentHash) {
let stats = this.domainStats.get(domain) || { count: 0, patterns: {}, recentHashes: [] };
stats.count++;
// Check 1: Per-domain URL budget
if (stats.count > 100_000) {
return { isTrap: true, reason: 'URL budget exceeded' };
}
// Check 2: URL pattern clustering
let pattern = url.replace(/[0-9]+/g, '{N}')
.replace(/[a-f0-9]{8,}/gi, '{HASH}');
stats.patterns[pattern] = (stats.patterns[pattern] || 0) + 1;
if (stats.patterns[pattern] > 1000) {
return { isTrap: true, reason: `Pattern "${pattern}" seen 1000+ times` };
}
// Check 3: Content similarity
stats.recentHashes.push(contentHash);
if (stats.recentHashes.length > 10) {
stats.recentHashes.shift();
let allSame = stats.recentHashes.every(h =>
hammingDistance(h, stats.recentHashes[0]) <= 3
);
if (allSame) {
return { isTrap: true, reason: 'Last 10 pages are near-identical' };
}
}
this.domainStats.set(domain, stats);
return { isTrap: false };
}
}
JavaScript Rendering
Modern single-page applications (React, Angular, Vue) render content client-side. A traditional crawler that only reads raw HTML will see an empty <div id="root"></div> with no links or content.
Rendering Approaches
| Approach | Description | Latency | Cost |
|---|---|---|---|
| Static HTML only | Parse raw HTML as-is | ~10ms | Very low |
| Headless browser | Chromium (Puppeteer) executes JS, waits for DOM | 2–10s | Very high (CPU + RAM) |
| Lightweight JS engine | V8 isolate without full browser (no layout/paint) | 500ms–2s | Medium |
| Hybrid | Try static first; if minimal content detected, re-queue for JS rendering | Varies | Optimal |
The hybrid approach is used by Google's crawler (Googlebot). It first processes the raw HTML (fast, cheap). If the page contains framework markers (<div id="root">, <app-root>, noscript tags with content hints), it's re-queued for a separate rendering service that runs headless Chrome at scale.
// Hybrid rendering decision
function needsJSRendering(html) {
// Check for SPA framework signatures
let spaMarkers = [
'<div id="root"></div>',
'<div id="app"></div>',
'<app-root>',
'window.__NEXT_DATA__',
'window.__NUXT__',
];
let hasMarker = spaMarkers.some(m => html.includes(m));
// Check if meaningful content exists
let textContent = stripTags(html).trim();
let hasContent = textContent.length > 500; // at least 500 chars of text
// Needs JS rendering if SPA markers present AND no meaningful static content
return hasMarker && !hasContent;
}
// Rendering service (Puppeteer pool)
async function renderWithJS(url) {
let browser = await browserPool.acquire();
try {
let page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', req => {
// Block images, fonts, stylesheets — only need HTML + JS
if (['image','stylesheet','font','media'].includes(req.resourceType())) {
req.abort();
} else {
req.continue();
}
});
await page.goto(url, { waitUntil: 'networkidle2', timeout: 15000 });
let html = await page.content();
await page.close();
return html;
} finally {
browserPool.release(browser);
}
}
Complete Pipeline Summary
┌──────────────────────────────────────────────────────────────────────┐
│ WEB CRAWLER PIPELINE │
│ │
│ ┌──────┐ ┌───────────┐ ┌────────────┐ ┌──────────────┐ │
│ │ Seed │──▶│ URL │──▶│ robots.txt │──▶│ DNS │ │
│ │ URLs │ │ Frontier │ │ Checker │ │ Resolver │ │
│ └──────┘ │ │ └────────────┘ │ (cached) │ │
│ │ Priority │ └──────┬───────┘ │
│ │ + Polite │ │ │
│ └─────▲─────┘ ▼ │
│ │ ┌──────────────┐ │
│ │ │ Fetcher │ │
│ │ │ (HTTP/1.1 │ │
│ │ │ HTTP/2) │ │
│ │ └──────┬───────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ JS Render? │ │
│ │ │ (if needed) │ │
│ │ └──────┬───────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────┐ ┌──────────────┐ │
│ │ │ Content │◀────│ Content │ │
│ │ │ Store │ │ Parser │ │
│ │ │ (S3) │ └──────┬───────┘ │
│ │ └─────────┘ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ URL │ │
│ │ │ Extractor │ │
│ │ └──────┬───────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ Dedup │ │
│ │ │ (Bloom + │ │
│ │ new URLs │ SimHash) │ │
│ └─────────────────────────┤ │ │
│ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Key Operational Metrics
- Pages crawled / second — primary throughput metric.
- URL frontier size — if growing unboundedly, something is wrong (possibly a spider trap).
- Duplicate rate — percentage of URLs/content filtered by dedup. Should be 30–60%.
- Error rate — 4xx/5xx responses, DNS failures, timeouts. Alert if > 5%.
- Politeness compliance — verify no domain receives > 1 req/sec (or its crawl-delay).
- Freshness — average age of content in the index. Lower is better for news; acceptable for archives.
- Rendering queue depth — how many pages are waiting for JS rendering. Monitor for backlog.
Suggested Technology Stack
| Component | Technology |
|---|---|
| URL Frontier | RocksDB (disk) + in-memory priority heaps |
| URL Dedup | Bloom filter (in-memory, 12 GB) |
| Content Dedup | SimHash stored in Redis / Cassandra |
| DNS Cache | Local unbound + in-process LRU |
| Fetcher | Async HTTP client (Netty / aiohttp / Go net/http) |
| HTML Parser | htmlparser2 (JS) / jsoup (Java) / lxml (Python) |
| JS Rendering | Puppeteer / Playwright pool (headless Chromium) |
| Content Store | Amazon S3 / HDFS |
| URL Database | Sharded PostgreSQL or Cassandra |
| Cross-Node Messaging | Kafka / gRPC streams |
| Coordination | ZooKeeper / etcd for leader election + config |
In the next post, we'll design a Payment System — handling money with correctness guarantees, idempotency, and distributed transaction management.