High Level Design Series · Real-World Designs · Post 61 of 70

Design: Web Crawler

The Problem

A web crawler (or spider) is a system that systematically browses the World Wide Web, downloading pages for indexing, archiving, or analysis. Search engines like Google, Bing, and DuckDuckGo depend on crawlers to discover and update billions of pages. The engineering challenge is immense: the web has over 4.5 billion indexed pages, each page can link to hundreds of others, and the whole corpus is in constant flux — pages are created, updated, and deleted every second.

Building a crawler that handles this at scale requires careful design of its URL scheduling (the frontier), politeness mechanisms, deduplication layers, and distributed architecture.

Requirements

Functional Requirements

  1. Start from a set of seed URLs, download pages, and discover new URLs from their links
  2. Respect robots.txt rules and per-domain politeness limits
  3. Detect duplicate URLs and duplicate (or near-duplicate) content
  4. Store raw HTML and extracted text for downstream indexing or analysis
  5. Recrawl pages on a schedule informed by how often they change

Non-Functional Requirements

  1. Scalability: ~1 billion pages per month (~386 pages/second sustained)
  2. Politeness: never overwhelm any single host
  3. Robustness: tolerate malformed HTML, unresponsive servers, and spider traps
  4. Freshness: recrawl frequently-changing pages sooner than static ones
  5. Extensibility: support new content types (PDF, sitemaps, JS-rendered pages)

Back-of-the-Envelope Estimation

| Metric                            | Value                           |
|-----------------------------------|---------------------------------|
| Pages / month                     | 1,000,000,000                   |
| Pages / second                    | ~386 (1B / 30 days / 86,400 s)  |
| Avg page size                     | ~100 KB (HTML only)             |
| Monthly bandwidth                 | ~100 TB (1B × 100 KB)           |
| Peak pages / second (5×)          | ~1,930                          |
| Unique URLs stored                | ~10 billion (seen URLs)         |
| URL storage (avg 100 bytes × 10B) | ~1 TB                           |
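
These figures are easy to re-derive (illustrative arithmetic only):

// Re-deriving the estimation table
const pagesPerMonth = 1_000_000_000;
const pagesPerSec   = pagesPerMonth / (30 * 86_400);   // ≈ 386
const monthlyTB     = pagesPerMonth * 100_000 / 1e12;  // ≈ 100 TB
const peakPagesSec  = pagesPerSec * 5;                 // ≈ 1,930
const urlStorageTB  = 10_000_000_000 * 100 / 1e12;     // = 1 TB

console.log(Math.round(pagesPerSec), monthlyTB, Math.round(peakPagesSec), urlStorageTB);
// → 386 100 1929 1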

High-Level Architecture

The crawler follows a loop:

Seed URLs
   │
   ▼
┌──────────────────┐     ┌──────────────┐     ┌──────────────┐
│   URL Frontier   │────▶│ DNS Resolver │────▶│   Fetcher    │
│ (Priority Queue  │     │   (Cached)   │     │ (HTTP Client)│
│ + Politeness Q)  │     └──────────────┘     └──────┬───────┘
└────────▲─────────┘                                 │
         │                                           ▼
         │                                  ┌───────────────────┐
         │                                  │  Content Parser   │
         │                                  │ (HTML → DOM tree) │
         │                                  └────────┬──────────┘
         │                                           │
         │                                           ▼
         │                                  ┌───────────────────┐
         │                                  │   URL Extractor   │
         │                                  │ (href, canonical) │
         │                                  └────────┬──────────┘
         │                                           │
         │                                           ▼
         │       ┌───────────────┐          ┌───────────────┐
         │       │ Content Store │◀─────────│  Dedup Layer  │
         │       │  (S3 / HDFS)  │  content │ (Bloom Filter │
         │       └───────────────┘          │  + SimHash)   │
         │                                  └───────┬───────┘
         │                                          │ new URLs
         └──────────────────────────────────────────┘

Each component is designed to be independently scalable. The URL Frontier is the heart of the system — it decides what to crawl and when. The Dedup Layer prevents wasted work. The Fetcher is the I/O-heavy component that benefits most from horizontal scaling.


URL Frontier Design

The URL Frontier is the most critical component. It must answer two questions simultaneously: "What is the most important URL to crawl next?" (priority) and "Can we crawl this host right now without being rude?" (politeness). These are in tension — the highest-priority URL might belong to a host we just hit 200ms ago.

Priority Queue

URLs are scored and inserted into one of N priority buckets (e.g., N = 256). Each bucket is a FIFO queue. Higher-priority buckets are drained first.

Priority scoring factors include the page's PageRank, the authority of its domain, its link depth from the seeds (shallower is better), and how often it has historically changed:

// Simplified priority assignment
function prioritize(url, metadata) {
  let score = 0;
  score += metadata.pageRank * 40;        // 0–40 points
  score += metadata.domainAuthority * 20;  // 0–20 points
  score += (10 - metadata.depth) * 3;     // 0–30: shallower = higher
  score += metadata.changeFrequency * 10; // 0–10: more changes = higher
  return Math.min(255, Math.max(0, Math.floor(score)));
}

// Insert into bucket
frontier.buckets[prioritize(url, meta)].enqueue(url);

Politeness Queue

Once a URL is selected from a priority bucket, it enters the politeness layer. This layer maintains one queue per domain. A fetcher thread can only pull from a domain queue if sufficient time has elapsed since the last request to that domain.

// Per-domain queue structure (contents illustrative)
domainQueues = {
  "example.com":          Queue([url1, url4, url9]),
  "wikipedia.org":        Queue([url2, url7]),
  "news.ycombinator.com": Queue([url3]),
  // ...
}

// Rate limit: respect robots.txt Crawl-delay, or default to 1 req/sec.
// lastFetch and delay are both in milliseconds.
domainTimers = {
  "example.com":   { lastFetch: 1714000000000, delay: 1000 },
  "wikipedia.org": { lastFetch: 1714000001000, delay: 5000 },
  // ...
}

// Fetcher picks the next domain whose cooldown has elapsed
function pickNext() {
  for (const domain in domainQueues) {
    if (domainQueues[domain].isEmpty()) continue; // nothing to crawl here
    if (now() - domainTimers[domain].lastFetch >= domainTimers[domain].delay) {
      const url = domainQueues[domain].dequeue();
      domainTimers[domain].lastFetch = now();
      return url;
    }
  }
  return null; // all domains on cooldown
}
Key insight: The two-tier design (priority → politeness) ensures that we always crawl important pages first while never overwhelming any single host. This is the architecture introduced by the Mercator crawler (developed at Compaq and later used by AltaVista) and carried forward by its descendants.

Data Structures

At 10 billion URLs, the frontier can't fit entirely in RAM. A practical approach keeps only the head of each queue in memory and spills the rest to disk:

┌───────────────────────────────────────────────────┐
│                  URL FRONTIER                     │
│                                                   │
│  ┌── Front Queues (Priority) ──┐                  │
│  │ Bucket 255 ▓▓▓▓▓ (highest)  │                  │
│  │ Bucket 254 ▓▓▓              │                  │
│  │ Bucket 253 ▓▓▓▓             │                  │
│  │   ...                       │                  │
│  │ Bucket   0 ▓ (lowest)       │                  │
│  └────────────┬────────────────┘                  │
│               │ drain by priority                 │
│               ▼                                   │
│  ┌── Back Queues (Politeness) ─┐                  │
│  │ example.com    [url1, url4] │                  │
│  │ wikipedia.org  [url2]       │                  │
│  │ github.com     [url5, url8] │                  │
│  └────────────┬────────────────┘                  │
│               │ pick domain with                  │
│               │ earliest eligibility              │
│               ▼                                   │
│          ┌─────────┐                              │
│          │ Fetcher │                              │
│          └─────────┘                              │
└───────────────────────────────────────────────────┘
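
Below is a minimal sketch of one disk-backed bucket, assuming a RocksDB-like ordered key-value store behind a hypothetical db wrapper (put, scanPrefix, delete are illustrative). Only the head of each queue stays in RAM; the tail lives on disk, so the frontier survives restarts and fits in modest memory:

// One disk-backed FIFO bucket (sketch, not production code)
class DiskBackedQueue {
  constructor(db, bucketId, headSize = 10_000) {
    this.db = db;           // ordered key-value store (RocksDB-like)
    this.bucketId = bucketId;
    this.head = [];         // in-memory head of the queue
    this.headSize = headSize;
    this.tailSize = 0;      // how many URLs are currently spilled to disk
    this.nextSeq = 0;       // zero-padded so keys sort in insertion order
  }

  enqueue(url) {
    if (this.tailSize === 0 && this.head.length < this.headSize) {
      this.head.push(url);  // head has room and nothing is queued behind it
    } else {
      const key = `${this.bucketId}:${String(this.nextSeq++).padStart(12, '0')}`;
      this.db.put(key, url);
      this.tailSize++;
    }
  }

  dequeue() {
    if (this.head.length === 0 && this.tailSize > 0) this.refill();
    return this.head.shift() ?? null;
  }

  refill() {
    // Pull the next batch for this bucket from disk, in key (insertion) order
    for (const [key, url] of this.db.scanPrefix(`${this.bucketId}:`, this.headSize)) {
      this.head.push(url);
      this.db.delete(key);
      this.tailSize--;
    }
  }
}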

BFS vs DFS Traversal

The web is a directed graph. How we traverse it matters enormously:

| Aspect            | BFS (Breadth-First)                                      | DFS (Depth-First)                              |
|-------------------|----------------------------------------------------------|------------------------------------------------|
| Discovery pattern | Explores all links at depth d before d+1                 | Dives deep into one path before backtracking   |
| Page importance   | Finds high-authority pages first (homepages, nav pages)  | May get stuck on deep, low-value paths         |
| Politeness        | Naturally spreads requests across many domains           | Hammers one domain before moving on            |
| Spider trap risk  | Low — breadth limits depth exposure                      | High — can get trapped in infinite URL spaces  |
| Memory            | Higher (stores entire frontier at current depth)         | Lower (only current path in stack)             |

BFS is strongly preferred for web crawling because it discovers important pages early, distributes load across domains naturally, and is resistant to spider traps. In practice, crawlers use a priority-weighted BFS where the frontier is a priority queue rather than a strict FIFO.
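
To make the loop concrete, here is a sketch of one priority-weighted BFS crawl cycle wired from the components in this post (extractLinks and sleep are assumed helpers; frontier, fetcher, and dedup stand for the URL Frontier, Fetcher, and Dedup Layer above):

// One crawl cycle: pop, fetch, extract, dedup, re-enqueue (sketch)
async function crawlLoop(frontier, fetcher, dedup) {
  while (true) {
    const url = frontier.pickNext();          // highest-priority eligible URL
    if (!url) { await sleep(100); continue; } // every domain is on cooldown
    const page = await fetcher.fetch(url);
    if (!page) continue;                      // fetch failed; retry is scheduled later
    for (const link of extractLinks(page)) {
      const canonical = canonicalize(link);
      if (!dedup.mightContain(canonical)) {   // Bloom filter membership check
        dedup.add(canonical);
        frontier.add(canonical);              // discovered links join the frontier
      }
    }
  }
}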

Robots.txt Parsing

The Robots Exclusion Protocol (RFC 9309) is the web's social contract between crawlers and site owners. A robots.txt file at the domain root tells crawlers what they may and may not access.

File Format

# Example: https://example.com/robots.txt

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?q=
Allow: /admin/public/
Crawl-delay: 2

User-agent: Googlebot
Disallow: /no-google/
Allow: /

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

Parsing Rules

  1. User-agent matching: Find the most specific User-agent group that matches your crawler's name. If none matches, fall back to User-agent: *.
  2. Allow vs Disallow: For a given path, the longest matching rule wins. Allow: /admin/public/ overrides Disallow: /admin/ for paths under /admin/public/.
  3. Crawl-delay: The number of seconds to wait between successive requests to this domain. Not all crawlers respect this (Google ignores it), but a polite crawler should honor it.
  4. Sitemap: Points to an XML sitemap for URL discovery. Crawlers should fetch and parse sitemaps to find pages that aren't linked from anywhere.
  5. Wildcards: * matches any sequence of characters. $ matches the end of URL. E.g., Disallow: /*.pdf$ blocks all PDF files.
// Robots.txt parser (simplified)
class RobotsParser {
  constructor(robotsTxt, userAgent) {
    this.rules = [];
    this.crawlDelay = 1; // default 1 second
    this.sitemaps = [];
    this.parse(robotsTxt, userAgent);
  }

  parse(text, ua) {
    const mine = { rules: [], crawlDelay: null }; // group naming our crawler
    const star = { rules: [], crawlDelay: null }; // the User-agent: * group
    let group = null;                             // group we're collecting into

    for (let line of text.split('\n')) {
      line = line.split('#')[0].trim(); // strip comments
      if (!line) continue;

      // Split on the FIRST colon only, so "Sitemap: https://..." keeps its URL
      const idx = line.indexOf(':');
      if (idx < 0) continue;
      const directive = line.slice(0, idx).trim().toLowerCase();
      const value = line.slice(idx + 1).trim();

      if (directive === 'user-agent') {
        if (value.toLowerCase() === ua.toLowerCase()) group = mine;
        else if (value === '*')                      group = star;
        else                                         group = null; // another bot's group
      }
      else if (directive === 'sitemap') {
        this.sitemaps.push(value); // Sitemap is global, not per-group
      }
      else if (group) {
        if (directive === 'disallow' && value) {
          group.rules.push({ path: value, allowed: false });
        }
        else if (directive === 'allow' && value) {
          group.rules.push({ path: value, allowed: true });
        }
        else if (directive === 'crawl-delay') {
          group.crawlDelay = parseFloat(value) || 1;
        }
      }
    }

    // Per RFC 9309, a group that names our UA overrides the * group entirely
    const chosen = (mine.rules.length || mine.crawlDelay !== null) ? mine : star;
    this.rules = chosen.rules;
    this.crawlDelay = chosen.crawlDelay ?? 1;
    // Longest matching rule wins, so check longer paths first
    this.rules.sort((a, b) => b.path.length - a.path.length);
  }

  isAllowed(urlPath) {
    for (let rule of this.rules) {
      if (this.matchPath(urlPath, rule.path)) {
        return rule.allowed;
      }
    }
    return true; // no matching rule = allowed
  }

  matchPath(urlPath, pattern) {
    // Convert robots.txt pattern to regex
    let regex = pattern
      .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape special chars
      .replace(/\*/g, '.*')                      // * → .*
      .replace(/\\\$$/g, '$');                    // trailing \$ → $
    return new RegExp('^' + regex).test(urlPath);
  }
}

Caching robots.txt

Fetching robots.txt before every page is wasteful. The crawler caches it per domain with a TTL (typically 24 hours or per the HTTP Cache-Control header). A background thread refreshes expiring entries. If robots.txt returns a 4xx, the entire domain is assumed allowed. If it returns a 5xx, the crawler should back off and retry later.
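
As a sketch, a cache wrapping the RobotsParser above might look like this; the fetch call, TTL constant, and error policy follow the rules just described (Cache-Control handling and background refresh are omitted):

// Per-domain robots.txt cache with a 24h TTL (sketch)
class RobotsCache {
  constructor(userAgent, ttlMs = 24 * 3600 * 1000) {
    this.userAgent = userAgent;
    this.ttlMs = ttlMs;
    this.cache = new Map(); // domain → { parser, expiry }
  }

  async get(domain) {
    const hit = this.cache.get(domain);
    if (hit && hit.expiry > Date.now()) return hit.parser;

    const res = await fetch(`https://${domain}/robots.txt`);
    let parser;
    if (res.ok) {
      parser = new RobotsParser(await res.text(), this.userAgent);
    } else if (res.status >= 400 && res.status < 500) {
      parser = new RobotsParser('', this.userAgent); // 4xx → whole domain allowed
    } else {
      return null; // 5xx → caller should back off and retry later
    }
    this.cache.set(domain, { parser, expiry: Date.now() + this.ttlMs });
    return parser;
  }
}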

Duplicate Detection

Duplicates waste bandwidth, storage, and processing time. There are two kinds: URL duplicates, where many URLs point to the same page, and content duplicates, where different URLs serve identical or near-identical content.

URL Deduplication

The same page can be reached through many different URLs: with or without www, over http or https, with or without a trailing slash, with tracking parameters appended, or with a #fragment attached. For example, https://www.example.com/page/?utm_source=x and https://example.com/page are typically the same document.

URL canonicalization normalizes URLs before dedup:

function canonicalize(url) {
  let u = new URL(url);
  // 1. Lowercase scheme and host
  u.protocol = u.protocol.toLowerCase();
  u.hostname = u.hostname.toLowerCase();
  // 2. Remove default port
  if ((u.protocol === 'http:' && u.port === '80') ||
      (u.protocol === 'https:' && u.port === '443')) {
    u.port = '';
  }
  // 3. Remove fragment
  u.hash = '';
  // 4. Remove tracking parameters
  let strip = ['utm_source','utm_medium','utm_campaign','utm_term','utm_content','fbclid','gclid'];
  strip.forEach(p => u.searchParams.delete(p));
  // 5. Sort remaining params
  u.searchParams.sort();
  // 6. Remove trailing slash (except root)
  if (u.pathname !== '/' && u.pathname.endsWith('/')) {
    u.pathname = u.pathname.slice(0, -1);
  }
  // 7. Decode unnecessary percent-encoding
  u.pathname = decodeURIComponent(u.pathname);
  return u.toString();
}

After canonicalization, we check membership using a Bloom filter:

// Bloom filter for URL dedup
// 10 billion URLs, 1% false positive rate → ~11.5 GB memory
// k = 7 hash functions, m = 96 billion bits

class URLBloomFilter {
  constructor(expectedItems = 10_000_000_000, fpRate = 0.01) {
    this.m = Math.ceil(-expectedItems * Math.log(fpRate) / (Math.log(2) ** 2));
    this.k = Math.round((this.m / expectedItems) * Math.log(2));
    this.bits = new BitArray(this.m);  // ~11.5 GB
  }

  add(url) {
    for (let i = 0; i < this.k; i++) {
      this.bits.set(this.hash(url, i) % this.m);
    }
  }

  mightContain(url) {
    for (let i = 0; i < this.k; i++) {
      if (!this.bits.get(this.hash(url, i) % this.m)) return false;
    }
    return true; // possibly seen (small false positive rate)
  }

  hash(url, seed) {
    return murmurHash3(url, seed); // fast, well-distributed hash
  }
}
Bloom filter trade-off: False positives mean we skip a URL we haven't actually crawled (acceptable — we'll discover it through other paths). False negatives would mean crawling a page twice (Bloom filters guarantee zero false negatives). At 1% FP rate, we miss ~1% of unique URLs — a good trade-off for 11.5 GB vs. 1 TB of full hash storage.

Content Deduplication (SimHash)

Different URLs can serve identical or near-identical content (mirrors, syndicated articles, paginated duplicates). Byte-level hashing (MD5/SHA) catches exact duplicates but misses near-duplicates that differ by ads, timestamps, or minor layout changes.

SimHash (invented by Moses Charikar, used by Google) produces a fingerprint where similar documents produce similar hashes. Two documents are near-duplicates if their SimHash values differ by only a few bits (Hamming distance ≤ 3).

SimHash Algorithm

function simHash(document, hashBits = 64) {
  // Step 1: Extract features (weighted shingles)
  let tokens = tokenize(document);
  let shingles = [];
  for (let i = 0; i <= tokens.length - 3; i++) {
    shingles.push(tokens.slice(i, i + 3).join(' ')); // 3-word shingles
  }

  // Step 2: Initialize a V vector of size hashBits, all zeros
  let V = new Array(hashBits).fill(0);

  // Step 3: For each shingle, compute its hash and update V
  for (let shingle of shingles) {
    let weight = getWeight(shingle); // TF-IDF or frequency
    let h = hash64(shingle);         // 64-bit hash

    for (let i = 0; i < hashBits; i++) {
      if ((h >> BigInt(i)) & 1n) {
        V[i] += weight;   // bit is 1 → add weight
      } else {
        V[i] -= weight;   // bit is 0 → subtract weight
      }
    }
  }

  // Step 4: Convert V to fingerprint
  let fingerprint = 0n;
  for (let i = 0; i < hashBits; i++) {
    if (V[i] > 0) {
      fingerprint |= (1n << BigInt(i));
    }
  }

  return fingerprint;
}

// Hamming distance between two SimHash fingerprints
function hammingDistance(a, b) {
  let xor = a ^ b;
  let count = 0;
  while (xor) {
    count += Number(xor & 1n);
    xor >>= 1n;
  }
  return count;
}

// Near-duplicate if Hamming distance ≤ 3
function isNearDuplicate(fp1, fp2) {
  return hammingDistance(fp1, fp2) <= 3;
}

Efficient SimHash Lookup

Checking a new fingerprint against billions of stored fingerprints for Hamming distance ≤ 3 is expensive. The table-based approach partitions the 64-bit hash into blocks and uses multiple lookup tables:

// Partition 64-bit SimHash into 4 blocks of 16 bits each
// Store in 4 tables, keyed by each block
// If two hashes differ by ≤3 bits total, at least ONE of the
// 4 blocks must be identical (pigeonhole principle)

tables = [HashMap(), HashMap(), HashMap(), HashMap()];

function store(fingerprint) {
  for (let i = 0; i < 4; i++) {
    let block = (fingerprint >> BigInt(i * 16)) & 0xFFFFn;
    tables[i].getOrCreate(block).add(fingerprint);
  }
}

function findNearDuplicates(fingerprint) {
  let candidates = new Set();
  for (let i = 0; i < 4; i++) {
    let block = (fingerprint >> BigInt(i * 16)) & 0xFFFFn;
    for (let stored of tables[i].get(block, [])) {
      if (hammingDistance(fingerprint, stored) <= 3) {
        candidates.add(stored);
      }
    }
  }
  return candidates;
}

MinHash for Jaccard Similarity

MinHash is an alternative for set-similarity detection. It estimates Jaccard similarity between two documents' shingle sets without comparing them directly.

function minHash(shingleSet, numHashes = 200) {
  let signature = new Array(numHashes).fill(Infinity);
  for (let shingle of shingleSet) {
    for (let i = 0; i < numHashes; i++) {
      let h = hashWithSeed(shingle, i);
      signature[i] = Math.min(signature[i], h);
    }
  }
  return signature;
}

// Estimated Jaccard similarity
function jaccardEstimate(sig1, sig2) {
  let matches = 0;
  for (let i = 0; i < sig1.length; i++) {
    if (sig1[i] === sig2[i]) matches++;
  }
  return matches / sig1.length;
}

// Near-duplicate if Jaccard ≥ 0.8
function isNearDup(sig1, sig2) {
  return jaccardEstimate(sig1, sig2) >= 0.8;
}


DNS Resolution Caching

DNS resolution is a hidden bottleneck. Each fetch requires resolving the hostname to an IP address. Standard OS DNS caches are too small for a crawler hitting millions of unique domains.

Crawler DNS Strategy

const dns = require('node:dns').promises; // Node's promise-based resolver

class DNSCache {
  constructor(maxEntries = 1_000_000) {
    this.cache = new LRUCache(maxEntries);
    this.negativeCache = new LRUCache(100_000); // failed lookups
  }

  async resolve(hostname) {
    // Check positive cache
    let cached = this.cache.get(hostname);
    if (cached && cached.expiry > Date.now()) return cached.ip;

    // Check negative cache (don't retry too soon)
    let neg = this.negativeCache.get(hostname);
    if (neg && neg.retryAfter > Date.now()) return null;

    try {
      // With { ttl: true }, resolve4 returns records like { address, ttl }
      let [record] = await dns.resolve4(hostname, { ttl: true });
      this.cache.set(hostname, {
        ip: record.address,
        expiry: Date.now() + record.ttl * 1000
      });
      return record.address;
    } catch (err) {
      this.negativeCache.set(hostname, {
        retryAfter: Date.now() + 24 * 3600 * 1000 // 24h backoff
      });
      return null;
    }
  }
}

Content Parsing & Link Extraction

After fetching a page, the crawler must extract useful information and discover new URLs.

Parsing Pipeline

function processPage(url, html, headers) {
  // 1. Charset detection & decoding
  let charset = detectCharset(headers['content-type'], html);
  let text = decode(html, charset);

  // 2. Parse HTML into DOM
  let dom = parseHTML(text); // using a lenient parser (like htmlparser2)

  // 3. Check for meta robots directives
  let metaRobots = dom.querySelector('meta[name="robots"]');
  if (metaRobots) {
    let content = (metaRobots.getAttribute('content') || '').toLowerCase();
    if (content.includes('noindex')) skipIndexing(url);
    if (content.includes('nofollow')) {
      return { links: [], metadata: null, text: null }; // don't extract links
    }
  }

  // 4. Extract canonical URL
  let canonical = dom.querySelector('link[rel="canonical"]');
  if (canonical && canonical.href !== url) {
    // This page says it's a duplicate of canonical.href
    markAsDuplicate(url, canonical.href);
  }

  // 5. Extract links
  let links = [];
  for (let anchor of dom.querySelectorAll('a[href]')) {
    let href = anchor.getAttribute('href');
    let rel = anchor.getAttribute('rel') || '';

    // Skip nofollow links
    if (rel.includes('nofollow')) continue;

    // Resolve relative URLs
    let absoluteUrl = resolveUrl(url, href);

    // Filter: only http/https, skip mailto:, javascript:, etc.
    if (absoluteUrl.startsWith('http://') || absoluteUrl.startsWith('https://')) {
      links.push(canonicalize(absoluteUrl));
    }
  }

  // 6. Extract metadata
  let metadata = {
    title: dom.querySelector('title')?.textContent,
    description: dom.querySelector('meta[name="description"]')?.content,
    language: dom.documentElement.getAttribute('lang'),
    lastModified: headers['last-modified'],
  };

  return { links, metadata, text: extractText(dom) };
}

Handling Different Content Types

| Content-Type           | Action                                                           |
|------------------------|------------------------------------------------------------------|
| text/html              | Full parsing, link extraction, indexing                          |
| application/pdf        | PDF text extraction (Apache Tika); index but no link extraction  |
| application/xml        | Check if sitemap — parse for URLs if so                          |
| image/*, video/*       | Skip or store metadata only (alt text from referring page)       |
| application/javascript | Skip — no crawlable content                                      |
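
One way to wire the table into the pipeline is a dispatch on the Content-Type header. In the sketch below, extractPdfText, maybeParseSitemap, and storeMetadataOnly are illustrative handler names; processPage is the parser above:

// Dispatch a fetched response by its Content-Type (sketch)
function dispatch(url, headers, body) {
  const type = (headers['content-type'] || '').split(';')[0].trim().toLowerCase();

  if (type === 'text/html')        return processPage(url, body, headers);
  if (type === 'application/pdf')  return extractPdfText(url, body);    // e.g. Apache Tika
  if (type === 'application/xml' ||
      type === 'text/xml')         return maybeParseSitemap(url, body); // sitemap → URLs
  if (type.startsWith('image/') ||
      type.startsWith('video/'))   return storeMetadataOnly(url, headers);
  return null; // application/javascript and everything else: skip
}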

Storage Architecture

URL Database

Tracks every URL the crawler has seen, with metadata:

CREATE TABLE urls (
  url_hash      BIGINT PRIMARY KEY,  -- 64-bit hash of canonical URL
  canonical_url TEXT NOT NULL,
  domain        TEXT NOT NULL,
  last_crawled  TIMESTAMP,
  next_crawl    TIMESTAMP,           -- scheduled recrawl time
  http_status   SMALLINT,
  content_hash  BIGINT,              -- SimHash fingerprint
  priority      SMALLINT DEFAULT 0,
  crawl_count   INT DEFAULT 0,
  change_freq   INTERVAL             -- estimated change frequency
);

-- PostgreSQL defines secondary indexes outside CREATE TABLE
CREATE INDEX idx_next_crawl ON urls (next_crawl);
CREATE INDEX idx_domain     ON urls (domain);

With 10 billion rows, this table is sharded by url_hash across multiple database nodes. Each shard holds ~500M–1B rows.
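
Routing a lookup to the right shard is then a modulo over the 64-bit hash. A sketch, where shards is assumed to be a list of per-node connection handles:

// Pick the shard owning a given url_hash (sketch)
function shardFor(urlHash, shards) {
  // url_hash is 64-bit, so do the modulo in BigInt to avoid precision loss
  const h = BigInt.asUintN(64, BigInt(urlHash));
  return shards[Number(h % BigInt(shards.length))];
}

// e.g.: shardFor(hash64(canonicalUrl), shards)
//         .query('SELECT next_crawl FROM urls WHERE url_hash = $1', [urlHash]);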

Content Store

Raw HTML and extracted text are stored in an object store (S3, HDFS, or Google Cloud Storage):

// Content storage layout
s3://crawler-content/
  ├── raw/          # Raw HTML responses
  │   ├── 2026/04/15/
  │   │   ├── {url_hash_1}.html.gz    # gzipped HTML
  │   │   ├── {url_hash_2}.html.gz
  │   │   └── ...
  ├── parsed/       # Extracted text + metadata (JSON)
  │   ├── 2026/04/15/
  │   │   ├── {url_hash_1}.json.gz
  │   │   └── ...
  └── index/        # Manifest files for batch processing
      ├── 2026-04-15-manifest.jsonl
      └── ...

At 100 KB average page size × 1B pages/month = ~100 TB/month. With gzip compression (~5:1 ratio), actual storage is ~20 TB/month. Lifecycle policies automatically tier older data to cheaper storage (S3 Glacier) after 90 days.
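
The 90-day tiering can be expressed as an S3 lifecycle rule. One way to do it with the AWS SDK for JavaScript v3 (bucket name and rule ID are illustrative):

// Tier raw HTML to Glacier after 90 days (sketch)
import { S3Client, PutBucketLifecycleConfigurationCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: 'crawler-content',
  LifecycleConfiguration: {
    Rules: [{
      ID: 'tier-raw-html-to-glacier',
      Status: 'Enabled',
      Filter: { Prefix: 'raw/' },
      Transitions: [{ Days: 90, StorageClass: 'GLACIER' }],
    }],
  },
}));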

Distributed Crawling

A single machine can fetch perhaps 50–100 pages/second. At 386 pages/second sustained, we need at least 4–8 crawler nodes. At peak (1,930 pages/sec), we need 20–40 nodes.
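
Spelled out as arithmetic (per-node rates assumed as above):

// Back-of-the-envelope node count
const nodesNeeded = (pagesPerSec, perNodeRate) => Math.ceil(pagesPerSec / perNodeRate);

nodesNeeded(386, 100);   // 4  (sustained load, fast nodes)
nodesNeeded(386, 50);    // 8  (sustained load, conservative)
nodesNeeded(1930, 100);  // 20 (peak load, fast nodes)
nodesNeeded(1930, 50);   // 39 (peak load, conservative)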

Partitioning Strategy

Two main approaches:

| Strategy               | How It Works            | Pros                                                             | Cons                                                                       |
|------------------------|-------------------------|------------------------------------------------------------------|----------------------------------------------------------------------------|
| URL hash partition     | node = hash(url) % N    | Even distribution, simple                                        | Same domain spread across nodes — harder to enforce per-domain politeness  |
| Domain-based partition | node = hash(domain) % N | All URLs for a domain on one node — easy politeness enforcement  | Skewed load (wikipedia.org has vastly more URLs than a personal blog)      |

Domain-based partitioning is preferred because it localizes politeness state (robots.txt cache, crawl-delay timers, per-domain queues) to a single node, eliminating distributed coordination.

Coordination Architecture

┌───────────────────────────────────────────────┐
│              Coordinator Service               │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  │
│  │ Assigner  │  │ Monitor   │  │ Rebalancer│  │
│  └───────────┘  └───────────┘  └───────────┘  │
└───────────────────────┬───────────────────────┘
                        │ domain assignments
          ┌─────────────┼─────────────┐
          ▼             ▼             ▼
   ┌────────────┐ ┌────────────┐ ┌────────────┐
   │  Crawler   │ │  Crawler   │ │  Crawler   │
   │  Node 1    │ │  Node 2    │ │  Node 3    │
   │            │ │            │ │            │
   │ domains:   │ │ domains:   │ │ domains:   │
   │ A-F hash   │ │ G-M hash   │ │ N-Z hash   │
   │ range      │ │ range      │ │ range      │
   └────────────┘ └────────────┘ └────────────┘
         │              │              │
         ▼              ▼              ▼
   ┌────────────────────────────────────────┐
   │       Shared URL Database (sharded)    │
   └────────────────────────────────────────┘
   ┌────────────────────────────────────────┐
   │       Content Store (S3 / HDFS)        │
   └────────────────────────────────────────┘

Cross-Node URL Exchange

When Crawler Node 1 finds a link to github.com (owned by Node 3), it must route the URL to the correct node:

// After extracting links from a page
for (const link of extractedLinks) {
  // Simple modulo shown here; a consistent-hash ring avoids mass
  // reassignment when nodes join or leave
  const targetNode = hash(getDomain(link)) % numNodes;
  if (targetNode === myNodeId) {
    localFrontier.add(link);               // local — add directly
  } else {
    outboundBuffers[targetNode].add(link); // remote — buffer
  }
}

// Flush buffers periodically (every 1 second or 1,000 URLs)
for (const node in outboundBuffers) {
  const buf = outboundBuffers[node];
  if (buf.size > 1000 || now() - buf.lastFlush > 1000) {
    sendBatch(node, buf.drain());
    buf.lastFlush = now();
  }
}

Recrawl Strategy

Not all pages change at the same rate. An effective recrawl policy prioritizes pages that are likely to have changed since the last visit.

Change Frequency Estimation

// Adaptive recrawl scheduling (intervals in milliseconds)
const HOUR = 3600 * 1000, DAY = 24 * HOUR;

function scheduleRecrawl(url, crawlHistory) {
  let changes = crawlHistory.filter(c => c.changed).length;
  let total = crawlHistory.length;
  let changeRate = total > 0 ? changes / total : 0.5; // default 50%

  // Exponential backoff: pages that rarely change get crawled less often
  let baseInterval;
  if (changeRate > 0.8) baseInterval = 1 * HOUR;   // very dynamic
  else if (changeRate > 0.5) baseInterval = 6 * HOUR;
  else if (changeRate > 0.2) baseInterval = 1 * DAY;
  else if (changeRate > 0.05) baseInterval = 7 * DAY;
  else baseInterval = 30 * DAY;                       // nearly static

  // Adjust by page importance (PageRank)
  let importance = getPageRank(url);
  let interval = baseInterval / (1 + importance); // important pages crawled sooner

  return Date.now() + interval;
}
Conditional GET optimization: For recrawls, always send If-Modified-Since and If-None-Match headers. If the server returns 304 Not Modified, you save bandwidth and know the content hasn't changed — without downloading the full page.
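
A sketch of such a conditional recrawl fetch; the stored record with saved validators is assumed to come from the URL database above:

// Recrawl using validators from the previous crawl
async function conditionalFetch(url, stored) {
  const headers = {};
  if (stored.lastModified) headers['If-Modified-Since'] = stored.lastModified;
  if (stored.etag)         headers['If-None-Match']     = stored.etag;

  const res = await fetch(url, { headers });
  if (res.status === 304) return { changed: false }; // nothing downloaded

  return {
    changed: true,
    body: await res.text(),
    // save the new validators for the next recrawl
    etag:         res.headers.get('etag'),
    lastModified: res.headers.get('last-modified'),
  };
}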

Handling Spider Traps

Spider traps are pages (intentional or accidental) that generate an infinite number of unique URLs, causing the crawler to waste resources endlessly.

Common Trap Types

  1. Calendar pages: a "next month" link generates new unique URLs forever
  2. Session IDs and tracking parameters: one page reachable under unbounded URL variants
  3. Infinitely nested paths: /a/b/a/b/a/... that the server happily keeps resolving
  4. Auto-generated content: pages that synthesize endless links to more generated pages

Defense Mechanisms

  1. Max depth limit: Don't follow links beyond depth 15–20 from seed URLs. Deep pages are rarely valuable.
  2. Per-domain URL budget: Cap the number of URLs crawled per domain per cycle (e.g., 100,000). Prevents any single domain from consuming the entire crawl budget.
  3. URL pattern detection: If we see many URLs matching the same pattern with only a changing parameter (e.g., /calendar?date=*), sample a few and stop.
  4. Content fingerprinting: If the last 10 pages from a domain all have the same SimHash, the domain is likely a trap. Deprioritize it.
  5. URL length limit: URLs longer than 2,048 characters are suspicious. Skip them.
  6. Blacklisting: Maintain a manual blacklist of known trap domains, updated by operators.
// Spider trap detection
class TrapDetector {
  constructor() {
    this.domainStats = new Map(); // domain → { count, patterns, recentHashes }
  }

  check(url, domain, contentHash) {
    let stats = this.domainStats.get(domain);
    if (!stats) {
      stats = { count: 0, patterns: {}, recentHashes: [] };
      this.domainStats.set(domain, stats); // persist once; mutated in place below
    }
    stats.count++;

    // Check 1: Per-domain URL budget
    if (stats.count > 100_000) {
      return { isTrap: true, reason: 'URL budget exceeded' };
    }

    // Check 2: URL pattern clustering
    // (replace hex runs before digit runs, so hashes aren't mangled first)
    let pattern = url.replace(/[a-f0-9]{8,}/gi, '{HASH}')
                     .replace(/[0-9]+/g, '{N}');
    stats.patterns[pattern] = (stats.patterns[pattern] || 0) + 1;
    if (stats.patterns[pattern] > 1000) {
      return { isTrap: true, reason: `Pattern "${pattern}" seen 1000+ times` };
    }

    // Check 3: Content similarity
    stats.recentHashes.push(contentHash);
    if (stats.recentHashes.length > 10) {
      stats.recentHashes.shift();
      let allSame = stats.recentHashes.every(h =>
        hammingDistance(h, stats.recentHashes[0]) <= 3
      );
      if (allSame) {
        return { isTrap: true, reason: 'Last 10 pages are near-identical' };
      }
    }

    return { isTrap: false };
  }
}

JavaScript Rendering

Modern single-page applications (React, Angular, Vue) render content client-side. A traditional crawler that only reads raw HTML will see an empty <div id="root"></div> with no links or content.

Rendering Approaches

| Approach              | Description                                                               | Latency     | Cost                   |
|-----------------------|---------------------------------------------------------------------------|-------------|------------------------|
| Static HTML only      | Parse raw HTML as-is                                                      | ~10 ms      | Very low               |
| Headless browser      | Chromium (Puppeteer) executes JS, waits for DOM                           | 2–10 s      | Very high (CPU + RAM)  |
| Lightweight JS engine | V8 isolate without full browser (no layout/paint)                         | 500 ms–2 s  | Medium                 |
| Hybrid                | Try static first; if minimal content detected, re-queue for JS rendering  | Varies      | Optimal                |

The hybrid approach is used by Google's crawler (Googlebot). It first processes the raw HTML (fast, cheap). If the page contains framework markers (<div id="root">, <app-root>, noscript tags with content hints), it's re-queued for a separate rendering service that runs headless Chrome at scale.

// Hybrid rendering decision
function needsJSRendering(html) {
  // Check for SPA framework signatures
  let spaMarkers = [
    '<div id="root"></div>',
    '<div id="app"></div>',
    '<app-root>',
    'window.__NEXT_DATA__',
    'window.__NUXT__',
  ];

  let hasMarker = spaMarkers.some(m => html.includes(m));

  // Check if meaningful content exists
  let textContent = stripTags(html).trim();
  let hasContent = textContent.length > 500; // at least 500 chars of text

  // Needs JS rendering if SPA markers present AND no meaningful static content
  return hasMarker && !hasContent;
}

// Rendering service (Puppeteer pool)
async function renderWithJS(url) {
  let browser = await browserPool.acquire();
  try {
    let page = await browser.newPage();
    await page.setRequestInterception(true);
    page.on('request', req => {
      // Block images, fonts, stylesheets — only need HTML + JS
      if (['image','stylesheet','font','media'].includes(req.resourceType())) {
        req.abort();
      } else {
        req.continue();
      }
    });
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 15000 });
    let html = await page.content();
    await page.close();
    return html;
  } finally {
    browserPool.release(browser);
  }
}

Complete Pipeline Summary

┌──────────────────────────────────────────────────────────────────────┐
│                     WEB CRAWLER PIPELINE                             │
│                                                                      │
│  ┌──────┐   ┌───────────┐   ┌────────────┐   ┌──────────────┐       │
│  │ Seed │──▶│   URL     │──▶│ robots.txt │──▶│    DNS       │       │
│  │ URLs │   │ Frontier  │   │  Checker   │   │   Resolver   │       │
│  └──────┘   │           │   └────────────┘   │   (cached)   │       │
│             │ Priority  │                     └──────┬───────┘       │
│             │ + Polite  │                            │               │
│             └─────▲─────┘                            ▼               │
│                   │                          ┌──────────────┐        │
│                   │                          │   Fetcher    │        │
│                   │                          │  (HTTP/1.1   │        │
│                   │                          │   HTTP/2)    │        │
│                   │                          └──────┬───────┘        │
│                   │                                 │                │
│                   │                                 ▼                │
│                   │                          ┌──────────────┐        │
│                   │                          │  JS Render?  │        │
│                   │                          │  (if needed) │        │
│                   │                          └──────┬───────┘        │
│                   │                                 │                │
│                   │                                 ▼                │
│                   │         ┌─────────┐     ┌──────────────┐        │
│                   │         │ Content │◀────│   Content    │        │
│                   │         │  Store  │     │   Parser     │        │
│                   │         │ (S3)    │     └──────┬───────┘        │
│                   │         └─────────┘            │                │
│                   │                                ▼                │
│                   │                         ┌──────────────┐        │
│                   │                         │     URL      │        │
│                   │                         │  Extractor   │        │
│                   │                         └──────┬───────┘        │
│                   │                                │                │
│                   │                                ▼                │
│                   │                         ┌──────────────┐        │
│                   │                         │   Dedup      │        │
│                   │                         │ (Bloom +     │        │
│                   │         new URLs        │  SimHash)    │        │
│                   └─────────────────────────┤              │        │
│                                             └──────────────┘        │
└──────────────────────────────────────────────────────────────────────┘

Key Operational Metrics

  1. Crawl rate: pages fetched per second (target ~386 sustained, ~1,930 peak)
  2. Frontier size: URLs waiting per priority bucket (sustained growth signals a bottleneck)
  3. Dedup hit rate: fraction of extracted URLs already seen by the Bloom filter
  4. Fetch errors: DNS failures, timeouts, and 4xx/5xx responses per domain
  5. 304 rate: fraction of recrawls returning Not Modified (bandwidth saved)
  6. Politeness violations: requests sent before a domain's crawl-delay elapsed (should be zero)

Suggested Technology Stack

| Component            | Technology                                         |
|----------------------|----------------------------------------------------|
| URL Frontier         | RocksDB (disk) + in-memory priority heaps          |
| URL Dedup            | Bloom filter (in-memory, ~12 GB)                   |
| Content Dedup        | SimHash stored in Redis / Cassandra                |
| DNS Cache            | Local unbound + in-process LRU                     |
| Fetcher              | Async HTTP client (Netty / aiohttp / Go net/http)  |
| HTML Parser          | htmlparser2 (JS) / jsoup (Java) / lxml (Python)    |
| JS Rendering         | Puppeteer / Playwright pool (headless Chromium)    |
| Content Store        | Amazon S3 / HDFS                                   |
| URL Database         | Sharded PostgreSQL or Cassandra                    |
| Cross-Node Messaging | Kafka / gRPC streams                               |
| Coordination         | ZooKeeper / etcd for leader election + config      |

In the next post, we'll design a Payment System — handling money with correctness guarantees, idempotency, and distributed transaction management.