Design: Pastebin
Problem Statement
Pastebin is a web service that allows users to store and share plain-text or code snippets via a unique URL. Users paste content into a text area, receive a short link, and anyone with that link can read the content. Pastes can optionally expire, be password-protected, or be made private to the creator.
The core challenge is deceptively simple — accept text, store it, and return a link. But when you're handling 10 million pastes per day with a read-heavy workload, the system must manage massive storage volumes, serve reads with sub-100ms latency, and handle abuse at scale. The design decisions around where to store content (database vs. object storage), how to generate unique keys, how to expire old pastes efficiently, and how to cache hot content are what make this a rich system design problem.
Popular services in this space include Pastebin.com, GitHub Gist, Hastebin, and Ghostbin. They share a common architecture but differ in features like collaboration, syntax highlighting, revision history, and monetization models.
Requirements
Functional Requirements
- Create paste: Users can upload text/code and receive a unique short URL (e.g., pastebin.com/aB3kX9)
- Read paste: Anyone with the URL can view the content. The system renders the text with optional syntax highlighting
- Delete paste: Creators can delete their own pastes. The system also auto-deletes expired pastes
- Custom alias: Users may optionally specify a custom URL slug (e.g., pastebin.com/my-config)
- Syntax highlighting: Support language-based syntax highlighting for at least 50 languages
- Expiration: Pastes can have a TTL — 10 minutes, 1 hour, 1 day, 1 week, 1 month, or never
- Visibility: Public pastes are listed and searchable; unlisted pastes are only accessible via direct URL; private pastes require authentication
- User accounts (optional): Registered users can manage their pastes — view history, delete, edit
Non-Functional Requirements
- High availability: The system must be available 99.9% of the time (≤ 8.76 hours downtime/year)
- Low read latency: P99 read latency under 100ms for cached content, under 300ms for cold reads
- Durability: Once a paste is created, it must not be lost until it expires or is deleted
- Scalability: Handle 10M new pastes/day with a 5:1 read-to-write ratio
- Consistency: Eventual consistency is acceptable — a paste may take a few seconds to propagate to all read replicas
- Content size limit: Max 10 MB per paste (average ~10 KB)
Out of Scope
- Real-time collaborative editing (Google Docs-style)
- Version control / diff between edits
- File uploads (images, binaries)
- Commenting or social features
Capacity Estimation
Back-of-the-envelope calculations set the guardrails for our architecture. Let's work through the numbers methodically.
Traffic
| Metric | Calculation | Value |
|---|---|---|
| New pastes / day | Given | 10 M |
| Writes / second | 10M / 86,400 | ~116 writes/s |
| Read:Write ratio | Given | 5:1 |
| Reads / day | 10M × 5 | 50 M |
| Reads / second | 50M / 86,400 | ~580 reads/s |
At peak (assume 3× average), we need to handle ~350 writes/s and ~1,740 reads/s. These are comfortable numbers for a horizontally-scaled system with caching.
Storage
| Metric | Calculation | Value |
|---|---|---|
| Average paste size | Given | 10 KB |
| Content storage / day | 10M × 10 KB | ~100 GB/day |
| Content storage / year | 100 GB × 365 | ~36.5 TB/year |
| Content storage / 5 years | 36.5 TB × 5 | ~182 TB |
| Total pastes in 5 years | 10M × 365 × 5 | ~18.25 B |
| Metadata per paste | ~500 bytes (ID, user, lang, expiry, timestamps, visibility) | ~9 TB metadata in 5 years |
Bandwidth
| Direction | Calculation | Bandwidth |
|---|---|---|
| Ingress (writes) | 116 writes/s × 10 KB | ~1.16 MB/s |
| Egress (reads) | 580 reads/s × 10 KB | ~5.8 MB/s |
These bandwidth numbers are modest. Even with 3× peak multiplier, we're under 20 MB/s egress — well within the capacity of a single CDN edge node. A CDN will absorb the vast majority of read traffic for popular pastes.
Key Length
We need unique keys for 18.25 billion pastes over 5 years. Using Base62 (a–z, A–Z, 0–9):
- 6 characters → 62⁶ = 56.8 billion combinations
- 7 characters → 62⁷ = 3.52 trillion combinations
- 8 characters → 62⁸ = 218 trillion combinations
A 6-character Base62 key gives us 56.8B combinations — more than 3× the 18.25B pastes we expect. Adding 2 characters (8 total) provides a 12,000× safety margin. We'll use 8 characters for comfortable headroom and negligible collision probability.
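The headroom figures are easy to sanity-check. The short Python sketch below is illustrative only, with the 5-year paste volume hard-coded from the estimate above; it prints the keyspace and safety margin for each candidate length.

```python
# Base62 keyspace vs. expected paste volume over 5 years
ALPHABET = 62
expected_pastes = 10_000_000 * 365 * 5          # ~18.25 billion

for length in (6, 7, 8):
    keyspace = ALPHABET ** length
    print(f"{length} chars: {keyspace:.2e} keys, "
          f"~{keyspace / expected_pastes:,.0f}x headroom")
# 6 chars: ~3x, 7 chars: ~190x, 8 chars: ~12,000x
```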
API Design
We'll expose a RESTful API. Authentication is via API keys for programmatic access or session tokens for web users.
Create Paste
POST /api/v1/pastes
Content-Type: application/json
Authorization: Bearer <api_key> (optional for anonymous pastes)
{
"content": "def hello():\n print('Hello, world!')",
"title": "My Python Snippet", // optional
"language": "python", // optional, for syntax highlighting
"expiration": "1d", // 10m, 1h, 1d, 1w, 1m, never
"visibility": "unlisted", // public, unlisted, private
"custom_alias": "my-snippet", // optional custom URL slug
"password": "s3cret" // optional password protection
}
Response (201 Created):
{
"id": "aB3kX9Qm",
"url": "https://pastebin.com/aB3kX9Qm",
"title": "My Python Snippet",
"language": "python",
"visibility": "unlisted",
"expires_at": "2026-04-16T12:00:00Z",
"created_at": "2026-04-15T12:00:00Z",
"size_bytes": 42
}
Read Paste
GET /api/v1/pastes/{paste_id}
Authorization: Bearer <api_key> (required for private pastes)
Response (200 OK):
{
"id": "aB3kX9Qm",
"title": "My Python Snippet",
"content": "def hello():\n print('Hello, world!')",
"language": "python",
"visibility": "unlisted",
"view_count": 42,
"created_at": "2026-04-15T12:00:00Z",
"expires_at": "2026-04-16T12:00:00Z"
}
// For password-protected pastes:
GET /api/v1/pastes/{paste_id}?password=s3cret
Delete Paste
DELETE /api/v1/pastes/{paste_id}
Authorization: Bearer <api_key>
Response (204 No Content)
// Only the creator can delete. Attempting to delete
// someone else's paste returns 403 Forbidden.
List User Pastes
GET /api/v1/users/me/pastes?page=1&limit=20
Authorization: Bearer <api_key>
Response (200 OK):
{
"pastes": [
{ "id": "aB3kX9Qm", "title": "My Python Snippet", ... },
{ "id": "xY7pR2Ln", "title": "Nginx Config", ... }
],
"total": 147,
"page": 1,
"limit": 20
}
Rate Limiting Headers
Every response includes rate limiting headers:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 57
X-RateLimit-Reset: 1713184800
- Anonymous users: 10 creates/hour, 60 reads/minute
- Registered users: 100 creates/hour, 300 reads/minute
- Premium users: 1000 creates/hour, 3000 reads/minute
Database Design
We separate concerns: metadata in a database, content in object storage. This is the most critical design decision for Pastebin.
Why Not Store Content in the Database?
- Size: 182 TB of content in 5 years — relational databases struggle with this volume, and storage costs are 5–10× higher than S3
- Performance: Large TEXT/BLOB columns degrade query performance and make backups painfully slow
- Scaling: Sharding a database with large blobs is far harder than sharding metadata-only tables
- Cost: S3 costs ~$0.023/GB/month. RDS storage costs ~$0.115/GB/month — a 5× difference
Paste Metadata Schema (SQL)
CREATE TABLE pastes (
id VARCHAR(8) PRIMARY KEY, -- Base62 unique key
title VARCHAR(255),
user_id BIGINT, -- NULL for anonymous pastes
language VARCHAR(50) DEFAULT 'text',
visibility ENUM('public','unlisted','private') DEFAULT 'unlisted',
password_hash VARCHAR(255), -- bcrypt hash if password-protected
content_key VARCHAR(255) NOT NULL, -- S3 object key
size_bytes INT NOT NULL,
view_count BIGINT DEFAULT 0,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
expires_at DATETIME, -- NULL = never expires
deleted_at DATETIME, -- soft delete
INDEX idx_user_id (user_id),
INDEX idx_expires (expires_at),
INDEX idx_visibility_created (visibility, created_at DESC)
);
User Schema
CREATE TABLE users (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
api_key VARCHAR(64) UNIQUE NOT NULL,
tier ENUM('free','premium') DEFAULT 'free',
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
INDEX idx_api_key (api_key)
);
Storage Layout
Each paste's content is stored in object storage (S3) with the key derived from the paste ID:
s3://pastebin-content/{shard}/{paste_id}
Examples:
s3://pastebin-content/aB/aB3kX9Qm → shard by first 2 chars
s3://pastebin-content/xY/xY7pR2Ln
s3://pastebin-content/Z9/Z9mNpQ4w
Sharding the S3 prefix by the first 2 characters prevents hot-prefix issues and distributes requests across partitions. With Base62, this gives us 62² = 3,844 top-level prefixes — excellent distribution.
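A minimal sketch of the key derivation, assuming the bucket name from the examples above; the shard prefix is simply the first two characters of the paste ID.

```python
BUCKET = "pastebin-content"   # bucket name from the examples above

def content_key(paste_id: str) -> str:
    """Derive the S3 object key: shard prefix = first 2 chars of the paste ID."""
    return f"{paste_id[:2]}/{paste_id}"

# content_key("aB3kX9Qm") == "aB/aB3kX9Qm"
# full location: s3://pastebin-content/aB/aB3kX9Qm
```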
Database Choice: SQL vs NoSQL
| Factor | SQL (PostgreSQL / MySQL) | NoSQL (DynamoDB / Cassandra) |
|---|---|---|
| Schema | Fixed schema, migrations required | Flexible schema, easy to evolve |
| Queries | Rich queries, JOINs, aggregations | Key-value lookups, limited queries |
| Write scaling | Vertical then sharding | Horizontal from day one |
| Consistency | Strong (ACID) | Tunable (eventual by default) |
| Verdict | Either works. SQL for admin queries and analytics; NoSQL for pure key-value at extreme scale. We'll use PostgreSQL initially — the metadata is small (~9 TB in 5 years), relational queries are useful, and we can shard later. | |
High-Level Architecture
The system has five major layers:
- CDN layer: CloudFront / Cloudflare caches popular pastes at the edge
- API server layer: Stateless application servers behind a load balancer
- Cache layer: Redis cluster caches hot paste metadata and content
- Metadata store: PostgreSQL with read replicas
- Content store: Amazon S3 (or equivalent object storage) for the actual paste text
┌─────────────┐
│ Clients │
│ (Web / API) │
└──────┬──────┘
│
┌──────▼──────┐
│ CDN │ ◄── Caches GET /paste/{id} responses
│ CloudFront │
└──────┬──────┘
│ cache miss
┌──────▼──────┐
│ Load │
│ Balancer │
└──────┬──────┘
│
┌─────────────┼─────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ API Server │ │ API Server │ │ API Server │
│ (N=3+) │ │ (N=3+) │ │ (N=3+) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
┌───────┼───────────────┼───────────────┼───────┐
│ │ │ │ │
┌──────▼──────┐│ ┌──────▼──────┐ │┌──────▼──────┐
│ Redis ││ │ PostgreSQL │ ││ Amazon │
│ Cluster ││ │ (Primary) │ ││ S3 │
│ (Cache) ││ │ │ ││ (Content) │
└─────────────┘│ └──────┬──────┘ │└─────────────┘
│ ┌──────▼──────┐ │
│ │ Read │ │
│ │ Replicas │ │
│ └─────────────┘ │
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ Key Gen Svc │ │ Cleanup │
│ (pre-gen │ │ Worker │
│ unique IDs)│ │ (expiration)│
└─────────────┘ └─────────────┘
Create Paste Flow
When a user creates a paste, the request flows through multiple components. Let's trace the complete write path:
- Client sends a POST request with paste content, language, expiration, and visibility
- API Server validates the request — checks content size (≤ 10 MB), rate limits, and input sanitization
- Key Generation Service provides a unique 8-character Base62 key (pre-generated from a pool)
- S3 Upload: Content is stored in S3 at key s3://pastebin-content/{shard}/{paste_id}
- Metadata Insert: A row is inserted into PostgreSQL with the paste ID, user ID, language, expiry, S3 key, etc.
- Cache Warm: The paste metadata + content are written to Redis so the first read is fast
- Response: The API returns the paste URL to the client
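To make the write path concrete, here is a minimal sketch of steps 2 through 7 under a few assumptions: key_buffer is the pre-generated key pool described in the Key Generation section, and s3, db, and redis_client are injected boto3, database-cursor, and redis-py clients rather than real configuration.

```python
import gzip

MAX_SIZE = 10 * 1024 * 1024   # 10 MB limit from the requirements

def create_paste(req: dict, key_buffer, s3, db, redis_client) -> dict:
    """Write path sketch: validate -> claim key -> S3 put -> DB insert -> cache warm."""
    content = req["content"].encode("utf-8")
    if len(content) > MAX_SIZE:
        raise ValueError("paste exceeds 10 MB limit")

    paste_id = key_buffer.next_key()               # pre-generated Base62 key
    object_key = f"{paste_id[:2]}/{paste_id}"      # shard by first 2 characters
    body = gzip.compress(content)                  # compress before upload
    s3.put_object(Bucket="pastebin-content", Key=object_key, Body=body)

    db.execute(
        "INSERT INTO pastes (id, user_id, language, visibility, content_key, size_bytes, expires_at) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s)",
        (paste_id, req.get("user_id"), req.get("language", "text"),
         req.get("visibility", "unlisted"), object_key, len(content), req.get("expires_at")))

    redis_client.setex(f"paste:content:{paste_id}", 3600, body)   # cache warm
    return {"id": paste_id, "url": f"https://pastebin.com/{paste_id}"}
```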
Read Paste Flow
Reads are the dominant traffic pattern (5× writes). We optimize the read path with multiple cache layers:
- CDN check: If the paste is popular and public, CloudFront serves it from the edge — no request reaches our servers
- Redis check: On CDN miss, the API server checks Redis for cached content
- Database check: On cache miss, fetch metadata from PostgreSQL (read replica)
- S3 fetch: Use the content_key from metadata to fetch the actual content from S3
- Cache populate: Store the result in Redis (with TTL) for subsequent reads
- Return response: Send the paste content back to the client (CDN may cache it for future requests)
In practice, the CDN and Redis cache will absorb 80–90% of all reads. A Zipfian distribution means a small fraction of pastes receive the majority of views — and those are precisely the ones sitting in cache.
Key Generation
Generating unique, short, URL-safe keys is the same challenge as a URL shortener. Let's evaluate three approaches:
Approach 1: Hash + Truncate
Hash the content (e.g., MD5 or SHA-256) and take the first 8 characters of the Base62-encoded hash.
key = base62_encode(md5(content + timestamp + user_id))[:8]
// Example: "aB3kX9Qm"
- Pros: Simple, deterministic for same content
- Cons: Collisions possible (birthday paradox — ~50% chance of collision after 62⁴ = 14.7M keys with 8 chars). Need collision handling loop
Approach 2: Pre-Generated Key Pool (Recommended)
A dedicated Key Generation Service (KGS) pre-generates unique 8-character Base62 keys and stores them in a separate database. When an API server needs a key, it takes one from the pool.
-- Key Generation Service
-- Pre-generate keys and store them in a two-table system:
CREATE TABLE unused_keys (
key_value VARCHAR(8) PRIMARY KEY
);
CREATE TABLE used_keys (
key_value VARCHAR(8) PRIMARY KEY
);
-- Batch fetch: an API server claims N keys at startup. Locking the rows
-- keeps concurrent servers from receiving the same keys:
SELECT key_value FROM unused_keys LIMIT 1000 FOR UPDATE SKIP LOCKED;
-- In the same transaction, insert the claimed keys into used_keys and delete them
-- from unused_keys; the API server keeps them in an in-memory buffer
- Pros: Zero collisions (keys are unique by construction), very fast (in-memory buffer), no runtime computation
- Cons: Extra service to manage, need to pre-populate keys
Approach 3: Snowflake-style ID
Generate a 64-bit unique ID (timestamp + machine ID + sequence) and Base62-encode it. Produces 11-character strings — longer than ideal for a short URL but guaranteed unique without coordination.
Our Choice: Pre-Generated Key Pool
The KGS approach gives us the best trade-off: short keys (8 chars), zero collision probability, no per-request computation, and clean separation of concerns. Each API server fetches a batch of keys on startup and refills when its buffer runs low.
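On the API-server side, the key pool can be as simple as the sketch below, assuming a hypothetical fetch_batch callable that runs the claim transaction shown above against the KGS database.

```python
import threading
from collections import deque

class KeyBuffer:
    """In-memory buffer of pre-generated paste IDs, refilled from the KGS."""

    def __init__(self, fetch_batch, batch_size: int = 1000, low_watermark: int = 100):
        self.fetch_batch = fetch_batch        # callable(batch_size) -> list of unused keys
        self.batch_size = batch_size
        self.low_watermark = low_watermark
        self.keys = deque()
        self.lock = threading.Lock()

    def next_key(self) -> str:
        with self.lock:
            if len(self.keys) <= self.low_watermark:
                # Refill before the buffer runs dry; the KGS marks these keys as used
                self.keys.extend(self.fetch_batch(self.batch_size))
            return self.keys.popleft()
```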
Handling Custom Aliases
When a user requests a custom alias like pastebin.com/my-config:
- Check if the alias is already taken in the pastes table
- If available, use it as the paste ID (skip the KGS)
- If taken, return HTTP 409 Conflict
- Custom aliases must be 3–30 characters, alphanumeric + hyphens only
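A small sketch of the alias checks above; the exists callable stands in for the lookup against the pastes table, and the regex simply encodes the 3-30 character rule.

```python
import re

ALIAS_RE = re.compile(r"^[A-Za-z0-9-]{3,30}$")   # 3-30 chars, alphanumeric + hyphens

class AliasTaken(Exception):
    """Surfaced to the client as HTTP 409 Conflict."""

def validate_alias(alias: str, exists) -> str:
    # `exists` is any callable that checks the pastes table for the alias
    if not ALIAS_RE.fullmatch(alias):
        raise ValueError("alias must be 3-30 characters, alphanumeric and hyphens only")
    if exists(alias):
        raise AliasTaken(alias)
    return alias
```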
Content Storage
The content store is the heart of Pastebin. Let's go deep on why object storage is the right choice and how to optimize it.
Why S3 / Blob Storage?
| Property | S3 | Database (BLOB column) |
|---|---|---|
| Durability | 99.999999999% (11 nines) | Depends on backup strategy |
| Cost per GB/month | $0.023 | $0.115+ (RDS gp3) |
| Max object size | 5 TB | 4 GB (MySQL LONGBLOB) |
| Throughput | 5,500 GET/s, 3,500 PUT/s per prefix | Limited by connections |
| Backups | Cross-region replication built-in | Manual snapshots |
| CDN integration | CloudFront origin natively | Requires app-layer proxy |
Storage Tiers for Cost Optimization
Not all pastes are accessed equally. We can use S3 lifecycle policies to move cold pastes to cheaper tiers:
// S3 Lifecycle Policy
{
"Rules": [
{
"ID": "Transition-to-IA",
"Status": "Enabled",
"Transitions": [
{ "Days": 30, "StorageClass": "STANDARD_IA" }, // $0.0125/GB
{ "Days": 180, "StorageClass": "GLACIER_IR" } // $0.004/GB
]
}
]
}
This reduces our 5-year storage cost from ~$50K/year (all Standard) to ~$12K/year (tiered). A 75% cost reduction with no impact on active pastes.
Compression
Text compresses extremely well. Using gzip or zstd before storing to S3:
- Average compression ratio for code/text: 3–5×
- 10 KB average paste → ~2.5 KB after compression
- 182 TB → ~45 TB effective storage over 5 years
- The API server compresses before upload and decompresses on read. The extra CPU is negligible compared to the storage savings.
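As an illustration of the compression step, the sketch below gzips paste text before the S3 upload and decompresses it on the read path; the 3-5x ratio quoted above depends on the content, so treat the printed number as an example rather than a guarantee.

```python
import gzip

def pack(content: str) -> bytes:
    """Compress paste text before uploading it to S3."""
    return gzip.compress(content.encode("utf-8"))

def unpack(blob: bytes) -> str:
    """Decompress on the read path (the cache can also store the compressed form)."""
    return gzip.decompress(blob).decode("utf-8")

sample = "def hello():\n    print('Hello, world!')\n" * 100
print(f"{len(sample)} bytes -> {len(pack(sample))} bytes")   # repetitive text compresses well
```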
Caching Strategy
With a 5:1 read:write ratio and Zipfian access patterns (a few pastes get most views), caching is critical.
Cache Layer 1: Redis
A Redis cluster sits between the API servers and the database/S3. We cache two things:
- Paste metadata: key → serialized metadata (small, ~500 bytes)
- Paste content: key → compressed content (avg ~2.5 KB after gzip)
// Redis key scheme
paste:meta:{paste_id} → JSON metadata TTL: 1 hour
paste:content:{paste_id} → gzipped content TTL: 1 hour
// Cache-aside pattern (read path):
content = redis.get("paste:content:" + pasteId)
if content == null:
metadata = db.query("SELECT * FROM pastes WHERE id = ?", pasteId)
content = s3.getObject(metadata.content_key)
redis.setex("paste:content:" + pasteId, 3600, content)
redis.setex("paste:meta:" + pasteId, 3600, metadata)
return content
Cache Sizing
How much Redis memory do we need? Following the 80/20 rule: 20% of pastes account for 80% of reads.
- Daily unique pastes read: ~50M / 5 (assume each hot paste is read 5 times) = ~10M unique pastes
- Cache the top 20%: 2M pastes × (500 bytes metadata + 2.5 KB content) = ~6 GB
- A single Redis node (64 GB) can hold 10× this easily
- We'll run a 3-node Redis cluster for high availability (primary + 2 replicas)
Cache Layer 2: CDN
For public, non-expiring pastes, the CDN (CloudFront/Cloudflare) caches the rendered HTML response at edge nodes:
// CDN cache headers for public pastes
Cache-Control: public, max-age=300, s-maxage=3600
Vary: Accept-Encoding
// CDN cache headers for private/unlisted pastes
Cache-Control: private, no-store
// (CDN will not cache these)
- Public pastes: cached for 1 hour at the CDN edge (s-maxage=3600)
- Unlisted pastes: cached briefly (max-age=300) — accessible by URL but not indexed
- Private pastes: never cached at the CDN (no-store)
Cache Invalidation
When a paste is deleted or updated:
- Delete from Redis: redis.del("paste:meta:" + id, "paste:content:" + id)
- Purge from CDN: cloudfront.createInvalidation("/" + pasteId)
- For expired pastes, the cache TTL naturally evicts them — no active invalidation needed
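A hedged sketch of those two invalidation steps using redis-py and boto3; the distribution ID is a placeholder and real AWS credentials are assumed.

```python
import time
import boto3
import redis

r = redis.Redis()
cloudfront = boto3.client("cloudfront")

def invalidate_paste(paste_id: str, distribution_id: str) -> None:
    """Evict a deleted or updated paste from Redis and the CDN edge caches."""
    r.delete(f"paste:meta:{paste_id}", f"paste:content:{paste_id}")
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": [f"/{paste_id}"]},
            "CallerReference": f"delete-{paste_id}-{int(time.time())}",
        },
    )
```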
Paste Expiration
Expiration is essential to prevent unbounded storage growth. Our 182 TB estimate assumes no expiration — with it, actual storage will be significantly less.
Lazy Expiration (Read-Time Check)
On every read, check if the paste has expired:
metadata = fetchPasteMetadata(pasteId)
if metadata.expires_at != null && metadata.expires_at < now():
return 404 "Paste expired"
// Serve the paste normally
This is simple and catches all expired pastes on access. But it doesn't reclaim storage — expired pastes still sit in S3 and the database until cleaned up.
Active Expiration (Background Cleanup)
A background Cleanup Worker runs periodically (every 5 minutes) to find and delete expired pastes:
-- Find expired pastes in batches
SELECT id, content_key FROM pastes
WHERE expires_at IS NOT NULL
AND expires_at < NOW()
AND deleted_at IS NULL
ORDER BY expires_at ASC
LIMIT 1000;
-- For each expired paste:
-- 1. Delete from S3: s3.deleteObject(content_key)
-- 2. Delete from Redis: redis.del("paste:meta:" + id, "paste:content:" + id)
-- 3. Soft-delete in DB: UPDATE pastes SET deleted_at = NOW() WHERE id = ?
Why Soft Delete?
We use soft deletes (deleted_at timestamp) instead of hard deletes because:
- Audit trail: We can track when and why pastes were removed
- Abuse investigation: Deleted pastes can be reviewed for policy violations
- Undo: Accidental deletions can be reversed within a grace period
- A separate purge job hard-deletes rows older than 30 days from the deleted_at timestamp
TTL in Redis
When caching a paste with an expiration, set the Redis TTL to min(1 hour, time_until_expiry). This ensures the cache never serves stale content past the paste's expiration.
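A small sketch of that TTL rule, assuming expires_at is a timezone-aware datetime read from the paste metadata.

```python
from datetime import datetime, timezone

DEFAULT_TTL = 3600   # 1 hour cache TTL

def cache_ttl(expires_at) -> int:
    """Never let the cache entry outlive the paste itself."""
    if expires_at is None:
        return DEFAULT_TTL
    remaining = int((expires_at - datetime.now(timezone.utc)).total_seconds())
    return max(1, min(DEFAULT_TTL, remaining))

# redis.setex(f"paste:content:{paste_id}", cache_ttl(meta_expires_at), content)
```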
Scaling
Let's scale each component for our target: 10M writes/day, 50M reads/day, with 3× peak bursts.
API Servers
- Stateless — scale horizontally behind a load balancer (ALB/NLB)
- Each server handles ~500 requests/s → need 4–5 servers at peak (1,740 reads + 350 writes = ~2,100 req/s)
- Auto-scaling group: min 3, max 10, scale on CPU (70%) or request count
Database Scaling
PostgreSQL with read replicas handles our workload comfortably:
- Write path: Single primary handles 350 writes/s easily (PostgreSQL can handle 10K+ writes/s)
- Read path: 3–5 read replicas with connection pooling (PgBouncer). Each replica handles ~1,000 reads/s
- Connection pooling: PgBouncer limits active connections to prevent overwhelming the database
Database Sharding (When Needed)
When the metadata table approaches ~1 TB (within the first year or two, depending on how much expiration trims growth), shard by paste ID:
// Shard assignment: hash the paste ID
shard_number = hash(paste_id) % num_shards
// With 16 shards:
// Each shard holds ~1.14B pastes (after 5 years)
// Each shard is ~562 GB — comfortable for a single PostgreSQL instance
// Alternatively, the first 2 characters of the paste ID give a natural shard key:
// map them to their Base62 value, then take the modulo
// shard = base62_value(first_2_chars(paste_id)) % num_shards
S3 Scaling
S3 scales automatically — it's one of the most scalable services in AWS. With our prefix-based sharding (first 2 characters), we distribute load across thousands of partitions. No action needed.
Redis Scaling
- Start with a single Redis primary + 2 replicas (asynchronous replication with automatic failover)
- At scale, use Redis Cluster with 6+ nodes and data sharding across slots
- Key distribution is naturally uniform (Base62 paste IDs)
CDN Scaling
The CDN absorbs the bulk of read traffic. For popular pastes (e.g., shared on social media), the CDN can serve millions of requests/second from edge nodes without any load on our origin servers.
Geographic Distribution
For global users, deploy in multiple regions:
| Region | Components | Purpose |
|---|---|---|
| us-east-1 | Full stack | Primary region |
| eu-west-1 | API + Redis + DB | European users |
| ap-southeast-1 | API + Redis + DB | Asian users |
- S3: Cross-region replication to all regions
- CDN: Global edge network (200+ PoPs)
Abuse Prevention
Pastebin services are notoriously abused for malware distribution, credential dumps, phishing pages, and spam. A robust abuse prevention system is critical.
Rate Limiting
Apply rate limits at multiple levels:
| Level | Limit | Implementation |
|---|---|---|
| IP-based | 10 creates/hour | Redis sliding window counter |
| User-based | 100 creates/hour | Token bucket per API key |
| Global | 50K creates/hour | Circuit breaker at load balancer |
| Read rate | 300 reads/min per IP | CDN-level WAF rule |
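The IP-based limit from the table can be implemented as a Redis sliding-window counter. The sketch below is one way to do it with redis-py, with limit and window taken from the anonymous tier above.

```python
import time
import redis

r = redis.Redis()

def allow_create(ip: str, limit: int = 10, window: int = 3600) -> bool:
    """Sliding-window check: at most `limit` paste creations per `window` seconds per IP."""
    key = f"ratelimit:create:{ip}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window)    # drop events that left the window
    pipe.zadd(key, {str(now): now})                # record this attempt
    pipe.zcard(key)                                # count events still in the window
    pipe.expire(key, window)
    _, _, count, _ = pipe.execute()
    return count <= limit
```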
Spam Detection
- Content hashing: Hash each paste and check against known-spam hashes. Block duplicates of known malicious content
- Link density: Flag pastes with an unusually high ratio of URLs to text — a strong indicator of SEO spam
- Keyword filters: Block or flag pastes containing known phishing patterns, credential formats, or malware signatures
- ML classifier: At scale, train a classifier on reported-spam pastes to auto-flag new ones
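The link-density heuristic from the list above is straightforward to prototype; the threshold below is illustrative and would be tuned against real traffic.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def looks_like_link_spam(content: str, max_ratio: float = 0.5) -> bool:
    """Flag pastes where URLs make up an unusually large share of the text."""
    urls = URL_RE.findall(content)
    if not urls:
        return False
    url_chars = sum(len(u) for u in urls)
    return url_chars / max(len(content), 1) > max_ratio
```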
Content Moderation
// Moderation pipeline
1. User creates paste
2. Paste is stored and available immediately (optimistic)
3. Async: Content is sent to moderation queue
4. Automated checks run:
a. Regex patterns for credentials, SSNs, credit cards
b. URL reputation check (Google Safe Browsing API)
c. Spam classifier score
5. If flagged → paste is hidden, creator notified
6. If score is borderline → human review queue
7. If clean → no action needed
CAPTCHA
Anonymous paste creation requires a CAPTCHA (reCAPTCHA v3 or hCaptcha) to prevent automated spam bots. Authenticated users with good history bypass CAPTCHA.
Reporting
Every paste page includes a "Report Abuse" button. Reports go into a moderation queue with priority based on the number of unique reporters and the paste's view count.
Additional Considerations
Syntax Highlighting
Syntax highlighting is done client-side using a library like Prism.js or highlight.js. The server stores raw text — the browser renders the highlighted version. This keeps the server stateless and avoids storing rendered HTML.
// Client-side rendering (HTML-escape the raw content before inserting it into the page)
<pre><code class="language-python">
{{ raw paste content }}
</code></pre>
<script src="prism.js"></script>
Analytics
Track per-paste view counts without hitting the database on every read:
- Increment view count in Redis: INCR paste:views:{id}
- Flush to database in batches (every 5 minutes or every 100 increments)
- This decouples the hot read path from database writes
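A sketch of that counter-and-flush pattern with redis-py; db stands in for an injected database cursor, and the read-and-reset keeps views from being counted twice.

```python
import redis

r = redis.Redis()

def record_view(paste_id: str) -> None:
    """Hot path: bump a Redis counter, no database write per view."""
    r.incr(f"paste:views:{paste_id}")

def flush_view_counts(db) -> None:
    """Background job (e.g. every 5 minutes): move counters into the database."""
    for key in r.scan_iter(match="paste:views:*"):
        paste_id = key.decode().rsplit(":", 1)[-1]
        count = int(r.getset(key, 0) or 0)   # read-and-reset so views aren't double-counted
        if count:
            db.execute(
                "UPDATE pastes SET view_count = view_count + %s WHERE id = %s",
                (count, paste_id),
            )
```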
Search
For public paste search, use Elasticsearch:
- Index paste metadata (title, language, tags) and optionally the first 1 KB of content
- Full-text search with language-aware analyzers
- Only public pastes are indexed — unlisted and private are excluded
Monitoring & Alerting
Key metrics to monitor:
- Create latency P99: Alert if > 500ms (target: < 200ms)
- Read latency P99: Alert if > 300ms (target: < 100ms for cached)
- Cache hit rate: Alert if < 70% (target: > 85%)
- S3 error rate: Alert if > 0.1%
- Expired pastes queue depth: Alert if cleanup worker falls behind by > 100K pastes
- Abuse reports/hour: Alert if spike > 3× normal rate
Complete Architecture
Putting it all together, here's the complete architecture with all components and their interactions:
┌──────────────────────────────────────────────┐
│ CLIENTS │
│ Web Browser │ CLI Tool │ API Client │
└──────────┬─────────────────┬───────────────┘
│ │
┌──────────▼─────────────────▼───────────────┐
│ CDN (CloudFront) │
│ • Caches public paste responses │
│ • WAF rules for rate limiting │
│ • SSL termination │
│ • 200+ global edge locations │
└──────────────────┬─────────────────────────┘
│ cache miss
┌──────────────────▼─────────────────────────┐
│ Load Balancer (ALB) │
│ • Health checks on API servers │
│ • SSL offloading │
│ • Sticky sessions (optional) │
└────┬──────────┬──────────┬─────────────────┘
│ │ │
┌───────▼──┐ ┌─────▼────┐ ┌──▼───────┐
│ API Srv 1│ │ API Srv 2│ │ API Srv N│
│(stateless│ │(stateless│ │(stateless│
└───┬──┬───┘ └──┬──┬────┘ └──┬──┬────┘
│ │ │ │ │ │
┌─────────────┘ │ │ │ │ └────────────┐
│ │ │ │ │ │
┌──────▼──────┐ ┌──────▼────────▼──▼─────────▼──────┐ ┌─────▼──────┐
│ Key Gen Svc │ │ Redis Cluster │ │ Amazon S3 │
│ │ │ • paste:meta:{id} │ │ │
│ Pre-gen pool│ │ • paste:content:{id} │ │ /shard/id │
│ of 8-char │ │ • paste:views:{id} │ │ gzipped │
│ Base62 keys │ │ • ratelimit:{ip} │ │ content │
└─────────────┘ └──────────────┬─────────────────────┘ └────────────┘
│ cache miss
┌──────────────▼─────────────────────┐
│ PostgreSQL (Primary) │
│ • pastes table │
│ • users table │
└──────────┬─────────────────────────┘
┌──────────▼─────────────────────────┐
│ Read Replicas (3-5) │
│ • Serve all read queries │
│ • Async replication from primary │
└────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ BACKGROUND WORKERS │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │
│ │ Cleanup │ │ View Count │ │ Content │ │ Analytics │ │
│ │ Worker │ │ Flusher │ │ Moderator │ │ Aggregator │ │
│ │ (expire │ │ (Redis → │ │ (spam/abuse │ │ (metrics │ │
│ │ pastes) │ │ Postgres) │ │ detection) │ │ pipeline) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Data Flow Summary
| Operation | Components Involved | Latency Target |
|---|---|---|
| Create paste | API → KGS → S3 → PostgreSQL → Redis | P99 < 200ms |
| Read (CDN hit) | CDN edge node only | P99 < 20ms |
| Read (cache hit) | API → Redis | P99 < 50ms |
| Read (cache miss) | API → Redis (miss) → PostgreSQL → S3 → Redis | P99 < 300ms |
| Delete paste | API → PostgreSQL → Redis → CDN purge | P99 < 100ms |
Trade-Offs & Alternatives
SQL vs DynamoDB for Metadata
We chose PostgreSQL for its rich query capabilities (admin dashboards, analytics, complex filters). DynamoDB would give us automatic sharding and single-digit-ms latency but sacrifices ad-hoc queries. At our scale (116 writes/s average), PostgreSQL is more than adequate.
S3 vs Cassandra for Content
Some designs store paste content in Cassandra (key → blob). This gives sub-10ms reads (faster than S3's 50–100ms) but at much higher operational cost. With Redis caching absorbing 85%+ of reads, the S3 latency is invisible to most users.
Separate Content vs Inline
For very small pastes (< 1 KB), storing content directly in the database row (inline) would be faster and simpler. A hybrid approach — inline for < 1 KB, S3 for everything else — optimizes both cases:
if paste.size < 1024:
metadata.inline_content = paste.content // store in DB
metadata.content_key = null
else:
s3.putObject(content_key, paste.content) // store in S3
metadata.inline_content = null
metadata.content_key = content_key
Encryption at Rest
All S3 content should be encrypted (SSE-S3 or SSE-KMS). For password-protected pastes, consider additional application-layer encryption using the user's password as a key derivation input (AES-256-GCM).
Eventual Consistency Implications
With read replicas and cache layers, a newly created paste might not be immediately visible:
- Mitigation 1: After creating a paste, return the full paste data in the response — the client can display it immediately
- Mitigation 2: Write to Redis on create (cache warming) — subsequent reads from any server hit the cache
- Mitigation 3: For the creator's own requests, read from the primary database (read-your-writes consistency)
Summary
| Component | Technology | Purpose |
|---|---|---|
| CDN | CloudFront / Cloudflare | Cache public pastes at the edge |
| Load Balancer | AWS ALB | Distribute traffic to API servers |
| API Servers | Go / Java / Node.js | Stateless request handling |
| Key Generation | Custom service + DB | Pre-generated unique 8-char keys |
| Cache | Redis Cluster | Hot paste metadata + content |
| Metadata DB | PostgreSQL + replicas | Paste metadata, user accounts |
| Content Store | Amazon S3 | Paste content (gzipped) |
| Background Workers | Custom services | Expiration cleanup, moderation, analytics |