SQL Database Internals
How SQL Databases Work Under the Hood
Every time you execute a SELECT * FROM orders WHERE user_id = 42, an extraordinary chain of events unfolds inside the database engine. Understanding this pipeline—parser, optimizer, executor, storage engine—is the difference between a system that handles 50 queries/second and one that handles 50,000. For system design, database internals are not academic trivia; they are the foundation of every scaling decision you will make.
The Query Processing Pipeline
Every SQL statement passes through four distinct phases before a single byte of data is returned:
- Parser & Lexer — The raw SQL string is tokenized and parsed into an Abstract Syntax Tree (AST). Syntax errors are caught here. PostgreSQL uses a Bison-generated parser (`gram.y`, over 20,000 lines); MySQL uses a similar yacc-based parser.
- Analyzer / Rewriter — The AST is validated against the catalog (do these tables exist? do column types match?). Views are expanded, CTEs are inlined or materialized, and rule-based rewrites are applied. PostgreSQL's rewriter handles things like `CREATE RULE` and updatable views.
- Query Optimizer — The most complex phase. The optimizer generates multiple candidate execution plans, estimates the cost of each using statistics, and picks the cheapest. PostgreSQL's optimizer is cost-based and considers factors like sequential I/O cost (`seq_page_cost = 1.0`), random I/O cost (`random_page_cost = 4.0`), CPU tuple processing cost, and join ordering. For a query joining N tables there are N! possible join orderings, so the optimizer uses exhaustive dynamic programming below `geqo_threshold` (default 12 tables) and genetic algorithms (GEQO) at or above it.
- Executor & Storage Engine — The chosen plan is executed node by node in a pull-based (Volcano) iterator model. Each node (SeqScan, IndexScan, HashJoin, Sort, etc.) implements `Open()`, `Next()`, and `Close()`. The executor calls `Next()` on the root node, which recursively calls its children, propagating down to the storage engine that actually reads pages from disk (or the buffer pool).
-- PostgreSQL: Trace the full lifecycle of a query
SET log_statement = 'all';
SET log_parser_stats = on;
SET log_planner_stats = on;
SET log_executor_stats = on;
-- Now run your query and check the server log for timing at each phase
SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped';
PostgreSQL vs MySQL Architecture
Though both are relational, their architectures differ significantly:
| Component | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| Process Model | Multi-process: one OS process per connection (forked from postmaster) | Multi-threaded: one thread per connection within a single process |
| Storage Engine | Single integrated engine (heap storage) | Pluggable: InnoDB (default), MyISAM, Memory, etc. |
| MVCC Strategy | Tuple versioning in heap (xmin/xmax) | Undo log for old row versions |
| Default Page Size | 8 KB | 16 KB |
| WAL Name | WAL (Write-Ahead Log) | Redo Log + Binary Log |
| Optimizer | Cost-based, supports hash joins, merge joins, parallel query | Cost-based, historically weaker (no hash joins until 8.0.18) |
| Replication | Streaming replication (WAL shipping) | Binary log replication (row-based, statement-based, or mixed) |
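A quick way to observe the process-model difference from a client session; both statements are standard, though the exact output depends on your version and workload:

-- PostgreSQL: one backend OS process per connection
SELECT pid, backend_type, state FROM pg_stat_activity;

-- MySQL: one thread per connection inside the single mysqld process
SHOW PROCESSLIST;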
B+ Tree Indexes
The B+ tree is the single most important data structure in relational databases. Every primary key, every unique constraint, every index you create—it's almost certainly a B+ tree. Understanding this structure is understanding why some queries are instant and others bring your database to its knees.
Anatomy of a B+ Tree
A B+ tree is a self-balancing, sorted tree structure optimized for systems that read and write large blocks of data (like disk pages):
- Internal nodes contain only keys and pointers to child nodes. They act as a routing directory. A typical internal node in PostgreSQL (8KB page, 8-byte integer keys) can hold ~500 keys, meaning a single node can route to 500+ children.
- Leaf nodes contain keys and either the actual row data (clustered index) or pointers to the heap (non-clustered index). All leaf nodes are at the same depth, guaranteeing O(log_B N) lookup where B is the branching factor.
- Leaf-to-leaf pointers — All leaf nodes are linked in a doubly-linked list. This makes range scans (`WHERE price BETWEEN 10 AND 50`) extremely efficient: find the start leaf, then follow the chain, as the sketch below shows.
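A minimal sketch of the leaf-chain walk, assuming an index on orders (total) from this section's schema (the name idx_total and the plan text are illustrative, not verbatim output):

-- A range predicate served by one descent plus a leaf-chain walk
CREATE INDEX idx_total ON orders (total);

EXPLAIN SELECT id, total FROM orders WHERE total BETWEEN 10 AND 50;
-- → Index Scan using idx_total on orders
--     Index Cond: ((total >= 10) AND (total <= 50))
-- One root-to-leaf descent finds total = 10; the linked leaves supply
-- the rest of the range without touching internal nodes again.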
• Level 0 (root): 1 page = 500 keys
• Level 1: 500 pages = 250,000 keys
• Level 2: 250,000 pages = 125,000,000 keys
• Level 3: 125,000,000 pages = 62,500,000,000 keys
A table with 125 million rows requires only 3 disk I/Os to find any row by index. The root and level-1 nodes typically stay in the buffer pool (RAM), so in practice it's 1–2 disk reads. This is why B+ trees dominate: they minimize the one thing that's slow—disk I/O.
▶ Interactive: B+ Tree Lookup — step through a B+ tree search for key 42, watching the search navigate from root to leaf and counting disk I/Os at each level.
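You can check a real index's depth directly. A sketch using the pageinspect contrib extension (the index name idx_user_id matches the one created later in this section):

-- Requires the pageinspect extension shipped with PostgreSQL contrib
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- 'level' is the height of the root above the leaves:
-- level = 2 means root → internal → leaf, i.e. 3 page reads per lookup
SELECT root, level FROM bt_metap('idx_user_id');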
Clustered vs Non-Clustered Indexes
| Property | Clustered Index | Non-Clustered (Secondary) Index |
|---|---|---|
| Leaf contains | The actual row data | A pointer back to the clustered index key (InnoDB) or heap tuple ID (PostgreSQL) |
| Physical order | Table rows are physically stored in index order | Separate structure; table rows remain in insertion order |
| Count per table | Exactly one (MySQL InnoDB always has one; PostgreSQL can use CLUSTER but doesn't maintain it) | Unlimited |
| Range scan cost | Sequential I/O (fast)—pages are physically adjacent | Random I/O (slow)—each row pointer may reference a different heap page |
| Lookup cost | Single B+ tree traversal | B+ tree traversal + "bookmark lookup" to fetch the full row |
-- MySQL InnoDB: The PRIMARY KEY IS the clustered index
CREATE TABLE orders (
id BIGINT PRIMARY KEY, -- Clustered index (rows stored in id order)
user_id BIGINT NOT NULL,
status VARCHAR(20),
total DECIMAL(10,2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_user_id (user_id), -- Secondary (non-clustered)
INDEX idx_status_created (status, created_at) -- Composite secondary
);
-- PostgreSQL: Heap storage by default, no persistent clustering
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
status VARCHAR(20),
total NUMERIC(10,2),
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_user_id ON orders (user_id);
CREATE INDEX idx_status_created ON orders (status, created_at);
-- One-time physical reorder (not maintained on inserts!)
CLUSTER orders USING idx_user_id;
Composite Indexes & The Leftmost Prefix Rule
A composite index on (a, b, c) is sorted first by a, then by b within each a group, then by c. This means the index can efficiently serve queries that filter on:
- `WHERE a = ?` — ✓ Uses the index
- `WHERE a = ? AND b = ?` — ✓ Uses the index
- `WHERE a = ? AND b = ? AND c = ?` — ✓ Full index usage
- `WHERE a = ? AND c = ?` — ⚠ Only the `a` part is used; `c` requires a filter afterwards
- `WHERE b = ?` — ✗ Cannot use the index at all (skips the leftmost prefix)
- `WHERE b = ? AND c = ?` — ✗ Same—no leftmost prefix
-- This composite index:
CREATE INDEX idx_composite ON orders (status, created_at, total);
-- ✓ Can serve:
EXPLAIN SELECT * FROM orders WHERE status = 'shipped' AND created_at > '2025-01-01';
-- → Index Range Scan on idx_composite
-- ✗ Cannot efficiently serve:
EXPLAIN SELECT * FROM orders WHERE created_at > '2025-01-01';
-- → Sequential Scan (status is the leftmost prefix, and it's missing)
Index-Only Scans (Covering Indexes)
If all columns referenced in a query (SELECT list, WHERE, ORDER BY) are contained in the index, the database can answer the query entirely from the index without ever touching the table heap. This eliminates the "bookmark lookup" and can be 10–100x faster for certain queries.
-- PostgreSQL: INCLUDE clause for covering indexes (v11+)
CREATE INDEX idx_covering ON orders (user_id) INCLUDE (status, total);
-- This query is now index-only:
EXPLAIN ANALYZE SELECT status, total FROM orders WHERE user_id = 42;
-- → Index Only Scan using idx_covering on orders
-- Heap Fetches: 0 ← No heap access needed!
-- MySQL: Covering index via composite
CREATE INDEX idx_covering ON orders (user_id, status, total);
-- Same query can use "Using index" optimization (visible in EXPLAIN Extra column)
VACUUM regularly (or rely on autovacuum) to keep index-only scans fast. Check Heap Fetches in EXPLAIN ANALYZE—if it's high, your visibility map is stale.
Buffer Pool / Page Cache
The buffer pool is the single most important performance component of any SQL database. It is a region of shared memory that caches data pages read from disk, sitting between the executor and the disk. Almost every query optimization ultimately comes down to one thing: keeping the data you need in the buffer pool.
Pages: The Unit of I/O
Databases do not read individual rows from disk. They read pages (also called blocks)—fixed-size chunks that are the smallest unit of I/O:
A single 8KB PostgreSQL page holding orders rows (each ~200 bytes) contains approximately 40 rows. Reading one row means reading 40 rows—this is why sequential access patterns (reading adjacent rows) are so much faster than random access. The buffer pool exploits this: once a page is read for one row, the other 39 rows are "free."
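You can check the real rows-per-page ratio for a table from the planner's own statistics; a small sketch (run ANALYZE first so reltuples and relpages are fresh):

-- Average rows per 8KB page, straight from pg_class statistics
SELECT relname,
       relpages,
       reltuples,
       ROUND(reltuples::numeric / NULLIF(relpages, 0), 1) AS rows_per_page
FROM pg_class
WHERE relname = 'orders';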
How the Buffer Pool Works
- Page request — The executor needs page #1234 of the `orders` table.
- Buffer pool lookup — A hash table maps (table OID, page number) → buffer pool slot. If found → buffer hit (instant, ~100 ns). If not → buffer miss.
- Eviction — If the pool is full, an eviction algorithm selects a victim page. PostgreSQL uses a clock-sweep algorithm (a variant of LRU with usage counters). MySQL InnoDB uses a modified LRU with a "young" and "old" sublist (the midpoint insertion strategy) to prevent full-table scans from flushing the entire cache.
- Disk read — The page is read from the OS file into the freed buffer slot.
- Pin & return — The page is "pinned" (can't be evicted) while the executor uses it, then "unpinned" when done.
-- PostgreSQL: Check buffer pool (shared_buffers) hit rate
SELECT
datname,
blks_hit,
blks_read,
ROUND(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0) * 100, 2) AS hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();
-- Target: > 99% hit ratio. Below 95% = serious problem.
-- Example output:
-- datname | blks_hit | blks_read | hit_ratio_pct
-- ----------+-----------+-----------+---------------
-- mydb | 987654321 | 1234567 | 99.87
-- PostgreSQL: Configure buffer pool size
-- postgresql.conf
-- shared_buffers = 8GB # 25% of total RAM as starting point
-- effective_cache_size = 24GB # 75% of RAM (tells optimizer about OS cache too)
-- MySQL InnoDB: Check buffer pool stats
SHOW ENGINE INNODB STATUS\G
-- Or query the metrics:
SELECT
pool_id,
pool_size * 16 / 1024 AS pool_size_mb,
free_buffers,
database_pages,
old_database_pages,
modified_db_pages AS dirty_pages,
ROUND((1 - innodb_buffer_pool_reads / innodb_buffer_pool_read_requests) * 100, 2) AS hit_ratio
FROM information_schema.innodb_buffer_pool_stats,
(SELECT variable_value AS innodb_buffer_pool_reads FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_reads') r,
(SELECT variable_value AS innodb_buffer_pool_read_requests FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_read_requests') rr;
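To see which relations actually occupy PostgreSQL's shared buffers (not just the aggregate hit ratio), the pg_buffercache contrib extension exposes the pool's contents; a sketch (for brevity it ignores buffers belonging to other databases):

-- PostgreSQL: which relations occupy the buffer pool right now?
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;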
Dirty Pages and Flushing
When a write operation modifies a page in the buffer pool, that page becomes dirty—the in-memory version differs from the on-disk version. Dirty pages must eventually be flushed (written) to disk. The critical question is when:
- Background flushing — PostgreSQL's `bgwriter` process continuously writes dirty pages to disk in the background, smoothing out I/O spikes. MySQL's `page_cleaner` threads do the same.
- Checkpoint flushing — Periodically (or when the WAL reaches a size threshold), a checkpoint writes all dirty pages to disk. This is the most I/O-intensive operation and can cause latency spikes if misconfigured.
- Eviction flushing — If the only available victim page is dirty, it must be flushed before the slot can be reused. This is the worst case and causes synchronous I/O stalls.
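You can see which of these three paths is doing the flushing via pg_stat_bgwriter (in PostgreSQL 17 several of these counters moved to pg_stat_checkpointer, so this sketch assumes v16 or earlier):

-- Who is writing dirty pages? (PostgreSQL ≤ 16)
SELECT checkpoints_timed,   -- checkpoints triggered by checkpoint_timeout (good)
       checkpoints_req,     -- checkpoints forced by WAL volume (bad if dominant)
       buffers_checkpoint,  -- pages flushed by checkpoints
       buffers_clean,       -- pages flushed by the background writer
       buffers_backend      -- pages flushed by backends at eviction (worst case)
FROM pg_stat_bgwriter;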
• PostgreSQL: Set `shared_buffers` to 25% of total RAM (the OS filesystem cache handles the rest). Set `effective_cache_size` to 75% of RAM.
• MySQL InnoDB: Set `innodb_buffer_pool_size` to 70–80% of total RAM (InnoDB manages its own I/O, bypassing the OS cache with O_DIRECT).
• On a 64GB server: PG → `shared_buffers = 16GB`, MySQL → `innodb_buffer_pool_size = 48GB`.
Write-Ahead Log (WAL)
The Write-Ahead Log is the mechanism that makes databases durable—the D in ACID. The rule is brutally simple: before any change is applied to data pages, a description of that change must first be written to the WAL and flushed to stable storage. This single rule is why your data survives power outages, kernel panics, and mid-write crashes.
Why WAL Exists: The Problem It Solves
Without WAL, a crash during a write could leave the database in a half-modified, inconsistent state. Consider updating an account balance from $100 to $50 and transferring $50 to another account. If the system crashes after the first write but before the second, $50 has vanished. WAL prevents this by recording the entire transaction atomically in a sequential log before touching any data pages.
How WAL Works: Step by Step
- Transaction begins — `BEGIN; UPDATE accounts SET balance = balance - 50 WHERE id = 1;`
- Generate WAL record — The database creates a log record describing the change: "Page 42, offset 128, old value = 100, new value = 50."
- Write WAL to disk — The record is appended to the WAL file (`pg_wal/` in PostgreSQL, the redo log in MySQL). This is a sequential write—appending to the end of a file—which is extremely fast even on HDDs (~100 MB/s).
- Modify buffer pool page — The data page in memory is updated. It becomes a dirty page.
- COMMIT → fsync WAL — On `COMMIT`, the WAL buffer is flushed to disk with `fsync()`. Once this returns, the transaction is durable. The data pages may still only exist in memory.
- Acknowledge client — "COMMIT OK" is sent back. The client can be confident the data is safe.
- Later: Checkpoint — Eventually, the dirty data pages are flushed to their actual locations on disk (random writes). Once a checkpoint completes, the WAL records before that point can be recycled.
▶ Interactive: WAL Write Flow — step through the write path: client write → WAL → buffer pool → ack → checkpoint → crash recovery.
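PostgreSQL exposes WAL positions as LSNs (log sequence numbers), so you can measure WAL volume directly; a small sketch using built-in functions (v10+ names):

-- Current WAL insert position and the segment file that holds it
SELECT pg_current_wal_lsn() AS current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) AS current_segment;

-- Capture an LSN, run your workload, then diff to get bytes of WAL written
-- ('0/3000000' is an illustrative earlier position)
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000000') AS wal_bytes;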
Sequential vs Random I/O: Why WAL Is Fast
| I/O Pattern | HDD | SSD | NVMe |
|---|---|---|---|
| Sequential write (WAL) | 100–200 MB/s | 400–600 MB/s | 2–5 GB/s |
| Random write (data pages) | 1–2 MB/s | 200–400 MB/s | 1–3 GB/s |
| Speed ratio | ~100x faster | ~2x faster | ~2x faster |
On HDDs, WAL provides a 100x performance improvement because sequential writes avoid seek time. On modern NVMe SSDs, the gap is smaller, but WAL still provides the crucial durability guarantee—you can't safely skip it.
Checkpointing
Checkpointing writes all dirty pages to disk and records "everything up to WAL position X is safely on disk." This serves two purposes:
- Limits recovery time — After a crash, the database only needs to replay WAL records after the last checkpoint. Without checkpoints, recovery could mean replaying days of WAL.
- Reclaims WAL space — WAL segments before the checkpoint are no longer needed for recovery and can be recycled.
-- PostgreSQL: Checkpoint tuning
checkpoint_timeout = 15min -- Time between checkpoints (default 5min, increase for write-heavy)
max_wal_size = 4GB -- WAL size that triggers a checkpoint
checkpoint_completion_target = 0.9 -- Spread I/O over 90% of checkpoint interval
-- MySQL InnoDB: Redo log tuning
innodb_log_file_size = 2G -- Size of each redo log file
innodb_log_files_in_group = 2 -- Number of redo log files (total = 4GB)
innodb_flush_log_at_trx_commit = 1 -- 1 = flush on every commit (durable)
-- 2 = flush every second (faster, risk of 1s data loss)
-- 0 = write to log buffer only (fastest, risky)
WAL Archiving & Replication
WAL is not just for crash recovery. It is the foundation of database replication:
- PostgreSQL streaming replication — The primary streams WAL records to standby servers in near-real-time. The standby replays the WAL to maintain an identical copy. Synchronous replication waits for the standby to confirm before committing.
- Point-in-time recovery (PITR) — By archiving WAL segments, you can restore a database to any point in time. Take a base backup, then replay WAL up to the desired timestamp.
- MySQL binary log — MySQL uses a separate binary log (binlog) for replication, distinct from the InnoDB redo log. The binlog records logical changes (row images or SQL statements) and is shipped to replicas.
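On a PostgreSQL primary you can watch WAL shipping in flight; a sketch using pg_stat_replication (column names as of v10+):

-- How far behind is each standby, in bytes of WAL?
SELECT client_addr,
       state,
       sent_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;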
MVCC (Multi-Version Concurrency Control)
MVCC is the mechanism that allows multiple transactions to read and write data concurrently without blocking each other. Instead of locking rows during reads, MVCC gives each transaction a consistent snapshot of the database at the moment it started. Readers never block writers. Writers never block readers. This is the secret to high concurrency in modern SQL databases.
The Core Idea
Without MVCC, a SELECT query on a row being updated by another transaction would have to either (a) block and wait for the write to complete, or (b) read uncommitted ("dirty") data. MVCC provides a third option: read the old version of the row that was valid at the start of your transaction. Each row can have multiple versions simultaneously, each visible to different transactions based on their start time.
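A minimal two-session sketch of this behavior in PostgreSQL, reusing this section's orders table (REPEATABLE READ pins the snapshot for the whole transaction; under the default READ COMMITTED each statement would get a fresh snapshot):

-- Session A
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT status FROM orders WHERE id = 42;   -- → 'pending'

-- Session B (a second connection), meanwhile:
UPDATE orders SET status = 'shipped' WHERE id = 42;  -- commits; does not block A

-- Session A again: still reads its snapshot, no waiting
SELECT status FROM orders WHERE id = 42;   -- → 'pending'
COMMIT;
SELECT status FROM orders WHERE id = 42;   -- → 'shipped'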
PostgreSQL: Tuple Versioning (xmin / xmax)
PostgreSQL stores all row versions directly in the main table heap. Every tuple has two hidden system columns:
- `xmin` — The transaction ID (XID) that created this row version.
- `xmax` — The transaction ID that deleted or updated this row version (0 if the row is still "live").
-- See the hidden system columns
SELECT xmin, xmax, ctid, * FROM orders WHERE id = 42;
-- xmin | xmax | ctid | id | user_id | status | total
-- -------+------+-------+----+---------+---------+-------
-- 12345 | 0 | (0,1) | 42 | 100 | pending | 99.99
-- Now another transaction updates this row:
-- Transaction B: UPDATE orders SET status = 'shipped' WHERE id = 42;
-- This creates a NEW tuple with xmin = B's XID,
-- and sets xmax = B's XID on the OLD tuple:
-- xmin | xmax | ctid | id | user_id | status | total ← OLD (invisible to new txns)
-- -------+-------+-------+----+---------+---------+-------
-- 12345 | 12350 | (0,1) | 42 | 100 | pending | 99.99
--
-- xmin | xmax | ctid | id | user_id | status | total ← NEW (visible after B commits)
-- -------+-------+-------+----+---------+---------+-------
-- 12350 | 0 | (0,5) | 42 | 100 | shipped | 99.99
A transaction sees a row version if: xmin is a committed transaction that started before the reader, AND (xmax is 0 or xmax is an uncommitted transaction or a transaction that started after the reader).
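PostgreSQL will show you the snapshot your transaction is using (v13+; older versions spell it txid_current_snapshot()); the XIDs below are illustrative:

-- Format is xmin : xmax : list of in-progress XIDs
SELECT pg_current_snapshot();   -- e.g. 12345:12351:12347,12350
-- Any XID ≥ xmax, or listed as in-progress, is invisible to this snapshot.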
MySQL InnoDB: Undo Log Approach
InnoDB takes a different approach. The table always contains the latest version of each row. When a row is updated, the old version is written to the undo log (also called the rollback segment). A chain of undo records lets the database reconstruct any previous version of a row:
-- InnoDB row structure (conceptual):
-- [Row Data (current)] → [Undo Record 1 (previous)] → [Undo Record 2 (older)] → ...
--
-- Each undo record contains:
-- - The old column values
-- - A pointer to the next older undo record
-- - The transaction ID that created it
--
-- To read a snapshot from 5 minutes ago, InnoDB follows the undo chain
-- backwards until it finds a version created before the snapshot timestamp.
-- Check undo log size (indicates long-running transactions):
SELECT count AS undo_log_entries
FROM information_schema.innodb_metrics
WHERE name = 'trx_rseg_history_len';
-- If this grows large (>1M), you have long-running transactions preventing purge.
VACUUM: The PostgreSQL Cleanup Problem
Because PostgreSQL stores old row versions in the main heap, dead tuples accumulate and cause table bloat. The VACUUM process reclaims this space:
- Regular VACUUM — Marks dead tuples as reusable space but doesn't shrink the file. Non-blocking (mostly).
- VACUUM FULL — Rewrites the entire table, reclaiming all space. Requires an `ACCESS EXCLUSIVE` lock—the table is completely unavailable. Avoid in production.
- Autovacuum — A background daemon that automatically vacuums tables when the dead tuple count exceeds a threshold (`autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`).
-- Monitor table bloat
SELECT
schemaname, relname,
n_live_tup,
n_dead_tup,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
-- Tune autovacuum for high-write tables
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- Vacuum when 1% rows are dead (default 20%)
autovacuum_analyze_scale_factor = 0.005, -- Re-analyze stats more often
autovacuum_vacuum_cost_delay = 2 -- Low throttling delay (default was 20ms before PG 12, 2ms since)
);
MySQL InnoDB has an analogous problem: the purge thread removes old undo log entries once no active transaction needs them. If a long-running transaction holds a read view open, the undo log grows unboundedly, causing performance degradation.
• PostgreSQL: Dead tuples in heap → table bloat → needs VACUUM. Old versions are fast to read (same table). Visibility check per-tuple is cheap (compare XIDs).
• MySQL InnoDB: Clean heap (latest only) → no bloat. Old versions in undo log → can be slow to reconstruct if the undo chain is long. Purge is cheaper than VACUUM.
• Both achieve the same goal: readers never block writers, writers never block readers.
Query Optimizer
The query optimizer is the "brain" of the database. It takes a declarative SQL statement ("give me all shipped orders for user 42") and figures out the fastest way to retrieve that data. A good optimizer can make the difference between a query that takes 2 milliseconds and one that takes 20 minutes.
Cost-Based Optimization
Modern optimizers are cost-based: they estimate the "cost" of each possible execution plan and pick the cheapest. Cost is typically measured in abstract units representing disk I/O and CPU work:
-- PostgreSQL cost model parameters
SET seq_page_cost = 1.0; -- Cost to read a page sequentially (baseline)
SET random_page_cost = 4.0; -- Cost to read a random page (4x sequential on HDD, set to 1.1 for SSD)
SET cpu_tuple_cost = 0.01; -- Cost to process a tuple in the executor
SET cpu_index_tuple_cost = 0.005; -- Cost to process an index entry
SET cpu_operator_cost = 0.0025; -- Cost to apply an operator/function
-- CRITICAL: On SSD storage, reduce random_page_cost!
-- Default 4.0 is tuned for HDD. On SSD, set to 1.1-1.5
SET random_page_cost = 1.1; -- SSD: random reads are almost as fast as sequential
Statistics: The Optimizer's Knowledge Base
The optimizer can only make good decisions if it has accurate statistics about the data. PostgreSQL's ANALYZE (and autovacuum's auto-analyze) collects:
- Table size — Number of rows (`reltuples`) and number of pages (`relpages`)
- Column statistics — Most common values (MCVs), a histogram of the value distribution, null fraction, and distinct value count
- Correlation — How closely the physical row order matches the index order (affects index scan cost estimates)
-- View optimizer statistics for a column
SELECT
attname,
null_frac,
n_distinct,
most_common_vals,
most_common_freqs,
correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
-- Example:
-- attname | null_frac | n_distinct | most_common_vals | most_common_freqs | correlation
-- ---------+-----------+------------+---------------------------+--------------------------+-------------
-- status | 0 | 4 | {shipped,pending,cancelled,returned} | {0.45,0.30,0.15,0.10} | 0.12
-- Force statistics refresh
ANALYZE orders;
-- Increase statistics granularity for important columns
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000; -- Default is 100
ANALYZE orders;
Join Algorithms
| Algorithm | How It Works | Best When | Cost |
|---|---|---|---|
| Nested Loop | For each row in the outer table, scan the inner table for matching rows. With an index on the inner table, each lookup is O(log N). | Small outer table + indexed inner table. Excellent for LIMIT queries. | O(N × M) without index; O(N log M) with index |
| Hash Join | Build a hash table from the smaller table (build phase), then probe it for each row of the larger table (probe phase). | Equi-joins on large tables with no useful index. Medium-to-large result sets. | O(N + M) but needs memory for the hash table |
| Merge Join | Sort both tables on the join key, then merge them in a single pass (like merge sort's merge step). | Both inputs are already sorted (e.g., from index scans) or when the result is needed in sorted order. | O(N log N + M log M) for sorting + O(N + M) for merge |
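To see the optimizer's alternatives for yourself, PostgreSQL's enable_* planner flags let you veto join methods per session (a debugging aid only, not a production setting):

-- Force the planner away from hash and merge joins, then compare costs
SET enable_hashjoin = off;
SET enable_mergejoin = off;

EXPLAIN SELECT o.id, u.name
FROM orders o JOIN users u ON o.user_id = u.id;
-- → Nested Loop; compare its estimated cost against the default plan

RESET enable_hashjoin;
RESET enable_mergejoin;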
EXPLAIN: Reading Query Plans
The EXPLAIN command is your window into the optimizer's decision. EXPLAIN ANALYZE actually executes the query and shows real timing:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, u.name
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'shipped' AND o.created_at > '2025-01-01';
QUERY PLAN
--------------------------------------------------------------------------------------------
Hash Join (cost=45.20..1234.56 rows=4500 width=48)
(actual time=0.8..12.3 rows=4523 loops=1)
Hash Cond: (o.user_id = u.id)
Buffers: shared hit=342 read=12
→ Index Scan using idx_status_created on orders o
(cost=0.42..1120.00 rows=4500 width=28)
(actual time=0.05..8.1 rows=4523 loops=1)
Index Cond: ((status = 'shipped') AND (created_at > '2025-01-01'))
Buffers: shared hit=310 read=12
→ Hash (cost=35.00..35.00 rows=1000 width=24)
(actual time=0.6..0.6 rows=1000 loops=1)
Buffers: shared hit=32
→ Seq Scan on users u
(cost=0.00..35.00 rows=1000 width=24)
(actual time=0.01..0.3 rows=1000 loops=1)
Buffers: shared hit=32
Planning Time: 0.15 ms
Execution Time: 12.8 ms
How to read this:
- cost=start..total — Estimated cost in abstract units. `start` is the cost before the first row is returned; `total` is the cost for all rows.
- rows — Estimated (in `EXPLAIN`) or actual (in `EXPLAIN ANALYZE`) row count. Large discrepancies indicate stale statistics.
- Buffers: shared hit/read — Buffer pool hits (free) vs disk reads (expensive). `read=12` means 12 pages fetched from disk.
- actual time — Real wall-clock time per loop iteration, in milliseconds.
Common Optimization Pitfalls
-- ✗ BAD: Function on indexed column prevents index usage
SELECT * FROM orders WHERE YEAR(created_at) = 2025;
-- → Full table scan! The index is on created_at, not YEAR(created_at).
-- ✓ GOOD: Rewrite as a range condition
SELECT * FROM orders WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';
-- → Index range scan on idx_created_at
-- ✗ BAD: Implicit type cast
SELECT * FROM orders WHERE id = '42'; -- id is BIGINT, '42' is text
-- Some databases can't use the index due to the type mismatch.
-- ✓ GOOD: Match the type
SELECT * FROM orders WHERE id = 42;
-- ✗ BAD: SELECT * fetches all columns, preventing index-only scan
SELECT * FROM orders WHERE user_id = 42;
-- ✓ GOOD: Select only needed columns
SELECT id, status, total FROM orders WHERE user_id = 42;
-- ✗ BAD: OR conditions can prevent index usage
SELECT * FROM orders WHERE user_id = 42 OR status = 'shipped';
-- May result in a full table scan if optimizer can't split into two index scans
-- ✓ GOOD: Use UNION ALL for separate index scans
SELECT * FROM orders WHERE user_id = 42
UNION ALL
SELECT * FROM orders WHERE status = 'shipped' AND user_id != 42;
Connection Pooling
Database connections are shockingly expensive. Each connection consumes significant server resources, and the overhead scales poorly. In PostgreSQL, each connection spawns a separate OS process (via fork()), consuming 5–10 MB of RAM. In MySQL, each connection is a thread consuming 1–5 MB. Beyond memory, each connection requires authentication, TLS negotiation, catalog caching, and session state.
Why Too Many Connections Kill Performance
Counterintuitively, increasing max_connections from 200 to 2000 will make your database slower, not faster. Here's why:
- Context switching — With 2000 processes/threads, the CPU spends more time switching between them than doing actual work. On a 16-core server, you might have 125 processes per core, each getting a tiny time slice.
- Lock contention — More concurrent transactions means more lock conflicts, more deadlocks, and more time waiting. PostgreSQL's shared buffer pool has internal lock partitions that become bottlenecks beyond ~300 connections.
- Memory pressure — 2000 connections × 10 MB = 20 GB just for connection overhead, leaving less memory for the buffer pool. This leads to more disk I/O, which is slower for everyone.
- work_mem multiplication — PostgreSQL's `work_mem` (used for sorts and hash joins) is allocated per operation, per connection. With `work_mem = 64MB` and 2000 connections each running a complex query, you could need 128 GB for work_mem alone.
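Before adding a pooler, measure what your connections are actually doing; this sketch usually reveals the classic pattern of many idle sessions and only a handful doing work:

-- PostgreSQL: connection states at a glance
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY count(*) DESC;
-- Hundreds of 'idle' rows plus a few 'active' ones = a pooler will help.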
The Solution: Connection Poolers
| Pooler | Database | Mode | Key Features |
|---|---|---|---|
| PgBouncer | PostgreSQL | External proxy | Ultra-lightweight (2KB per connection). Three pooling modes: session, transaction, statement. Handles 10,000+ client connections with ~50 server connections. |
| Pgpool-II | PostgreSQL | External proxy | Connection pooling + load balancing + read/write splitting. More features but heavier than PgBouncer. |
| ProxySQL | MySQL | External proxy | Connection pooling + query routing + read/write splitting + query caching. Production-grade for MySQL deployments. |
| MySQL Thread Pool | MySQL | Built-in (Enterprise) | Groups connections into thread groups, limits concurrent execution. Available in MySQL Enterprise and MariaDB. |
# PgBouncer configuration (pgbouncer.ini)
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction # Return connection to pool after each transaction
# (not session — session mode defeats the purpose)
max_client_conn = 10000 # Accept up to 10K client connections
default_pool_size = 25 # Only 25 actual PostgreSQL connections per database
reserve_pool_size = 5 # 5 extra connections for burst traffic
reserve_pool_timeout = 3 # Wait 3s before using reserve connections
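Once PgBouncer is running, you can inspect pool behavior live through its admin console (connect with psql to the listen port, database name pgbouncer):

-- psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer
SHOW POOLS;   -- client/server connection counts and waiting clients per pool
SHOW STATS;   -- queries/s, bytes in/out, average query and wait times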
# ProxySQL configuration
# mysql -u admin -padmin -h 127.0.0.1 -P6032
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES
(0, '10.0.1.10', 3306, 1000), -- Primary (writes)
(1, '10.0.1.11', 3306, 500), -- Replica 1 (reads)
(1, '10.0.1.12', 3306, 500); -- Replica 2 (reads)
-- Route reads to replicas, writes to primary
INSERT INTO mysql_query_rules (rule_id, match_pattern, destination_hostgroup) VALUES
(1, '^SELECT.*FOR UPDATE', 0), -- SELECT FOR UPDATE → primary
(2, '^SELECT', 1), -- All other SELECTs → replicas
(3, '.*', 0); -- Everything else → primary
Pool Sizing Formula
connections = (CPU cores × 2) + effective_spindle_count
• On a 16-core server with 1 SSD: (16 × 2) + 1 = 33 connections.
• On a 32-core server with 4 SSDs: (32 × 2) + 4 = 68 connections.
This seems shockingly low, but benchmarks consistently show that a pool of 25–50 connections outperforms 200+ connections on typical hardware. The reason: fewer connections = less contention = each query finishes faster = higher overall throughput. This is well-documented by HikariCP.
Storage Engines
The storage engine is the lowest layer of the database: it determines how data is physically organized on disk, how reads and writes are performed, and what durability and concurrency guarantees are provided. Different engines make fundamentally different trade-offs.
PostgreSQL: Heap Storage
PostgreSQL uses a straightforward heap storage model:
- The table is stored as a sequence of 8KB pages in a file (one file per 1GB segment).
- Rows are inserted wherever there's free space (tracked by the Free Space Map, or FSM).
- Rows are not stored in any particular order (unless you explicitly run `CLUSTER`, which is a one-time operation).
- Each row version includes a header (23 bytes of system columns: xmin, xmax, ctid, etc.) plus the actual data.
- Deleted/updated rows leave dead tuples that must be reclaimed by VACUUM.
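The heap's (page, offset) addressing is visible through the ctid system column; a small sketch (output illustrative):

-- ctid = (page number, line pointer): the row version's physical address
SELECT ctid, id FROM orders LIMIT 4;
--  ctid  | id
-- -------+----
--  (0,1) |  1
--  (0,2) |  2    ← same 8KB page as (0,1)
-- An UPDATE writes a new version at a new ctid, so never store ctids:
-- they are not stable row identifiers.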
TOAST (The Oversized-Attribute Storage Technique): When a row exceeds ~2KB, PostgreSQL automatically compresses and/or moves large column values to a separate TOAST table. This keeps the main table pages efficient for scanning.
-- Check if a table has a TOAST table
SELECT relname, reltoastrelid::regclass AS toast_table
FROM pg_class WHERE relname = 'documents';
-- relname | toast_table
-- -----------+---------------------------
-- documents | pg_toast.pg_toast_16384
-- Check TOAST table size
SELECT pg_size_pretty(pg_total_relation_size('pg_toast.pg_toast_16384'));
-- Monitor table size and bloat
SELECT
relname,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_and_toast_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
MySQL: InnoDB vs MyISAM
| Feature | InnoDB | MyISAM |
|---|---|---|
| ACID compliance | Full (transactions, crash recovery) | No transactions, no crash recovery |
| Locking | Row-level locking via MVCC | Table-level locking |
| Index structure | Clustered B+ tree (PK = data). Secondary indexes store PK values. | B+ tree with pointers to physical row offsets |
| Foreign keys | Supported | Not supported |
| Crash recovery | Redo log + doublewrite buffer → automatic recovery | No recovery. REPAIR TABLE may lose data. |
| Full-text search | Supported (since 5.6) | Supported (historically the main reason to use MyISAM) |
| Compression | Page-level compression, transparent | myisampack for read-only compressed tables |
| Use case (2025) | Everything. Default since MySQL 5.5. | Legacy only. No reason to use in new projects. |
B-Tree vs LSM-Tree Engines
Beyond the traditional B+ tree engines (PostgreSQL, InnoDB), a fundamentally different approach exists: the Log-Structured Merge (LSM) tree, used by RocksDB, LevelDB, Cassandra, and HBase.
| Property | B+ Tree (InnoDB, PostgreSQL) | LSM Tree (RocksDB, Cassandra) |
|---|---|---|
| Write pattern | In-place updates: find the page, modify it | Append-only: writes go to a memtable, then flushed as immutable sorted runs (SSTables) |
| Write amplification | Lower for small updates (modify one page) | Higher due to compaction (data rewritten during merge) |
| Read amplification | Lower: single B+ tree traversal | Higher: may check memtable + multiple SSTable levels. Bloom filters mitigate. |
| Space amplification | Higher (in-place updates, page fragmentation, MVCC bloat) | Can be lower after compaction (but temp space during compaction) |
| Write throughput | Good (limited by random I/O for page writes) | Excellent (all writes are sequential) |
| Read latency | Predictable (O(log N) always) | Variable (depends on compaction state, may need to check multiple levels) |
| Best for | Read-heavy OLTP, complex queries, transactions | Write-heavy workloads, time-series, logging, ingestion pipelines |
-- MySQL with RocksDB engine (MyRocks - used by Facebook for MySQL):
CREATE TABLE events (
id BIGINT PRIMARY KEY,
event_type VARCHAR(50),
payload JSON,
created_at TIMESTAMP
) ENGINE=ROCKSDB;
-- Advantages over InnoDB for Facebook's use case:
-- - 50% less storage (better compression on sorted data)
-- - Higher write throughput for their workload
-- - Lower write amplification
-- PostgreSQL has no pluggable-engine tradition like MySQL, but since v12
-- its table access method API lets extensions such as citus_columnar
-- (columnar storage) and pg_tde (encrypted heap) plug in alternatives.
Choosing the Right Storage Approach
- OLTP (orders, users, payments) → B+ tree engine (InnoDB, PostgreSQL). Predictable reads, ACID transactions, row-level locking.
- Write-heavy ingestion (logs, events, metrics) → LSM engine (RocksDB, Cassandra). Sequential writes, better compression, higher write throughput.
- Analytics (reporting, BI) → Columnar storage (ClickHouse, Redshift, BigQuery). Scans only needed columns, extreme compression ratios.
- Mixed workloads → Consider HTAP databases (TiDB, CockroachDB) that combine row and columnar storage.