SQL Database Internals
How SQL Databases Work Under the Hood
Every time you execute a SELECT * FROM orders WHERE user_id = 42, an extraordinary chain of events unfolds inside the database engine. Understanding this pipeline—parser, optimizer, executor, storage engine—is the difference between a system that handles 50 queries/second and one that handles 50,000. For system design, database internals are not academic trivia; they are the foundation of every scaling decision you will make.
The Query Processing Pipeline
Every SQL statement passes through four distinct phases before a single byte of data is returned:
- Parser & Lexer — The raw SQL string is tokenized and parsed into an Abstract Syntax Tree (AST). Syntax errors are caught here. PostgreSQL uses a Bison-generated parser (`gram.y`, over 20,000 lines); MySQL uses a similar yacc-based parser.
- Analyzer / Rewriter — The AST is validated against the catalog (do these tables exist? do column types match?). Views are expanded, CTEs are inlined or materialized, and rule-based rewrites are applied. PostgreSQL's rewriter handles things like `CREATE RULE` and updatable views.
- Query Optimizer — The most complex phase. The optimizer generates multiple candidate execution plans, estimates the cost of each using statistics, and picks the cheapest. PostgreSQL's optimizer is cost-based and considers factors like sequential I/O cost (`seq_page_cost = 1.0`), random I/O cost (`random_page_cost = 4.0`), CPU tuple processing cost, and join ordering. For a query joining N tables there are N! possible join orderings, so the optimizer uses exhaustive dynamic programming below `geqo_threshold` (default 12 tables) and genetic algorithms (GEQO) at or above it.
- Executor & Storage Engine — The chosen plan is executed node by node in a pull-based (Volcano) iterator model. Each node (SeqScan, IndexScan, HashJoin, Sort, etc.) implements `Open()`, `Next()`, and `Close()`. The executor calls `Next()` on the root node, which recursively calls its children, propagating down to the storage engine that actually reads pages from disk (or the buffer pool).
-- PostgreSQL: Trace the full lifecycle of a query
SET log_statement = 'all';
SET log_parser_stats = on;
SET log_planner_stats = on;
SET log_executor_stats = on;
-- Now run your query and check the server log for timing at each phase
SELECT * FROM orders WHERE user_id = 42 AND status = 'shipped';
PostgreSQL vs MySQL Architecture
Though both are relational, their architectures differ significantly:
| Component | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| Process Model | Multi-process: one OS process per connection (forked from postmaster) | Multi-threaded: one thread per connection within a single process |
| Storage Engine | Single integrated engine (heap storage) | Pluggable: InnoDB (default), MyISAM, Memory, etc. |
| MVCC Strategy | Tuple versioning in heap (xmin/xmax) | Undo log for old row versions |
| Default Page Size | 8 KB | 16 KB |
| WAL Name | WAL (Write-Ahead Log) | Redo Log + Binary Log |
| Optimizer | Cost-based, supports hash joins, merge joins, parallel query | Cost-based, historically weaker (no hash joins until 8.0.18) |
| Replication | Streaming replication (WAL shipping) | Binary log replication (row-based, statement-based, or mixed) |
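A quick way to observe the process-model difference from a client session; both statements are standard, though the exact output depends on your version and workload:

-- PostgreSQL: one backend OS process per connection
SELECT pid, backend_type, state FROM pg_stat_activity;

-- MySQL: one thread per connection inside the single mysqld process
SHOW PROCESSLIST;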
B+ Tree Indexes
The B+ tree is the single most important data structure in relational databases. Every primary key, every unique constraint, every index you create—it's almost certainly a B+ tree. Understanding this structure is understanding why some queries are instant and others bring your database to its knees.
Anatomy of a B+ Tree
A B+ tree is a self-balancing, sorted tree structure optimized for systems that read and write large blocks of data (like disk pages):
- Internal nodes contain only keys and pointers to child nodes. They act as a routing directory. A typical internal node in PostgreSQL (8KB page, 8-byte integer keys) can hold ~500 keys, meaning a single node can route to 500+ children.
- Leaf nodes contain keys and either the actual row data (clustered index) or pointers to the heap (non-clustered index). All leaf nodes are at the same depth, guaranteeing O(log_B N) lookup where B is the branching factor.
- Leaf-to-leaf pointers — All leaf nodes are linked in a doubly-linked list. This makes range scans (`WHERE price BETWEEN 10 AND 50`) extremely efficient: find the start leaf, then follow the chain, as the sketch below shows.
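A minimal sketch of the leaf-chain walk, assuming an index on orders (total) from this section's schema (the name idx_total and the plan text are illustrative, not verbatim output):

-- A range predicate served by one descent plus a leaf-chain walk
CREATE INDEX idx_total ON orders (total);

EXPLAIN SELECT id, total FROM orders WHERE total BETWEEN 10 AND 50;
-- → Index Scan using idx_total on orders
--     Index Cond: ((total >= 10) AND (total <= 50))
-- One root-to-leaf descent finds total = 10; the linked leaves supply
-- the rest of the range without touching internal nodes again.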
• Level 0 (root): 1 page = 500 keys
• Level 1: 500 pages = 250,000 keys
• Level 2: 250,000 pages = 125,000,000 keys
• Level 3: 125,000,000 pages = 62,500,000,000 keys
A table with 125 million rows requires only 3 disk I/Os to find any row by index. The root and level-1 nodes typically stay in the buffer pool (RAM), so in practice it's 1–2 disk reads. This is why B+ trees dominate: they minimize the one thing that's slow—disk I/O.
▶ Interactive: B+ Tree Lookup — step through a B+ tree search for key 42, watching the search navigate from root to leaf and counting disk I/Os at each level.
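You can check a real index's depth directly. A sketch using the pageinspect contrib extension (the index name idx_user_id matches the one created later in this section):

-- Requires the pageinspect extension shipped with PostgreSQL contrib
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- 'level' is the height of the root above the leaves:
-- level = 2 means root → internal → leaf, i.e. 3 page reads per lookup
SELECT root, level FROM bt_metap('idx_user_id');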
Clustered vs Non-Clustered Indexes
| Property | Clustered Index | Non-Clustered (Secondary) Index |
|---|---|---|
| Leaf contains | The actual row data | A pointer back to the clustered index key (InnoDB) or heap tuple ID (PostgreSQL) |
| Physical order | Table rows are physically stored in index order | Separate structure; table rows remain in insertion order |
| Count per table | Exactly one (MySQL InnoDB always has one; PostgreSQL can use CLUSTER but doesn't maintain it) | Unlimited |
| Range scan cost | Sequential I/O (fast)—pages are physically adjacent | Random I/O (slow)—each row pointer may reference a different heap page |
| Lookup cost | Single B+ tree traversal | B+ tree traversal + "bookmark lookup" to fetch the full row |
-- MySQL InnoDB: The PRIMARY KEY IS the clustered index
CREATE TABLE orders (
id BIGINT PRIMARY KEY, -- Clustered index (rows stored in id order)
user_id BIGINT NOT NULL,
status VARCHAR(20),
total DECIMAL(10,2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_user_id (user_id), -- Secondary (non-clustered)
INDEX idx_status_created (status, created_at) -- Composite secondary
);
-- PostgreSQL: Heap storage by default, no persistent clustering
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
status VARCHAR(20),
total NUMERIC(10,2),
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_user_id ON orders (user_id);
CREATE INDEX idx_status_created ON orders (status, created_at);
-- One-time physical reorder (not maintained on inserts!)
CLUSTER orders USING idx_user_id;
Composite Indexes & The Leftmost Prefix Rule
A composite index on (a, b, c) is sorted first by a, then by b within each a group, then by c. This means the index can efficiently serve queries that filter on:
- `WHERE a = ?` — ✓ Uses the index
- `WHERE a = ? AND b = ?` — ✓ Uses the index
- `WHERE a = ? AND b = ? AND c = ?` — ✓ Full index usage
- `WHERE a = ? AND c = ?` — ⚠ Only the `a` part is used; `c` requires a filter afterwards
- `WHERE b = ?` — ✗ Cannot use the index at all (skips the leftmost prefix)
- `WHERE b = ? AND c = ?` — ✗ Same—no leftmost prefix
-- This composite index:
CREATE INDEX idx_composite ON orders (status, created_at, total);
-- ✓ Can serve:
EXPLAIN SELECT * FROM orders WHERE status = 'shipped' AND created_at > '2025-01-01';
-- → Index Range Scan on idx_composite
-- ✗ Cannot efficiently serve:
EXPLAIN SELECT * FROM orders WHERE created_at > '2025-01-01';
-- → Sequential Scan (status is the leftmost prefix, and it's missing)
Index-Only Scans (Covering Indexes)
If all columns referenced in a query (SELECT list, WHERE, ORDER BY) are contained in the index, the database can answer the query entirely from the index without ever touching the table heap. This eliminates the "bookmark lookup" and can be 10–100x faster for certain queries.
-- PostgreSQL: INCLUDE clause for covering indexes (v11+)
CREATE INDEX idx_covering ON orders (user_id) INCLUDE (status, total);
-- This query is now index-only:
EXPLAIN ANALYZE SELECT status, total FROM orders WHERE user_id = 42;
-- → Index Only Scan using idx_covering on orders
-- Heap Fetches: 0 ← No heap access needed!
-- MySQL: Covering index via composite
CREATE INDEX idx_covering ON orders (user_id, status, total);
-- Same query can use "Using index" optimization (visible in EXPLAIN Extra column)
VACUUM regularly (or rely on autovacuum) to keep index-only scans fast. Check Heap Fetches in EXPLAIN ANALYZE—if it's high, your visibility map is stale.
Buffer Pool / Page Cache
The buffer pool is the single most important performance component of any SQL database. It is a region of shared memory that caches data pages read from disk, sitting between the executor and the disk. Almost every query optimization ultimately comes down to one thing: keeping the data you need in the buffer pool.
Pages: The Unit of I/O
Databases do not read individual rows from disk. They read pages (also called blocks)—fixed-size chunks that are the smallest unit of I/O:
A single 8KB PostgreSQL page holding orders rows (each ~200 bytes) contains approximately 40 rows. Reading one row means reading 40 rows—this is why sequential access patterns (reading adjacent rows) are so much faster than random access. The buffer pool exploits this: once a page is read for one row, the other 39 rows are "free."
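You can check the real rows-per-page ratio for a table from the planner's own statistics; a small sketch (run ANALYZE first so reltuples and relpages are fresh):

-- Average rows per 8KB page, straight from pg_class statistics
SELECT relname,
       relpages,
       reltuples,
       ROUND(reltuples::numeric / NULLIF(relpages, 0), 1) AS rows_per_page
FROM pg_class
WHERE relname = 'orders';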
How the Buffer Pool Works
- Page request — The executor needs page #1234 of the `orders` table.
- Buffer pool lookup — A hash table maps (table OID, page number) → buffer pool slot. If found → buffer hit (instant, ~100 ns). If not → buffer miss.
- Eviction — If the pool is full, an eviction algorithm selects a victim page. PostgreSQL uses a clock-sweep algorithm (a variant of LRU with usage counters). MySQL InnoDB uses a modified LRU with a "young" and "old" sublist (the midpoint insertion strategy) to prevent full-table scans from flushing the entire cache.
- Disk read — The page is read from the OS file into the freed buffer slot.
- Pin & return — The page is "pinned" (can't be evicted) while the executor uses it, then "unpinned" when done.
-- PostgreSQL: Check buffer pool (shared_buffers) hit rate
SELECT
datname,
blks_hit,
blks_read,
ROUND(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0) * 100, 2) AS hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();
-- Target: > 99% hit ratio. Below 95% = serious problem.
-- Example output:
-- datname | blks_hit | blks_read | hit_ratio_pct
-- ----------+-----------+-----------+---------------
-- mydb | 987654321 | 1234567 | 99.87
-- PostgreSQL: Configure buffer pool size
-- postgresql.conf
-- shared_buffers = 8GB # 25% of total RAM as starting point
-- effective_cache_size = 24GB # 75% of RAM (tells optimizer about OS cache too)
-- MySQL InnoDB: Check buffer pool stats
SHOW ENGINE INNODB STATUS\G
-- Or query the metrics:
SELECT
pool_id,
pool_size * 16 / 1024 AS pool_size_mb,
free_buffers,
database_pages,
old_database_pages,
modified_db_pages AS dirty_pages,
ROUND((1 - innodb_buffer_pool_reads / innodb_buffer_pool_read_requests) * 100, 2) AS hit_ratio
FROM information_schema.innodb_buffer_pool_stats,
(SELECT variable_value AS innodb_buffer_pool_reads FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_reads') r,
(SELECT variable_value AS innodb_buffer_pool_read_requests FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_read_requests') rr;
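To see which relations actually occupy PostgreSQL's shared buffers (not just the aggregate hit ratio), the pg_buffercache contrib extension exposes the pool's contents; a sketch (for brevity it ignores buffers belonging to other databases):

-- PostgreSQL: which relations occupy the buffer pool right now?
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;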
Dirty Pages and Flushing
When a write operation modifies a page in the buffer pool, that page becomes dirty—the in-memory version differs from the on-disk version. Dirty pages must eventually be flushed (written) to disk. The critical question is when:
- Background flushing — PostgreSQL's `bgwriter` process continuously writes dirty pages to disk in the background, smoothing out I/O spikes. MySQL's `page_cleaner` threads do the same.
- Checkpoint flushing — Periodically (or when the WAL reaches a size threshold), a checkpoint writes all dirty pages to disk. This is the most I/O-intensive operation and can cause latency spikes if misconfigured.
- Eviction flushing — If the only available victim page is dirty, it must be flushed before the slot can be reused. This is the worst case and causes synchronous I/O stalls.
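You can see which of these three paths is doing the flushing via pg_stat_bgwriter (in PostgreSQL 17 several of these counters moved to pg_stat_checkpointer, so this sketch assumes v16 or earlier):

-- Who is writing dirty pages? (PostgreSQL ≤ 16)
SELECT checkpoints_timed,   -- checkpoints triggered by checkpoint_timeout (good)
       checkpoints_req,     -- checkpoints forced by WAL volume (bad if dominant)
       buffers_checkpoint,  -- pages flushed by checkpoints
       buffers_clean,       -- pages flushed by the background writer
       buffers_backend      -- pages flushed by backends at eviction (worst case)
FROM pg_stat_bgwriter;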
• PostgreSQL: Set `shared_buffers` to 25% of total RAM (the OS filesystem cache handles the rest). Set `effective_cache_size` to 75% of RAM.
• MySQL InnoDB: Set `innodb_buffer_pool_size` to 70–80% of total RAM (InnoDB manages its own I/O, bypassing the OS cache with O_DIRECT).
• On a 64GB server: PG → `shared_buffers = 16GB`, MySQL → `innodb_buffer_pool_size = 48GB`.
Write-Ahead Log (WAL)
The Write-Ahead Log is the mechanism that makes databases durable—the D in ACID. The rule is brutally simple: before any change is applied to data pages, a description of that change must first be written to the WAL and flushed to stable storage. This single rule is why your data survives power outages, kernel panics, and mid-write crashes.
Why WAL Exists: The Problem It Solves
Without WAL, a crash during a write could leave the database in a half-modified, inconsistent state. Consider updating an account balance from $100 to $50 and transferring $50 to another account. If the system crashes after the first write but before the second, $50 has vanished. WAL prevents this by recording the entire transaction atomically in a sequential log before touching any data pages.
How WAL Works: Step by Step
- Transaction begins — `BEGIN; UPDATE accounts SET balance = balance - 50 WHERE id = 1;`
- Generate WAL record — The database creates a log record describing the change: "Page 42, offset 128, old value = 100, new value = 50."
- Write WAL to disk — The record is appended to the WAL file (`pg_wal/` in PostgreSQL, the redo log in MySQL). This is a sequential write—appending to the end of a file—which is extremely fast even on HDDs (~100 MB/s).
- Modify buffer pool page — The data page in memory is updated. It becomes a dirty page.
- COMMIT → fsync WAL — On `COMMIT`, the WAL buffer is flushed to disk with `fsync()`. Once this returns, the transaction is durable. The data pages may still only exist in memory.
- Acknowledge client — "COMMIT OK" is sent back. The client can be confident the data is safe.
- Later: Checkpoint — Eventually, the dirty data pages are flushed to their actual locations on disk (random writes). Once a checkpoint completes, the WAL records before that point can be recycled.
▶ Interactive: WAL Write Flow — step through the write path: client write → WAL → buffer pool → ack → checkpoint → crash recovery.
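PostgreSQL exposes WAL positions as LSNs (log sequence numbers), so you can measure WAL volume directly; a small sketch using built-in functions (v10+ names):

-- Current WAL insert position and the segment file that holds it
SELECT pg_current_wal_lsn() AS current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) AS current_segment;

-- Capture an LSN, run your workload, then diff to get bytes of WAL written
-- ('0/3000000' is an illustrative earlier position)
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000000') AS wal_bytes;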
Sequential vs Random I/O: Why WAL Is Fast
| I/O Pattern | HDD | SSD | NVMe |
|---|---|---|---|
| Sequential write (WAL) | 100–200 MB/s | 400–600 MB/s | 2–5 GB/s |
| Random write (data pages) | 1–2 MB/s | 200–400 MB/s | 1–3 GB/s |
| Speed ratio | ~100x faster | ~2x faster | ~2x faster |
On HDDs, WAL provides a 100x performance improvement because sequential writes avoid seek time. On modern NVMe SSDs, the gap is smaller, but WAL still provides the crucial durability guarantee—you can't safely skip it.
Checkpointing
Checkpointing writes all dirty pages to disk and records "everything up to WAL position X is safely on disk." This serves two purposes:
- Limits recovery time — After a crash, the database only needs to replay WAL records after the last checkpoint. Without checkpoints, recovery could mean replaying days of WAL.
- Reclaims WAL space — WAL segments before the checkpoint are no longer needed for recovery and can be recycled.
-- PostgreSQL: Checkpoint tuning
checkpoint_timeout = 15min -- Time between checkpoints (default 5min, increase for write-heavy)
max_wal_size = 4GB -- WAL size that triggers a checkpoint
checkpoint_completion_target = 0.9 -- Spread I/O over 90% of checkpoint interval
-- MySQL InnoDB: Redo log tuning
innodb_log_file_size = 2G -- Size of each redo log file
innodb_log_files_in_group = 2 -- Number of redo log files (total = 4GB)
innodb_flush_log_at_trx_commit = 1 -- 1 = flush on every commit (durable)
-- 2 = flush every second (faster, risk of 1s data loss)
-- 0 = write to log buffer only (fastest, risky)
WAL Archiving & Replication
WAL is not just for crash recovery. It is the foundation of database replication:
- PostgreSQL streaming replication — The primary streams WAL records to standby servers in near-real-time. The standby replays the WAL to maintain an identical copy. Synchronous replication waits for the standby to confirm before committing.
- Point-in-time recovery (PITR) — By archiving WAL segments, you can restore a database to any point in time. Take a base backup, then replay WAL up to the desired timestamp.
- MySQL binary log — MySQL uses a separate binary log (binlog) for replication, distinct from the InnoDB redo log. The binlog records logical changes (row images or SQL statements) and is shipped to replicas.
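On a PostgreSQL primary you can watch WAL shipping in flight; a sketch using pg_stat_replication (column names as of v10+):

-- How far behind is each standby, in bytes of WAL?
SELECT client_addr,
       state,
       sent_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;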
MVCC (Multi-Version Concurrency Control)
MVCC is the mechanism that allows multiple transactions to read and write data concurrently without blocking each other. Instead of locking rows during reads, MVCC gives each transaction a consistent snapshot of the database at the moment it started. Readers never block writers. Writers never block readers. This is the secret to high concurrency in modern SQL databases.
The Core Idea
Without MVCC, a SELECT query on a row being updated by another transaction would have to either (a) block and wait for the write to complete, or (b) read uncommitted ("dirty") data. MVCC provides a third option: read the old version of the row that was valid at the start of your transaction. Each row can have multiple versions simultaneously, each visible to different transactions based on their start time.
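A minimal two-session sketch of this behavior in PostgreSQL, reusing this section's orders table (REPEATABLE READ pins the snapshot for the whole transaction; under the default READ COMMITTED each statement would get a fresh snapshot):

-- Session A
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT status FROM orders WHERE id = 42;   -- → 'pending'

-- Session B (a second connection), meanwhile:
UPDATE orders SET status = 'shipped' WHERE id = 42;  -- commits; does not block A

-- Session A again: still reads its snapshot, no waiting
SELECT status FROM orders WHERE id = 42;   -- → 'pending'
COMMIT;
SELECT status FROM orders WHERE id = 42;   -- → 'shipped'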
PostgreSQL: Tuple Versioning (xmin / xmax)
PostgreSQL stores all row versions directly in the main table heap. Every tuple has two hidden system columns:
- `xmin` — The transaction ID (XID) that created this row version.
- `xmax` — The transaction ID that deleted or updated this row version (0 if the row is still "live").
-- See the hidden system columns
SELECT xmin, xmax, ctid, * FROM orders WHERE id = 42;
-- xmin | xmax | ctid | id | user_id | status | total
-- -------+------+-------+----+---------+---------+-------
-- 12345 | 0 | (0,1) | 42 | 100 | pending | 99.99
-- Now another transaction updates this row:
-- Transaction B: UPDATE orders SET status = 'shipped' WHERE id = 42;
-- This creates a NEW tuple with xmin = B's XID,
-- and sets xmax = B's XID on the OLD tuple:
-- xmin | xmax | ctid | id | user_id | status | total ← OLD (invisible to new txns)
-- -------+-------+-------+----+---------+---------+-------
-- 12345 | 12350 | (0,1) | 42 | 100 | pending | 99.99
--
-- xmin | xmax | ctid | id | user_id | status | total ← NEW (visible after B commits)
-- -------+-------+-------+----+---------+---------+-------
-- 12350 | 0 | (0,5) | 42 | 100 | shipped | 99.99
A transaction sees a row version if: xmin is a committed transaction that started before the reader, AND (xmax is 0 or xmax is an uncommitted transaction or a transaction that started after the reader).
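PostgreSQL will show you the snapshot your transaction is using (v13+; older versions spell it txid_current_snapshot()); the XIDs below are illustrative:

-- Format is xmin : xmax : list of in-progress XIDs
SELECT pg_current_snapshot();   -- e.g. 12345:12351:12347,12350
-- Any XID ≥ xmax, or listed as in-progress, is invisible to this snapshot.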
MySQL InnoDB: Undo Log Approach
InnoDB takes a different approach. The table always contains the latest version of each row. When a row is updated, the old version is written to the undo log (also called the rollback segment). A chain of undo records lets the database reconstruct any previous version of a row:
-- InnoDB row structure (conceptual):
-- [Row Data (current)] → [Undo Record 1 (previous)] → [Undo Record 2 (older)] → ...
--
-- Each undo record contains:
-- - The old column values
-- - A pointer to the next older undo record
-- - The transaction ID that created it
--
-- To read a snapshot from 5 minutes ago, InnoDB follows the undo chain
-- backwards until it finds a version created before the snapshot timestamp.
-- Check undo log size (indicates long-running transactions):
SELECT count AS undo_log_entries
FROM information_schema.innodb_metrics
WHERE name = 'trx_rseg_history_len';
-- If this grows large (>1M), you have long-running transactions preventing purge.
VACUUM: The PostgreSQL Cleanup Problem
Because PostgreSQL stores old row versions in the main heap, dead tuples accumulate and cause table bloat. The VACUUM process reclaims this space:
- Regular VACUUM — Marks dead tuples as reusable space but doesn't shrink the file. Non-blocking (mostly).
- VACUUM FULL — Rewrites the entire table, reclaiming all space. Requires an `ACCESS EXCLUSIVE` lock—the table is completely unavailable. Avoid in production.
- Autovacuum — A background daemon that automatically vacuums tables when the dead tuple count exceeds a threshold (`autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`).
-- Monitor table bloat
SELECT
schemaname, relname,
n_live_tup,
n_dead_tup,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
-- Tune autovacuum for high-write tables
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- Vacuum when 1% rows are dead (default 20%)
autovacuum_analyze_scale_factor = 0.005, -- Re-analyze stats more often
autovacuum_vacuum_cost_delay = 2 -- Low throttling delay (default was 20ms before PG 12, 2ms since)
);
MySQL InnoDB has an analogous problem: the purge thread removes old undo log entries once no active transaction needs them. If a long-running transaction holds a read view open, the undo log grows unboundedly, causing performance degradation.
• PostgreSQL: Dead tuples in heap → table bloat → needs VACUUM. Old versions are fast to read (same table). Visibility check per-tuple is cheap (compare XIDs).
• MySQL InnoDB: Clean heap (latest only) → no bloat. Old versions in undo log → can be slow to reconstruct if the undo chain is long. Purge is cheaper than VACUUM.
• Both achieve the same goal: readers never block writers, writers never block readers.
Query Optimizer
The query optimizer is the "brain" of the database. It takes a declarative SQL statement ("give me all shipped orders for user 42") and figures out the fastest way to retrieve that data. A good optimizer can make the difference between a query that takes 2 milliseconds and one that takes 20 minutes.
Cost-Based Optimization
Modern optimizers are cost-based: they estimate the "cost" of each possible execution plan and pick the cheapest. Cost is typically measured in abstract units representing disk I/O and CPU work:
-- PostgreSQL cost model parameters
SET seq_page_cost = 1.0; -- Cost to read a page sequentially (baseline)
SET random_page_cost = 4.0; -- Cost to read a random page (4x sequential on HDD, set to 1.1 for SSD)
SET cpu_tuple_cost = 0.01; -- Cost to process a tuple in the executor
SET cpu_index_tuple_cost = 0.005; -- Cost to process an index entry
SET cpu_operator_cost = 0.0025; -- Cost to apply an operator/function
-- CRITICAL: On SSD storage, reduce random_page_cost!
-- Default 4.0 is tuned for HDD. On SSD, set to 1.1-1.5
SET random_page_cost = 1.1; -- SSD: random reads are almost as fast as sequential
Statistics: The Optimizer's Knowledge Base
The optimizer can only make good decisions if it has accurate statistics about the data. PostgreSQL's ANALYZE (and autovacuum's auto-analyze) collects:
- Table size — Number of rows (`reltuples`) and number of pages (`relpages`)
- Column statistics — Most common values (MCVs), a histogram of the value distribution, null fraction, and distinct value count
- Correlation — How closely the physical row order matches the index order (affects index scan cost estimates)
-- View optimizer statistics for a column
SELECT
attname,
null_frac,
n_distinct,
most_common_vals,
most_common_freqs,
correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
-- Example:
-- attname | null_frac | n_distinct | most_common_vals | most_common_freqs | correlation
-- ---------+-----------+------------+---------------------------+--------------------------+-------------
-- status | 0 | 4 | {shipped,pending,cancelled,returned} | {0.45,0.30,0.15,0.10} | 0.12
-- Force statistics refresh
ANALYZE orders;
-- Increase statistics granularity for important columns
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000; -- Default is 100
ANALYZE orders;
Join Algorithms
| Algorithm | How It Works | Best When | Cost |
|---|---|---|---|
| Nested Loop | For each row in the outer table, scan the inner table for matching rows. With an index on the inner table, each lookup is O(log N). | Small outer table + indexed inner table. Excellent for LIMIT queries. | O(N × M) without index; O(N log M) with index |
| Hash Join | Build a hash table from the smaller table (build phase), then probe it for each row of the larger table (probe phase). | Equi-joins on large tables with no useful index. Medium-to-large result sets. | O(N + M) but needs memory for the hash table |
| Merge Join | Sort both tables on the join key, then merge them in a single pass (like merge sort's merge step). | Both inputs are already sorted (e.g., from index scans) or when the result is needed in sorted order. | O(N log N + M log M) for sorting + O(N + M) for merge |
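To see the optimizer's alternatives for yourself, PostgreSQL's enable_* planner flags let you veto join methods per session (a debugging aid only, not a production setting):

-- Force the planner away from hash and merge joins, then compare costs
SET enable_hashjoin = off;
SET enable_mergejoin = off;

EXPLAIN SELECT o.id, u.name
FROM orders o JOIN users u ON o.user_id = u.id;
-- → Nested Loop; compare its estimated cost against the default plan

RESET enable_hashjoin;
RESET enable_mergejoin;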
EXPLAIN: Reading Query Plans
The EXPLAIN command is your window into the optimizer's decision. EXPLAIN ANALYZE actually executes the query and shows real timing:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, u.name
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'shipped' AND o.created_at > '2025-01-01';
QUERY PLAN
--------------------------------------------------------------------------------------------
Hash Join (cost=45.20..1234.56 rows=4500 width=48)
(actual time=0.8..12.3 rows=4523 loops=1)
Hash Cond: (o.user_id = u.id)
Buffers: shared hit=342 read=12
→ Index Scan using idx_status_created on orders o
(cost=0.42..1120.00 rows=4500 width=28)
(actual time=0.05..8.1 rows=4523 loops=1)
Index Cond: ((status = 'shipped') AND (created_at > '2025-01-01'))
Buffers: shared hit=310 read=12
→ Hash (cost=35.00..35.00 rows=1000 width=24)
(actual time=0.6..0.6 rows=1000 loops=1)
Buffers: shared hit=32
→ Seq Scan on users u
(cost=0.00..35.00 rows=1000 width=24)
(actual time=0.01..0.3 rows=1000 loops=1)
Buffers: shared hit=32
Planning Time: 0.15 ms
Execution Time: 12.8 ms
How to read this:
- cost=start..total — Estimated cost in abstract units. `start` is the cost before the first row is returned; `total` is the cost for all rows.
- rows — Estimated (in `EXPLAIN`) or actual (in `EXPLAIN ANALYZE`) row count. Large discrepancies indicate stale statistics.
- Buffers: shared hit/read — Buffer pool hits (free) vs disk reads (expensive). `read=12` means 12 pages fetched from disk.
- actual time — Real wall-clock time per loop iteration, in milliseconds.
Common Optimization Pitfalls
-- ✗ BAD: Function on indexed column prevents index usage
SELECT * FROM orders WHERE YEAR(created_at) = 2025;
-- → Full table scan! The index is on created_at, not YEAR(created_at).
-- ✓ GOOD: Rewrite as a range condition
SELECT * FROM orders WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';
-- → Index range scan on idx_created_at
-- ✗ BAD: Implicit type cast
SELECT * FROM orders WHERE id = '42'; -- id is BIGINT, '42' is text
-- Some databases can't use the index due to the type mismatch.
-- ✓ GOOD: Match the type
SELECT * FROM orders WHERE id = 42;
-- ✗ BAD: SELECT * fetches all columns, preventing index-only scan
SELECT * FROM orders WHERE user_id = 42;
-- ✓ GOOD: Select only needed columns
SELECT id, status, total FROM orders WHERE user_id = 42;
-- ✗ BAD: OR conditions can prevent index usage
SELECT * FROM orders WHERE user_id = 42 OR status = 'shipped';
-- May result in a full table scan if optimizer can't split into two index scans
-- ✓ GOOD: Use UNION ALL for separate index scans
SELECT * FROM orders WHERE user_id = 42
UNION ALL
SELECT * FROM orders WHERE status = 'shipped' AND user_id != 42;
Connection Pooling
Database connections are shockingly expensive. Each connection consumes significant server resources, and the overhead scales poorly. In PostgreSQL, each connection spawns a separate OS process (via fork()), consuming 5–10 MB of RAM. In MySQL, each connection is a thread consuming 1–5 MB. Beyond memory, each connection requires authentication, TLS negotiation, catalog caching, and session state.
Why Too Many Connections Kill Performance
Counterintuitively, increasing max_connections from 200 to 2000 will make your database slower, not faster. Here's why:
- Context switching — With 2000 processes/threads, the CPU spends more time switching between them than doing actual work. On a 16-core server, you might have 125 processes per core, each getting a tiny time slice.
- Lock contention — More concurrent transactions means more lock conflicts, more deadlocks, and more time waiting. PostgreSQL's shared buffer pool has internal lock partitions that become bottlenecks beyond ~300 connections.
- Memory pressure — 2000 connections × 10 MB = 20 GB just for connection overhead, leaving less memory for the buffer pool. This leads to more disk I/O, which is slower for everyone.
- work_mem multiplication — PostgreSQL's `work_mem` (used for sorts and hash joins) is allocated per operation, per connection. With `work_mem = 64MB` and 2000 connections each running a complex query, you could need 128 GB for work_mem alone.
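Before adding a pooler, measure what your connections are actually doing; this sketch usually reveals the classic pattern of many idle sessions and only a handful doing work:

-- PostgreSQL: connection states at a glance
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY count(*) DESC;
-- Hundreds of 'idle' rows plus a few 'active' ones = a pooler will help.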
The Solution: Connection Poolers
| Pooler | Database | Mode | Key Features |
|---|---|---|---|
| PgBouncer | PostgreSQL | External proxy | Ultra-lightweight (2KB per connection). Three pooling modes: session, transaction, statement. Handles 10,000+ client connections with ~50 server connections. |
| Pgpool-II | PostgreSQL | External proxy | Connection pooling + load balancing + read/write splitting. More features but heavier than PgBouncer. |
| ProxySQL | MySQL | External proxy | Connection pooling + query routing + read/write splitting + query caching. Production-grade for MySQL deployments. |
| MySQL Thread Pool | MySQL | Built-in (Enterprise) | Groups connections into thread groups, limits concurrent execution. Available in MySQL Enterprise and MariaDB. |
# PgBouncer configuration (pgbouncer.ini)
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
listen_port = 6432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction # Return connection to pool after each transaction
# (not session — session mode defeats the purpose)
max_client_conn = 10000 # Accept up to 10K client connections
default_pool_size = 25 # Only 25 actual PostgreSQL connections per database
reserve_pool_size = 5 # 5 extra connections for burst traffic
reserve_pool_timeout = 3 # Wait 3s before using reserve connections
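Once PgBouncer is running, you can inspect pool behavior live through its admin console (connect with psql to the listen port, database name pgbouncer):

-- psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer
SHOW POOLS;   -- client/server connection counts and waiting clients per pool
SHOW STATS;   -- queries/s, bytes in/out, average query and wait times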
# ProxySQL configuration
# mysql -u admin -padmin -h 127.0.0.1 -P6032
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES
(0, '10.0.1.10', 3306, 1000), -- Primary (writes)
(1, '10.0.1.11', 3306, 500), -- Replica 1 (reads)
(1, '10.0.1.12', 3306, 500); -- Replica 2 (reads)
-- Route reads to replicas, writes to primary
INSERT INTO mysql_query_rules (rule_id, match_pattern, destination_hostgroup) VALUES
(1, '^SELECT.*FOR UPDATE', 0), -- SELECT FOR UPDATE → primary
(2, '^SELECT', 1), -- All other SELECTs → replicas
(3, '.*', 0); -- Everything else → primary
Pool Sizing Formula
connections = (CPU cores × 2) + effective_spindle_count
• On a 16-core server with 1 SSD: (16 × 2) + 1 = 33 connections.
• On a 32-core server with 4 SSDs: (32 × 2) + 4 = 68 connections.
This seems shockingly low, but benchmarks consistently show that a pool of 25–50 connections outperforms 200+ connections on typical hardware. The reason: fewer connections = less contention = each query finishes faster = higher overall throughput. This is well-documented by HikariCP.
Storage Engines
The storage engine is the lowest layer of the database: it determines how data is physically organized on disk, how reads and writes are performed, and what durability and concurrency guarantees are provided. Different engines make fundamentally different trade-offs.
PostgreSQL: Heap Storage
PostgreSQL uses a straightforward heap storage model:
- The table is stored as a sequence of 8KB pages in a file (one file per 1GB segment).
- Rows are inserted wherever there's free space (tracked by the Free Space Map, or FSM).
- Rows are not stored in any particular order (unless you explicitly run `CLUSTER`, which is a one-time operation).
- Each row version includes a header (23 bytes of system columns: xmin, xmax, ctid, etc.) plus the actual data.
- Deleted/updated rows leave dead tuples that must be reclaimed by VACUUM.
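The heap's (page, offset) addressing is visible through the ctid system column; a small sketch (output illustrative):

-- ctid = (page number, line pointer): the row version's physical address
SELECT ctid, id FROM orders LIMIT 4;
--  ctid  | id
-- -------+----
--  (0,1) |  1
--  (0,2) |  2    ← same 8KB page as (0,1)
-- An UPDATE writes a new version at a new ctid, so never store ctids:
-- they are not stable row identifiers.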
TOAST (The Oversized-Attribute Storage Technique): When a row exceeds ~2KB, PostgreSQL automatically compresses and/or moves large column values to a separate TOAST table. This keeps the main table pages efficient for scanning.
-- Check if a table has a TOAST table
SELECT relname, reltoastrelid::regclass AS toast_table
FROM pg_class WHERE relname = 'documents';
-- relname | toast_table
-- -----------+---------------------------
-- documents | pg_toast.pg_toast_16384
-- Check TOAST table size
SELECT pg_size_pretty(pg_total_relation_size('pg_toast.pg_toast_16384'));
-- Monitor table size and bloat
SELECT
relname,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_and_toast_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
MySQL: InnoDB vs MyISAM
| Feature | InnoDB | MyISAM |
|---|---|---|
| ACID compliance | Full (transactions, crash recovery) | No transactions, no crash recovery |
| Locking | Row-level locking via MVCC | Table-level locking |
| Index structure | Clustered B+ tree (PK = data). Secondary indexes store PK values. | B+ tree with pointers to physical row offsets |
| Foreign keys | Supported | Not supported |
| Crash recovery | Redo log + doublewrite buffer → automatic recovery | No recovery. REPAIR TABLE may lose data. |
| Full-text search | Supported (since 5.6) | Supported (historically the main reason to use MyISAM) |
| Compression | Page-level compression, transparent | myisampack for read-only compressed tables |
| Use case (2025) | Everything. Default since MySQL 5.5. | Legacy only. No reason to use in new projects. |
B-Tree vs LSM-Tree Engines
Beyond the traditional B+ tree engines (PostgreSQL, InnoDB), a fundamentally different approach exists: the Log-Structured Merge (LSM) tree, used by RocksDB, LevelDB, Cassandra, and HBase.
| Property | B+ Tree (InnoDB, PostgreSQL) | LSM Tree (RocksDB, Cassandra) |
|---|---|---|
| Write pattern | In-place updates: find the page, modify it | Append-only: writes go to a memtable, then flushed as immutable sorted runs (SSTables) |
| Write amplification | Lower for small updates (modify one page) | Higher due to compaction (data rewritten during merge) |
| Read amplification | Lower: single B+ tree traversal | Higher: may check memtable + multiple SSTable levels. Bloom filters mitigate. |
| Space amplification | Higher (in-place updates, page fragmentation, MVCC bloat) | Can be lower after compaction (but temp space during compaction) |
| Write throughput | Good (limited by random I/O for page writes) | Excellent (all writes are sequential) |
| Read latency | Predictable (O(log N) always) | Variable (depends on compaction state, may need to check multiple levels) |
| Best for | Read-heavy OLTP, complex queries, transactions | Write-heavy workloads, time-series, logging, ingestion pipelines |
-- MySQL with RocksDB engine (MyRocks - used by Facebook for MySQL):
CREATE TABLE events (
id BIGINT PRIMARY KEY,
event_type VARCHAR(50),
payload JSON,
created_at TIMESTAMP
) ENGINE=ROCKSDB;
-- Advantages over InnoDB for Facebook's use case:
-- - 50% less storage (better compression on sorted data)
-- - Higher write throughput for their workload
-- - Lower write amplification
-- PostgreSQL has no pluggable-engine tradition like MySQL, but since v12
-- its table access method API lets extensions such as citus_columnar
-- (columnar storage) and pg_tde (encrypted heap) plug in alternatives.
Choosing the Right Storage Approach
- OLTP (orders, users, payments) → B+ tree engine (InnoDB, PostgreSQL). Predictable reads, ACID transactions, row-level locking.
- Write-heavy ingestion (logs, events, metrics) → LSM engine (RocksDB, Cassandra). Sequential writes, better compression, higher write throughput.
- Analytics (reporting, BI) → Columnar storage (ClickHouse, Redshift, BigQuery). Scans only needed columns, extreme compression ratios.
- Mixed workloads → Consider HTAP databases (TiDB, CockroachDB) that combine row and columnar storage.