High Level Design Series · Foundations · Part 5

SQL vs NoSQL

Database Landscape

Every production system needs to persist data, and the database you choose shapes nearly everything downstream — your query patterns, consistency guarantees, scaling strategy, and operational complexity. The database landscape has evolved dramatically from the single-option world of relational databases in the 1970s to today's ecosystem of specialized engines, each optimized for a particular access pattern.

The modern landscape breaks down into five major categories:

| Category | Data Model | Examples |
|---|---|---|
| Relational (SQL) | Tables with rows & columns, foreign keys | PostgreSQL, MySQL, Oracle |
| Document | JSON/BSON documents in collections | MongoDB, CouchDB, Firestore |
| Key-Value | Opaque values indexed by key | Redis, DynamoDB, Memcached |
| Column-Family | Wide rows with dynamic columns | Cassandra, HBase, ScyllaDB |
| Graph | Nodes + edges with properties | Neo4j, Amazon Neptune, JanusGraph |

Key insight: There is no universally "best" database. The right choice depends on your data shape, access patterns, consistency requirements, and scale. Most real systems use multiple databases — a pattern called polyglot persistence.

Relational Databases (SQL)

Relational databases store data in tables (relations) composed of rows (tuples) and columns (attributes). Each table has a fixed schema that defines column names, types, and constraints. Relationships between tables are expressed through foreign keys, and data is queried using SQL (Structured Query Language).

Core Concepts

  • Table (relation) — a named set of rows that all share the same columns
  • Row (tuple) — one record; Column (attribute) — one typed field with optional constraints
  • Primary key — uniquely identifies a row; Foreign key — references a row in another table
  • Normalization — splitting data across tables to avoid duplication and update anomalies

Example Schema — E-Commerce Orders

CREATE TABLE users (
    id          BIGSERIAL PRIMARY KEY,
    email       VARCHAR(255) UNIQUE NOT NULL,
    name        VARCHAR(100) NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT REFERENCES users(id),
    status      VARCHAR(20) DEFAULT 'pending',
    total_cents BIGINT NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE order_items (
    id          BIGSERIAL PRIMARY KEY,
    order_id    BIGINT REFERENCES orders(id),
    product_id  BIGINT NOT NULL,
    quantity    INT NOT NULL,
    price_cents BIGINT NOT NULL
);

-- Index for fast lookup by user
CREATE INDEX idx_orders_user ON orders(user_id);
-- Composite index for queries filtered by status + date
CREATE INDEX idx_orders_status_date ON orders(status, created_at DESC);

ACID Transactions

The defining strength of relational databases is ACID compliance — Atomicity (all-or-nothing), Consistency (constraints always hold), Isolation (concurrent transactions don't interfere), and Durability (committed data survives crashes):

-- Transfer $100 from Account A to Account B
BEGIN;
  UPDATE accounts SET balance = balance - 10000 WHERE id = 'A';
  UPDATE accounts SET balance = balance + 10000 WHERE id = 'B';

  -- If either UPDATE fails, ROLLBACK undoes both
  -- Constraints: CHECK (balance >= 0) prevents overdrafts
COMMIT;
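
The same transfer can be exercised end to end with SQLite (the table and amounts are illustrative; any ACID engine behaves the same way — a failed CHECK rolls back both UPDATEs):

```python
import sqlite3

# In-memory database with an overdraft-prevention CHECK constraint
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    id      TEXT PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0))""")
conn.execute("INSERT INTO accounts VALUES ('A', 5000), ('B', 0)")
conn.commit()

def transfer(cents):
    """Move cents from A to B atomically; returns False if rolled back."""
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 'A'", (cents,))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 'B'", (cents,))
        return True
    except sqlite3.IntegrityError:  # CHECK violated -> both UPDATEs undone
        return False

transfer(10000)  # $100 transfer fails: 'A' holds only $50, CHECK rejects it
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # 'A' still has 5000 -- neither UPDATE persisted
```

Note that atomicity comes from the transaction, not the constraint: the CHECK merely triggers the failure; the rollback undoes every statement in the block.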

SQL Joins

Joins are the mechanism for combining data spread across normalized tables. Understanding join types is essential for system design interviews:

-- "Get all orders with user info and item details"
SELECT u.name, u.email,
       o.id AS order_id, o.status, o.total_cents,
       oi.product_id, oi.quantity, oi.price_cents
FROM users u
JOIN orders o      ON o.user_id = u.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at >= '2026-01-01'
ORDER BY o.created_at DESC
LIMIT 50;

| Join Type | Returns | Use Case |
|---|---|---|
| INNER JOIN | Only matching rows from both tables | Orders with their users |
| LEFT JOIN | All rows from left + matches from right (NULL if none) | All users, even those without orders |
| RIGHT JOIN | All rows from right + matches from left | Rarely used; rewrite as LEFT JOIN |
| FULL OUTER | All rows from both tables, NULLs where no match | Data reconciliation |
| CROSS JOIN | Cartesian product (every row × every row) | Generating combinations |

When to Use SQL

Reach for SQL when your data is naturally relational, your access patterns involve joins and multi-row transactions, and your scale fits a primary with read replicas (the decision matrix below expands on this).

Popular choices:

  • PostgreSQL — the default for most startups (extensible, feature-rich, free)
  • MySQL — powers much of the web (WordPress, Shopify)
  • SQL Server — enterprise .NET ecosystems
  • Oracle — legacy enterprise (expensive, powerful)

NoSQL Categories

"NoSQL" is an umbrella term for databases that don't use the traditional relational model. The name originally meant "Not Only SQL" — these databases aren't anti-SQL; they offer different trade-offs. Each NoSQL category is optimized for specific access patterns.

Document Stores

DOC Document databases store data as self-contained JSON/BSON documents within collections. Unlike SQL rows which must conform to a fixed schema, each document can have a different structure. This makes them ideal for evolving schemas and nested data.

MongoDB Document — Same Order Data

// Single document in the "orders" collection
{
  "_id": ObjectId("665a1b2c3d4e5f6a7b8c9d0e"),
  "user": {
    "id": 42,
    "name": "Alice Chen",
    "email": "alice@example.com"
  },
  "status": "shipped",
  "total_cents": 15990,
  "items": [
    { "product_id": 101, "name": "Mechanical Keyboard", "qty": 1, "price_cents": 12990 },
    { "product_id": 205, "name": "USB-C Cable",         "qty": 2, "price_cents": 1500  }
  ],
  "shipping": {
    "address": "123 Market St, SF, CA 94105",
    "carrier": "USPS",
    "tracking": "9400111899223100012345"
  },
  "created_at": ISODate("2026-04-15T10:30:00Z")
}

Notice how the entire order — user info, line items, shipping details — lives in one document. No joins needed. This is called denormalization and it's the fundamental trade-off: you duplicate data (user name is stored in every order) to gain read performance.

Key characteristics:

  • Flexible schema — documents in one collection can differ in shape
  • Atomic single-document updates; multi-document transactions are supported but heavier
  • Secondary indexes on any field, including nested ones
  • Horizontal scaling via sharding on a chosen shard key

// MongoDB aggregation: revenue per product this month
db.orders.aggregate([
  { $match: { created_at: { $gte: ISODate("2026-04-01") } } },
  { $unwind: "$items" },
  { $group: {
      _id: "$items.product_id",
      revenue: { $sum: { $multiply: ["$items.qty", "$items.price_cents"] } },
      units_sold: { $sum: "$items.qty" }
  }},
  { $sort: { revenue: -1 } },
  { $limit: 10 }
]);

Key-Value Stores

KV The simplest NoSQL model: every piece of data is stored as a key → value pair. The database treats values as opaque blobs — it doesn't know or care about their structure. This simplicity enables extreme performance: O(1) lookups, sub-millisecond latency, and straightforward horizontal scaling.

Redis — Common Patterns

# Session storage
SET session:abc123 '{"user_id":42,"role":"admin","cart":[101,205]}' EX 3600

# Rate limiting (fixed-window counter; set the TTL only on the first hit)
SET    rate:user:42:minute 0 EX 60 NX
INCR   rate:user:42:minute

# Leaderboard
ZADD leaderboard 9500 "alice" 8700 "bob" 9200 "charlie"
ZREVRANGE leaderboard 0 9 WITHSCORES

# Pub/Sub for real-time events
PUBLISH order:updates '{"order_id":"665a","status":"shipped"}'

# Distributed lock
SET lock:order:665a "worker-7" NX EX 30

Key characteristics:

  • O(1) reads and writes by key; no secondary queries without extra structures
  • Values are opaque — filtering or aggregating requires reading them back
  • Trivial horizontal scaling: hash the key to pick a shard
  • Often in-memory (Redis, Memcached) with optional persistence

DynamoDB extends the key-value model with a sort key, enabling range queries within a partition:

// DynamoDB: Partition Key = user_id, Sort Key = order_date
{
  "PK": "USER#42",
  "SK": "ORDER#2026-04-15T10:30:00Z",
  "status": "shipped",
  "total_cents": 15990
}
// Query: all orders for user 42 in April 2026
aws dynamodb query \
  --table-name orders \
  --key-condition-expression "PK = :pk AND SK BETWEEN :start AND :end" \
  --expression-attribute-values '{
      ":pk":    {"S": "USER#42"},
      ":start": {"S": "ORDER#2026-04-01"},
      ":end":   {"S": "ORDER#2026-05-01"}}'

Column-Family Stores

COL Column-family databases organize data into rows identified by a row key, where each row can have a different set of columns grouped into column families. Unlike relational databases where every row has the same columns, column-family stores support sparse, wide rows with potentially millions of columns per row.

Cassandra — Time-Series Sensor Data

CREATE TABLE sensor_readings (
    sensor_id   TEXT,
    bucket      TEXT,           -- e.g., '2026-04-15' for daily bucketing
    reading_ts  TIMESTAMP,
    temperature DOUBLE,
    humidity    DOUBLE,
    pressure    DOUBLE,
    PRIMARY KEY ((sensor_id, bucket), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};

-- Write: append-only, no read-before-write
INSERT INTO sensor_readings (sensor_id, bucket, reading_ts, temperature, humidity)
VALUES ('temp-sf-001', '2026-04-15', '2026-04-15T14:30:00Z', 18.5, 62.3);

-- Read: efficient range scan within partition
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-sf-001' AND bucket = '2026-04-15'
  AND reading_ts >= '2026-04-15T12:00:00Z'
  AND reading_ts  < '2026-04-15T18:00:00Z';

Key characteristics:

  • Partition key determines placement; clustering key orders rows within a partition
  • Queries must target a partition — model tables around access patterns, not normalization
  • LSM-tree storage: writes are sequential appends, ideal for high-volume ingest
  • Tunable consistency per query (ONE, QUORUM, ALL)

Graph Databases

GRAPH Graph databases store data as nodes (entities) connected by edges (relationships), where both can have properties. They excel at traversing deep, complex relationships — the type of queries that would require recursive self-joins or CTEs in SQL and become prohibitively slow at depth 3+.

Neo4j — Social Network

// Create nodes
CREATE (alice:User {name: "Alice", age: 28})
CREATE (bob:User   {name: "Bob",   age: 32})
CREATE (eve:User   {name: "Eve",   age: 25})
CREATE (post1:Post {text: "Graph databases are amazing!", created: datetime()})

// Create relationships
CREATE (alice)-[:FOLLOWS]->(bob)
CREATE (bob)-[:FOLLOWS]->(eve)
CREATE (alice)-[:FOLLOWS]->(eve)
CREATE (alice)-[:WROTE]->(post1)
CREATE (bob)-[:LIKED]->(post1)

// "Friend-of-friend recommendations" — 2-hop traversal
MATCH (me:User {name: "Alice"})-[:FOLLOWS]->()-[:FOLLOWS]->(recommended)
WHERE NOT (me)-[:FOLLOWS]->(recommended) AND recommended <> me
RETURN recommended.name, COUNT(*) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 10;

// "Shortest path between two users"
MATCH path = shortestPath(
  (a:User {name: "Alice"})-[:FOLLOWS*..6]-(b:User {name: "Eve"})
)
RETURN path;

Key characteristics:

  • Traversal cost depends on edges touched, not total graph size (index-free adjacency)
  • Relationships are first-class data with their own properties
  • Best at deep, variable-length queries (friend-of-friend, shortest path)
  • Harder to shard — most deployments scale vertically or with read replicas

SQL vs NoSQL Comparison

This is the table you'll reference in every system design interview. Understanding these trade-offs is what separates a junior answer ("I'll use MongoDB because it's fast") from a senior one ("I'll use PostgreSQL for orders because we need ACID, and Redis for the cart because it's ephemeral and latency-sensitive").

| Dimension | SQL | Document | Key-Value | Column-Family | Graph |
|---|---|---|---|---|---|
| Schema | Fixed, enforced | Flexible per document | Schema-less | Dynamic columns | Property graph |
| Scaling | Vertical (read replicas help) | Horizontal (sharding) | Horizontal (trivial) | Horizontal (linear) | Vertical + graph partitioning |
| ACID | Full | Single-doc; multi-doc limited | Per-key atomic | Per-partition atomic | Full (Neo4j) |
| Query Language | SQL (declarative) | MQL / JSON queries | GET/SET commands | CQL (SQL-like) | Cypher / Gremlin |
| Consistency | Strong | Eventual (tunable) | Eventual (tunable) | Tunable per query | Strong (single node) |
| Read Perf | Excellent with indexes | Excellent (single-doc) | Best (O(1) lookup) | Great (partition scan) | Best for traversals |
| Write Perf | Good (B-tree overhead) | Good | Excellent | Best (LSM-tree) | Moderate |
| Best For | Transactions, analytics, complex queries | Content, catalogs, user profiles | Caching, sessions, counters | Time-series, IoT, logs | Social graphs, recommendations |

Decision Matrix

Use this quick checklist when choosing between SQL and NoSQL in a system design interview:

Choose SQL When…

  • Data has clear relationships and you need JOIN queries
  • You need multi-row/multi-table ACID transactions
  • Schema is well-known and relatively stable
  • You need complex aggregations (GROUP BY, window functions)
  • Regulatory compliance requires strict data integrity
  • Team is experienced with SQL

Choose NoSQL When…

  • Schema evolves rapidly or varies per record
  • Access pattern is simple (key lookup, single-entity reads)
  • You need horizontal scaling beyond a single machine
  • Write volume is extremely high (100K+ writes/sec)
  • Data is naturally hierarchical or graph-shaped
  • Eventual consistency is acceptable

The "NoSQL is always faster" myth: A properly indexed PostgreSQL query on 100M rows often outperforms a poorly modeled MongoDB collection. Performance depends on data modeling, indexing, and access patterns — not the database category. NoSQL wins on write throughput at scale and simple key lookups; SQL wins on complex queries and data integrity.

Indexing Strategies

Indexes are the single most important performance lever in any database. An unindexed query on a million rows does a full table scan (reads every row). An indexed query does a tree traversal and reads only the matching rows. The difference can be 10,000×.

B-Tree Indexes (SQL Default)

B-trees are self-balancing tree structures that keep keys in sorted order and support searches, insertions, and deletions in O(log n) time. PostgreSQL and MySQL use B+ trees, a variant in which all values live in the leaf nodes and the leaves form a linked list for efficient range scans.

Structure
        [30 | 70]           ← Root
       /    |     \
   [10|20] [50|60] [80|90]  ← Internal
   / | \   / | \   / | \
  [data] [data] [data]     ← Leaf (linked list →)
Characteristics
  • Read: O(log n) — typically 3-4 disk seeks for millions of rows
  • Write: O(log n) — must update tree + maintain balance
  • Range scans: Excellent (follow leaf linked list)
  • Best for: Equality + range queries, ORDER BY
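
The leaf level can be imitated in a few lines: a sorted array plus binary search gives the same O(log n) point in a range scan — locate the start, then walk sequentially, like following the leaf linked list (a toy sketch with made-up entries, not a real B-tree):

```python
from bisect import bisect_left

# Sorted (created_at, order_id) pairs stand in for B+ tree leaf entries
entries = [("2026-01-05", 7), ("2026-02-11", 3), ("2026-03-02", 9),
           ("2026-03-20", 1), ("2026-04-15", 4)]
keys = [k for k, _ in entries]

def range_scan(lo, hi):
    """All entries with lo <= created_at < hi."""
    i = bisect_left(keys, lo)       # O(log n) descent to the first match
    out = []
    while i < len(entries) and keys[i] < hi:
        out.append(entries[i])      # sequential walk along the "leaves"
        i += 1
    return out

print(range_scan("2026-03-01", "2026-04-01"))
# -> [('2026-03-02', 9), ('2026-03-20', 1)]
```

A hash index, by contrast, has no ordering to exploit — which is exactly why the next section's hash example cannot serve range predicates.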

Hash Indexes

Hash indexes map keys through a hash function to bucket locations. They provide O(1) lookup for exact-match queries but cannot handle range queries or sorting.

-- PostgreSQL: hash index for exact-match lookups
CREATE INDEX idx_session_token ON sessions USING hash (token);

-- Fast: WHERE token = 'abc123'
-- Cannot use: WHERE token > 'abc' (range scan impossible)

LSM-Tree + SSTables (NoSQL)

Log-Structured Merge Trees are the storage engine behind Cassandra, RocksDB, LevelDB, and HBase. They're optimized for write-heavy workloads:

LSM-Tree Write Path

1. Write → WAL (Write-Ahead Log)  — sequential disk write (fast)
2. Write → Memtable (in-memory sorted tree, e.g., red-black tree)
3. When Memtable is full → Flush to disk as an SSTable (Sorted String Table)
4. Background compaction merges SSTables periodically

Read path (slower):
1. Check Memtable (in-memory)
2. Check Bloom filter for each SSTable level (skip files that definitely don't have the key)
3. Search SSTables from newest to oldest

Why it's fast for writes: All writes are sequential (append-only). No random I/O to update a B-tree node. The trade-off is read amplification — a read might check multiple SSTables before finding the data.
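
The write and read paths above can be sketched with in-memory stand-ins — a dict plays the memtable, frozen dicts play SSTables, and a plain set plays each table's Bloom filter (illustrative only; real engines add the WAL, levels, and compaction):

```python
class ToyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}     # in-memory write buffer
        self.sstables = []     # newest first: (key_set, table) pairs
        self.limit = memtable_limit

    def put(self, key, value):
        # Steps 1-2: append to WAL (omitted here) + write to memtable
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Step 3: full memtable becomes an immutable "SSTable"; the key
        # set stands in for its Bloom filter (real filters can false-positive)
        self.sstables.insert(0, (set(self.memtable), dict(self.memtable)))
        self.memtable = {}

    def get(self, key):
        # Read path: memtable, then SSTables newest -> oldest,
        # skipping tables whose filter rules the key out
        if key in self.memtable:
            return self.memtable[key]
        for bloom, table in self.sstables:
            if key in bloom:
                return table[key]
        return None

db = ToyLSM()
for i in range(7):
    db.put(f"k{i}", i)    # triggers two flushes along the way
db.put("k1", 100)         # newer memtable write shadows the flushed value
```

Reading `k1` now returns 100 from the memtable without touching the older SSTable copy — the newest-to-oldest search order is what makes overwrites work without read-before-write.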

Inverted Indexes (Search)

Used by Elasticsearch and Solr. An inverted index maps terms → list of documents that contain them. This is how full-text search works:

Term           → Document IDs
───────────────────────────────
"database"     → [doc1, doc3, doc7, doc12]
"nosql"        → [doc3, doc5, doc7]
"performance"  → [doc1, doc5, doc9]

Query: "nosql AND performance"
→ Intersect: [doc3, doc5, doc7] ∩ [doc1, doc5, doc9] = [doc5]
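
A minimal version of building and querying such an index — the documents are hypothetical but chosen to reproduce the posting lists above:

```python
from collections import defaultdict

docs = {
    "doc1":  "database performance tuning",
    "doc3":  "database nosql comparison",
    "doc5":  "nosql performance benchmarks",
    "doc7":  "database nosql at scale",
    "doc9":  "web performance tips",
    "doc12": "database indexing",
}

# Build: term -> set of document ids containing it
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():       # real engines also tokenize, stem, lowercase
        inverted[term].add(doc_id)

def search_and(*terms):
    """AND query = set intersection of the terms' posting lists."""
    postings = [inverted.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_and("nosql", "performance"))   # -> ['doc5']
```

Posting lists in real engines are kept sorted and compressed so the intersection can be done with a merge walk instead of hash sets, but the algebra is the same.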

Composite Indexes

A composite index covers multiple columns. Column order matters — the index is sorted by the first column, then second within ties, and so on (leftmost prefix rule):

CREATE INDEX idx_user_status_date ON orders(user_id, status, created_at DESC);

-- ✅ Uses index (matches leftmost prefix):
WHERE user_id = 42
WHERE user_id = 42 AND status = 'shipped'
WHERE user_id = 42 AND status = 'shipped' AND created_at > '2026-01-01'

-- ❌ Cannot use index efficiently (skips user_id):
WHERE status = 'shipped'
WHERE created_at > '2026-01-01'
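
The leftmost-prefix rule falls straight out of how the index is sorted; a toy sorted list of tuples plus binary search makes it visible (rows are made up for illustration):

```python
from bisect import bisect_left

# Composite index as sorted (user_id, status, created_at) tuples
idx = sorted([
    (7,  "pending", "2026-02-01"),
    (42, "pending", "2026-03-10"),
    (42, "shipped", "2026-01-20"),
    (42, "shipped", "2026-04-15"),
    (99, "shipped", "2026-02-14"),
])

def prefix_lookup(user_id, status=None):
    """Binary search works only on a leftmost prefix of the sort key."""
    lo = (user_id,) if status is None else (user_id, status)
    hi = lo + ("\uffff",)            # sentinel sorting just past the prefix
    return idx[bisect_left(idx, lo):bisect_left(idx, hi)]

print(prefix_lookup(42, "shipped"))
# -> [(42, 'shipped', '2026-01-20'), (42, 'shipped', '2026-04-15')]

# A query on status alone can't use this ordering: 'shipped' rows are
# scattered across every user_id, so the database falls back to a scan.
```

This is why the index column order should mirror your most selective, most common filters first.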

Index Impact: Read vs Write

| Index Type | Read Boost | Write Cost | Storage Overhead |
|---|---|---|---|
| B-Tree | 10–10,000× | ~10–30% slower writes | ~10–20% of table size |
| Hash | O(1) for exact match | Minimal | ~10% of table size |
| LSM-Tree | Good (with Bloom filters) | Minimal (sequential I/O) | Compaction uses extra space |
| Inverted | Instant full-text search | Significant (tokenize + update) | Can exceed table size |

Rule of thumb: Every index speeds up reads but slows down writes. A table with 10 indexes means every INSERT updates 10 separate data structures. For write-heavy workloads, keep indexes minimal and consider LSM-tree based databases.

Polyglot Persistence

Real-world systems almost never use a single database. Polyglot persistence means using multiple database types within one system, each chosen for the workload it handles best.

Example: E-Commerce Platform

Multiple Databases in One System

┌────────────────────────────────────────────────────────────┐
│                        API Gateway                         │
└─────┬──────────┬───────────┬───────────┬──────────┬────────┘
      │          │           │           │          │
 ┌────▼─────┐ ┌──▼────┐ ┌────▼────┐ ┌────▼───┐ ┌────▼───────┐
 │   SQL    │ │ Redis │ │ Elastic │ │ Neo4j  │ │ Cassandra  │
 │(Postgres)│ │(Cache)│ │ search  │ │ (Recs) │ │(Event Log) │
 └────┬─────┘ └──┬────┘ └────┬────┘ └────┬───┘ └────┬───────┘
      │          │           │           │          │
  Orders &   Shopping    Product     "Users who   Clickstream
  Payments   Cart &      Search &    also bought" & Activity
  Inventory  Sessions    Autocomplete  engine     Tracking

| Service | Database | Why This DB |
|---|---|---|
| Orders & Payments | PostgreSQL | ACID transactions, complex queries, financial audit trails |
| Shopping Cart | Redis | Sub-ms latency, ephemeral data, TTL for abandoned carts |
| Product Search | Elasticsearch | Inverted index, fuzzy matching, faceted search, autocomplete |
| Recommendations | Neo4j | Graph traversals for "users who bought X also bought Y" |
| Activity Log | Cassandra | Append-only, massive write throughput, time-series partitioning |

Data Synchronization Challenges

The hardest part of polyglot persistence is keeping data consistent across databases. Common strategies:

  • Dual writes — the application writes to both stores (simple, but one write can fail and leave them diverged)
  • Change data capture (CDC) — tail the primary database's log (e.g., with Debezium) and replay changes into the others
  • Outbox pattern — write the event into an outbox table in the same transaction as the data, then publish it asynchronously

-- Outbox pattern: atomic write + event in one transaction
BEGIN;
  INSERT INTO orders (user_id, total_cents, status)
  VALUES (42, 15990, 'created');

  INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
  VALUES ('Order', currval('orders_id_seq'), 'OrderCreated',
          '{"user_id":42,"total":15990}');
COMMIT;
-- Background worker polls outbox → publishes to Kafka → deletes outbox row
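
The background worker from the comment above can be sketched with in-memory stand-ins for the outbox table and the Kafka producer (all names are illustrative; a real worker would batch-select with row locks and call `producer.send`):

```python
import json

# Stand-ins: one list plays the outbox table, another plays the Kafka topic
outbox = [
    {"id": 1, "event_type": "OrderCreated", "payload": '{"user_id":42,"total":15990}'},
    {"id": 2, "event_type": "OrderShipped", "payload": '{"order_id":"665a"}'},
]
topic = []

def poll_outbox_once(batch_size=100):
    """One polling cycle: read a batch, publish each event, then delete.
    Publishing before deleting gives at-least-once delivery -- consumers
    must deduplicate (e.g., by outbox id) to tolerate redelivery."""
    batch = outbox[:batch_size]
    for event in batch:
        topic.append({"key": event["id"],
                      "type": event["event_type"],
                      "value": json.loads(event["payload"])})
    del outbox[:len(batch)]      # delete only after a successful publish
    return len(batch)

processed = poll_outbox_once()
print(processed, len(outbox))    # 2 events drained, outbox now empty
```

Because the outbox row was committed atomically with the order, a crash between publish and delete can only cause duplicates, never lost events — the trade-off the pattern is built around.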

Database Selection Framework

In a system design interview, choosing a database should be a deliberate, reasoned decision, not a default. Walk through these questions:

Questions to Ask

  1. What does the data look like? — Structured tables? Nested documents? A graph of relationships? Time-series events?
  2. What are the primary access patterns? — Key lookups? Range scans? Complex joins? Full-text search? Graph traversals?
  3. What consistency level is required? — Strong ACID? Eventual consistency? Tunable per query?
  4. What is the expected scale? — 1K reads/sec (any database works) vs. 1M reads/sec (need horizontal scaling)?
  5. What's the read:write ratio? — Read-heavy → optimize indexes, caching. Write-heavy → LSM-tree databases, append-only stores.
  6. What are the latency requirements? — Sub-millisecond (Redis) vs. acceptable 10-50ms (SQL with indexes)?
  7. What does the team know? — A PostgreSQL expert team shipping with Postgres beats a team struggling with Cassandra's data modeling.

Quick Reference: Database per Use Case

| Use Case | Recommended | Why |
|---|---|---|
| User accounts & auth | PostgreSQL | Strong consistency, constraints, mature auth libraries |
| Session management | Redis | Sub-ms lookups, TTL expiry, in-memory speed |
| Product catalog | MongoDB | Varying attributes per product category, nested specs |
| Full-text search | Elasticsearch | Inverted index, relevance scoring, autocomplete |
| IoT sensor data | Cassandra / TimescaleDB | Append-only writes, time-windowed queries, linear scale |
| Social network | Neo4j | Friend-of-friend, shortest path, community detection |
| Financial ledger | PostgreSQL / CockroachDB | ACID, serializable isolation, audit trails |
| Real-time analytics | ClickHouse / Druid | Columnar storage, fast aggregations on billions of rows |
| Config & feature flags | Redis / etcd | Fast reads, pub/sub for change notification |

Interview tip: Always justify your database choice with the specific requirements. "I'll use PostgreSQL for the orders service because we need ACID transactions for payments, and the data is naturally relational with orders, items, and users. For the caching layer, Redis — because cart data is ephemeral, sub-millisecond latency is critical, and we can tolerate losing cart data on failure."