SQL vs NoSQL
Database Landscape
Every production system needs to persist data, and the database you choose shapes nearly everything downstream — your query patterns, consistency guarantees, scaling strategy, and operational complexity. The landscape has evolved dramatically from the era when a relational database was effectively the only choice to today's ecosystem of specialized engines, each optimized for a particular access pattern.
Here's a quick timeline of how we got here:
- 1970s: Edgar Codd publishes the relational model. IBM builds System R; Oracle ships the first commercial RDBMS.
- 1980s–90s: SQL becomes the ANSI standard (1986). Oracle, DB2, and SQL Server dominate the enterprise, while PostgreSQL and MySQL emerge in the mid-90s as open-source alternatives. Object databases appear briefly but never gain mainstream traction.
- 2000s: Google publishes the Bigtable paper (2006) and Amazon publishes the Dynamo paper (2007). These papers spark the NoSQL movement — systems that trade strict consistency for horizontal scalability and flexible schemas.
- 2010s: MongoDB, Cassandra, Redis, Neo4j, and Elasticsearch mature. The term "polyglot persistence" emerges: use the right database for each workload within a single system.
- 2020s: NewSQL databases (CockroachDB, TiDB, Spanner) attempt to combine SQL's transactional guarantees with NoSQL's scalability. Serverless databases (PlanetScale, Neon, DynamoDB on-demand) reduce operational burden.
The modern landscape breaks down into five major categories:
| Category | Data Model | Examples |
|---|---|---|
| Relational (SQL) | Tables with rows & columns, foreign keys | PostgreSQL, MySQL, Oracle |
| Document | JSON/BSON documents in collections | MongoDB, CouchDB, Firestore |
| Key-Value | Opaque values indexed by key | Redis, DynamoDB, Memcached |
| Column-Family | Wide rows with dynamic columns | Cassandra, HBase, ScyllaDB |
| Graph | Nodes + edges with properties | Neo4j, Amazon Neptune, JanusGraph |
Relational Databases (SQL)
Relational databases store data in tables (relations) composed of rows (tuples) and columns (attributes). Each table has a fixed schema that defines column names, types, and constraints. Relationships between tables are expressed through foreign keys, and data is queried using SQL (Structured Query Language).
Core Concepts
Example Schema — E-Commerce Orders
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT REFERENCES users(id),
status VARCHAR(20) DEFAULT 'pending',
total_cents BIGINT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE order_items (
id BIGSERIAL PRIMARY KEY,
order_id BIGINT REFERENCES orders(id),
product_id BIGINT NOT NULL,
quantity INT NOT NULL,
price_cents BIGINT NOT NULL
);
-- Index for fast lookup by user
CREATE INDEX idx_orders_user ON orders(user_id);
-- Composite index for queries filtered by status + date
CREATE INDEX idx_orders_status_date ON orders(status, created_at DESC);
ACID Transactions
The defining strength of relational databases is ACID compliance:
- Atomicity: A transaction either fully completes or fully rolls back. If step 3 of a 5-step transaction fails, steps 1 and 2 are undone.
- Consistency: Every transaction moves the database from one valid state to another. Constraints (foreign keys, uniqueness, CHECK constraints) are enforced.
- Isolation: Concurrent transactions don't interfere with each other. Isolation levels (READ UNCOMMITTED → READ COMMITTED → REPEATABLE READ → SERIALIZABLE) let you trade performance for strictness — see the retry sketch after the example below.
- Durability: Once committed, data survives power failures. The WAL (Write-Ahead Log) ensures this.
-- Transfer $100 from Account A to Account B
BEGIN;
UPDATE accounts SET balance = balance - 10000 WHERE id = 'A';
UPDATE accounts SET balance = balance + 10000 WHERE id = 'B';
-- If either UPDATE fails, ROLLBACK undoes both
-- Constraints: CHECK (balance >= 0) prevents overdrafts
COMMIT;
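The isolation bullet above matters in practice: under SERIALIZABLE, the database may abort a transaction to preserve correctness, and the application is expected to retry it. A minimal sketch of that retry loop, assuming the psycopg2 driver — the connection string and the accounts table contents are hypothetical:

import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=bank")  # hypothetical DSN

def transfer(src: str, dst: str, cents: int, retries: int = 3) -> None:
    for _ in range(retries):
        try:
            with conn, conn.cursor() as cur:  # commits on success, rolls back on error
                cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
                cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                            (cents, src))
                cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                            (cents, dst))
            return
        except errors.SerializationFailure:
            continue  # the database aborted us; safe to retry the whole transaction
    raise RuntimeError("transfer failed after retries")

transfer("A", "B", 10_000)  # $100, in cents — CHECK (balance >= 0) still applies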
SQL Joins
Joins are the mechanism for combining data spread across normalized tables. Understanding join types is essential for system design interviews:
-- "Get all orders with user info and item details"
SELECT u.name, u.email,
o.id AS order_id, o.status, o.total_cents,
oi.product_id, oi.quantity, oi.price_cents
FROM users u
JOIN orders o ON o.user_id = u.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.created_at >= '2026-01-01'
ORDER BY o.created_at DESC
LIMIT 50;
| Join Type | Returns | Use Case |
|---|---|---|
| INNER JOIN | Only matching rows from both tables | Orders with their users |
| LEFT JOIN | All rows from left + matches from right (NULL if none) | All users, even those without orders |
| RIGHT JOIN | All rows from right + matches from left | Rarely used; rewrite as LEFT JOIN |
| FULL OUTER | All rows from both tables, NULLs where no match | Data reconciliation |
| CROSS JOIN | Cartesian product (every row × every row) | Generating combinations |
When to Use SQL
- Complex queries with joins: Analytics, reporting, ad-hoc queries across multiple entities.
- Strict transactional requirements: Financial systems, inventory management, booking systems.
- Structured, well-defined data: When your schema is stable and data integrity is paramount.
- Mature tooling needs: ORMs, migration frameworks, monitoring — the SQL ecosystem is the most mature.
NoSQL Categories
"NoSQL" is an umbrella term for databases that don't use the traditional relational model. The name originally meant "Not Only SQL" — these databases aren't anti-SQL; they offer different trade-offs. Each NoSQL category is optimized for specific access patterns.
Document Stores
Document databases store data as self-contained JSON/BSON documents within collections. Unlike SQL rows, which must conform to a fixed schema, each document can have a different structure. This makes them ideal for evolving schemas and nested data.
MongoDB Document — Same Order Data
// Single document in the "orders" collection
{
"_id": ObjectId("665a1b2c3d4e5f6a7b8c9d0e"),
"user": {
"id": 42,
"name": "Alice Chen",
"email": "alice@example.com"
},
"status": "shipped",
"total_cents": 15990,
"items": [
{ "product_id": 101, "name": "Mechanical Keyboard", "qty": 1, "price_cents": 12990 },
{ "product_id": 205, "name": "USB-C Cable", "qty": 2, "price_cents": 1500 }
],
"shipping": {
"address": "123 Market St, SF, CA 94105",
"carrier": "USPS",
"tracking": "9400111899223100012345"
},
"created_at": ISODate("2026-04-15T10:30:00Z")
}
Notice how the entire order — user info, line items, shipping details — lives in one document. No joins needed. This is called denormalization and it's the fundamental trade-off: you duplicate data (user name is stored in every order) to gain read performance.
Key characteristics:
- Flexible schema: Add fields to new documents without ALTER TABLE migrations.
- Nested data: Arrays and embedded objects model one-to-many relationships naturally.
- Horizontal scaling: Shard by a key (e.g., user.id) to distribute data across nodes.
- Rich queries: MongoDB supports secondary indexes, aggregation pipelines, full-text search, and geospatial queries.
- Trade-off: Multi-document ACID transactions are limited — MongoDB added them for replica sets in 4.0 and for sharded clusters in 4.2, but they carry performance caveats.
// MongoDB aggregation: revenue per product this month
db.orders.aggregate([
{ $match: { created_at: { $gte: ISODate("2026-04-01") } } },
{ $unwind: "$items" },
{ $group: {
_id: "$items.product_id",
revenue: { $sum: { $multiply: ["$items.qty", "$items.price_cents"] } },
units_sold: { $sum: "$items.qty" }
}},
{ $sort: { revenue: -1 } },
{ $limit: 10 }
]);
Key-Value Stores
The simplest NoSQL model: every piece of data is stored as a key → value pair. The database treats values as opaque blobs — it doesn't know or care about their structure. This simplicity enables extreme performance: O(1) lookups, sub-millisecond latency, and straightforward horizontal scaling.
Redis — Common Patterns
# Session storage
SET session:abc123 '{"user_id":42,"role":"admin","cart":[101,205]}' EX 3600
# Rate limiting (fixed window — a sliding-window variant is sketched below)
INCR rate:user:42:minute
EXPIRE rate:user:42:minute 60
# Leaderboard
ZADD leaderboard 9500 "alice" 8700 "bob" 9200 "charlie"
ZREVRANGE leaderboard 0 9 WITHSCORES
# Pub/Sub for real-time events
PUBLISH order:updates '{"order_id":"665a","status":"shipped"}'
# Distributed lock
SET lock:order:665a "worker-7" NX EX 30
Key characteristics:
- Blazing fast: Redis handles 100,000+ operations/second on a single node. Data lives in memory.
- Data structures: Redis goes beyond simple key-value: strings, hashes, lists, sets, sorted sets, streams, bitmaps, HyperLogLog.
- Ephemeral by nature: Ideal for caching, sessions, rate limiters, and real-time leaderboards.
- Persistence optional: Redis offers RDB snapshots and AOF (Append-Only File) for durability, but many deployments treat it as a cache layer.
- Trade-off: No complex queries, no joins, limited by memory size.
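The INCR/EXPIRE pattern above approximates a fixed window; a true sliding window is a classic use of sorted sets, scoring each request by its timestamp. A minimal sketch, assuming the redis-py client and a local Redis instance (the key name and limits are illustrative):

import time
import uuid
import redis

r = redis.Redis()

def allow_request(user_id: int, limit: int = 100, window_s: int = 60) -> bool:
    key = f"rate:user:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # evict requests outside the window
    pipe.zadd(key, {str(uuid.uuid4()): now})       # record this request
    pipe.zcard(key)                                # count requests still in the window
    pipe.expire(key, window_s)                     # let idle keys expire
    _, _, count, _ = pipe.execute()
    return count <= limit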
DynamoDB extends the key-value model with a sort key, enabling range queries within a partition:
// DynamoDB: Partition Key = user_id, Sort Key = order_date
{
"PK": "USER#42",
"SK": "ORDER#2026-04-15T10:30:00Z",
"status": "shipped",
"total_cents": 15990
}
// Query: all orders for user 42 in April 2026
aws dynamodb query \
  --table-name orders \
  --key-condition-expression "PK = :pk AND SK BETWEEN :start AND :end" \
  --expression-attribute-values '{
    ":pk":    {"S": "USER#42"},
    ":start": {"S": "ORDER#2026-04-01"},
    ":end":   {"S": "ORDER#2026-04-30T23:59:59Z"}
  }'
Column-Family Stores
Column-family databases organize data into rows identified by a row key, where each row can have a different set of columns grouped into column families. Unlike relational databases where every row has the same columns, column-family stores support sparse, wide rows with potentially millions of columns per row.
Cassandra — Time-Series Sensor Data
CREATE TABLE sensor_readings (
sensor_id TEXT,
bucket TEXT, -- e.g., '2026-04-15' for daily bucketing
reading_ts TIMESTAMP,
temperature DOUBLE,
humidity DOUBLE,
pressure DOUBLE,
PRIMARY KEY ((sensor_id, bucket), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC)
AND compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 1};
-- Write: append-only, no read-before-write
INSERT INTO sensor_readings (sensor_id, bucket, reading_ts, temperature, humidity)
VALUES ('temp-sf-001', '2026-04-15', '2026-04-15T14:30:00Z', 18.5, 62.3);
-- Read: efficient range scan within partition
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-sf-001' AND bucket = '2026-04-15'
AND reading_ts >= '2026-04-15T12:00:00Z'
AND reading_ts < '2026-04-15T18:00:00Z';
Key characteristics:
- Write-optimized: Cassandra uses an LSM-tree storage engine. Writes are sequential (append to commit log + memtable), which is often dramatically faster than B-tree random I/O for write-heavy workloads.
- Linear scalability: Add nodes → throughput scales linearly. No single master; every node is equal (peer-to-peer gossip protocol).
- Tunable consistency: Choose per query: ONE (fastest), QUORUM (balanced), ALL (strongest). You trade latency for consistency — see the sketch after this list.
- Partition-centric: All efficient queries stay within a single partition. Cross-partition queries (scatter-gather) are expensive.
- Trade-off: No joins, no ad-hoc queries, no secondary indexes at scale (use materialized views or denormalization instead). Data modeling is driven by query patterns, not entity relationships.
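Tunable consistency is set per statement in client code. A minimal sketch with the DataStax Python driver (cassandra-driver); the contact point and keyspace name are assumptions:

from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("telemetry")  # hypothetical keyspace

# ONE: fastest read — a single replica answers, possibly with stale data.
fast_read = SimpleStatement(
    "SELECT * FROM sensor_readings WHERE sensor_id = %s AND bucket = %s",
    consistency_level=ConsistencyLevel.ONE)
rows = session.execute(fast_read, ("temp-sf-001", "2026-04-15"))

# QUORUM: a majority of replicas must acknowledge the write.
safe_write = SimpleStatement(
    "INSERT INTO sensor_readings (sensor_id, bucket, reading_ts, temperature) "
    "VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(safe_write, ("temp-sf-001", "2026-04-15",
                             datetime(2026, 4, 15, 14, 30, tzinfo=timezone.utc), 18.5))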
Graph Databases
Graph databases store data as nodes (entities) connected by edges (relationships), where both can have properties. They excel at traversing deep, complex relationships — the type of queries that would require recursive self-joins or CTEs in SQL and become prohibitively slow at depth 3+.
Neo4j — Social Network
// Create nodes
CREATE (alice:User {name: "Alice", age: 28})
CREATE (bob:User {name: "Bob", age: 32})
CREATE (eve:User {name: "Eve", age: 25})
CREATE (post1:Post {text: "Graph databases are amazing!", created: datetime()})
// Create relationships
CREATE (alice)-[:FOLLOWS]->(bob)
CREATE (bob)-[:FOLLOWS]->(eve)
CREATE (alice)-[:FOLLOWS]->(eve)
CREATE (alice)-[:WROTE]->(post1)
CREATE (bob)-[:LIKED]->(post1)
// "Friend-of-friend recommendations" — 2-hop traversal
MATCH (me:User {name: "Alice"})-[:FOLLOWS]->()-[:FOLLOWS]->(recommended)
WHERE NOT (me)-[:FOLLOWS]->(recommended) AND recommended <> me
RETURN recommended.name, COUNT(*) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 10;
// "Shortest path between two users"
MATCH path = shortestPath(
(a:User {name: "Alice"})-[:FOLLOWS*..6]-(b:User {name: "Eve"})
)
RETURN path;
Key characteristics:
- Index-free adjacency: Each node stores direct pointers to its neighbors, so traversing a relationship is O(1) regardless of total graph size — see the toy sketch after this list.
- Relationship-first: Relationships are first-class citizens, not implicit join columns. Each edge has a type, direction, and properties.
- Deep traversals: Queries like "find all paths up to 6 hops" that would timeout in SQL complete in milliseconds.
- Use cases: Social networks, recommendation engines, fraud detection, knowledge graphs, network topology, access control (RBAC).
- Trade-off: Not great for aggregations, bulk analytics, or data that doesn't have meaningful relationships. Scaling is harder than Cassandra/DynamoDB.
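Index-free adjacency is easy to see in miniature: if the graph is an adjacency map, each hop is a direct lookup, with no index probe or join. A toy sketch (hypothetical data) of the friend-of-friend query from the Cypher example:

follows = {
    "alice": {"bob", "eve"},
    "bob": {"eve", "dan"},
    "eve": {"dan"},
    "dan": set(),
}

def friend_of_friend(user: str) -> dict[str, int]:
    """2-hop candidates the user doesn't already follow, with mutual counts."""
    counts: dict[str, int] = {}
    for friend in follows[user]:            # hop 1: direct pointer chase
        for candidate in follows[friend]:   # hop 2: another pointer chase
            if candidate != user and candidate not in follows[user]:
                counts[candidate] = counts.get(candidate, 0) + 1
    return counts

print(friend_of_friend("alice"))  # {'dan': 2} — two mutual connections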
SQL vs NoSQL Comparison
This is the table you'll reference in every system design interview. Understanding these trade-offs is what separates a junior answer ("I'll use MongoDB because it's fast") from a senior one ("I'll use PostgreSQL for orders because we need ACID, and Redis for the cart because it's ephemeral and latency-sensitive").
| Dimension | SQL | Document | Key-Value | Column-Family | Graph |
|---|---|---|---|---|---|
| Schema | Fixed, enforced | Flexible per document | Schema-less | Dynamic columns | Property graph |
| Scaling | Vertical (read replicas help) | Horizontal (sharding) | Horizontal (trivial) | Horizontal (linear) | Vertical + graph partitioning |
| ACID | Full | Single-doc; multi-doc limited | Per-key atomic | Per-partition atomic | Full (Neo4j) |
| Query Language | SQL (declarative) | MQL / JSON queries | GET/SET commands | CQL (SQL-like) | Cypher / Gremlin |
| Consistency | Strong | Eventual (tunable) | Eventual (tunable) | Tunable per query | Strong (single node) |
| Read Perf | Excellent with indexes | Excellent (single-doc) | Best (O(1) lookup) | Great (partition scan) | Best for traversals |
| Write Perf | Good (B-tree overhead) | Good | Excellent | Best (LSM-tree) | Moderate |
| Best For | Transactions, analytics, complex queries | Content, catalogs, user profiles | Caching, sessions, counters | Time-series, IoT, logs | Social graphs, recommendations |
Decision Matrix
Use this quick checklist when choosing between SQL and NoSQL in a system design interview:
Choose SQL When…
- Data has clear relationships and you need JOIN queries
- You need multi-row/multi-table ACID transactions
- Schema is well-known and relatively stable
- You need complex aggregations (GROUP BY, window functions)
- Regulatory compliance requires strict data integrity
- Team is experienced with SQL
Choose NoSQL When…
- Schema evolves rapidly or varies per record
- Access pattern is simple (key lookup, single-entity reads)
- You need horizontal scaling beyond a single machine
- Write volume is extremely high (100K+ writes/sec)
- Data is naturally hierarchical or graph-shaped
- Eventual consistency is acceptable
Indexing Strategies
Indexes are the single most important performance lever in any database. An unindexed query on a million rows does a full table scan (reads every row). An indexed query does a tree traversal and reads only the matching rows. The difference can be 10,000×.
B-Tree Indexes (SQL Default)
B-trees are self-balancing tree structures that keep data in sorted order and support searches, insertions, and deletions in O(log n) time. PostgreSQL and MySQL use B+ trees (where leaf nodes form a linked list for efficient range scans).
              [30 | 70]              ← Root
             /    |    \
     [10|20]   [50|60]   [80|90]     ← Internal
     / | \     / | \     / | \
  [data] [data] [data]  ...          ← Leaf (linked list →)
- Read: O(log n) — typically 3-4 disk seeks for millions of rows
- Write: O(log n) — must update tree + maintain balance
- Range scans: Excellent (follow leaf linked list)
- Best for: Equality + range queries, ORDER BY — see the toy lookup sketch below
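The gap between a full scan and a tree traversal is easy to demonstrate. A toy sketch in which Python's bisect stands in for B-tree descent (the data is hypothetical):

import bisect

rows = list(range(0, 1_000_000, 2))  # a sorted, "indexed" column: 500K values

def full_scan(target: int) -> bool:          # unindexed: O(n), touches every row
    return any(r == target for r in rows)

def index_lookup(target: int) -> bool:       # indexed: O(log n), ~19 comparisons
    i = bisect.bisect_left(rows, target)
    return i < len(rows) and rows[i] == target

assert full_scan(421_034) and index_lookup(421_034)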
Hash Indexes
Hash indexes map keys through a hash function to bucket locations. They provide O(1) lookup for exact-match queries but cannot handle range queries or sorting.
-- PostgreSQL: hash index for exact-match lookups
CREATE INDEX idx_session_token ON sessions USING hash (token);
-- Fast: WHERE token = 'abc123'
-- Cannot use: WHERE token > 'abc' (range scan impossible)
LSM-Tree + SSTables (NoSQL)
Log-Structured Merge Trees are the storage engine behind Cassandra, RocksDB, LevelDB, and HBase. They're optimized for write-heavy workloads:
LSM-Tree Write Path
1. Write → WAL (Write-Ahead Log) — sequential disk write (fast)
2. Write → Memtable (in-memory sorted tree, e.g., red-black tree)
3. When Memtable is full → Flush to disk as an SSTable (Sorted String Table)
4. Background compaction merges SSTables periodically
Read path (slower):
1. Check Memtable (in-memory)
2. Check Bloom filter for each SSTable level (skip files that definitely don't have the key)
3. Search SSTables from newest to oldest
Why it's fast for writes: All writes are sequential (append-only). No random I/O to update a B-tree node. The trade-off is read amplification — a read might check multiple SSTables before finding the data.
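The whole pipeline fits in a few lines. A miniature sketch of the write and read paths above — toy code: the WAL and compaction are omitted, and an exact set of keys stands in for the probabilistic Bloom filter:

MEMTABLE_LIMIT = 4

memtable: dict[str, str] = {}                          # in-memory buffer
sstables: list[tuple[dict[str, str], set[str]]] = []   # immutable, newest last

def put(key: str, value: str) -> None:
    memtable[key] = value                     # steps 1-2: WAL (omitted) + memtable
    if len(memtable) >= MEMTABLE_LIMIT:       # step 3: flush a full memtable
        sstables.append((dict(sorted(memtable.items())), set(memtable)))
        memtable.clear()

def get(key: str) -> str | None:
    if key in memtable:                       # read 1: check memtable first
        return memtable[key]
    for table, bloom in reversed(sstables):   # read 3: newest SSTable first
        if key in bloom:                      # read 2: skip tables without the key
            return table[key]
    return None

put("a", "1"); put("b", "2"); put("c", "3"); put("d", "4")   # triggers a flush
put("a", "9")                     # newer write shadows the flushed value
assert get("a") == "9" and get("b") == "2"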
Inverted Indexes (Search)
Used by Elasticsearch and Solr. An inverted index maps terms → list of documents that contain them. This is how full-text search works:
Term → Document IDs
───────────────────────────────
"database" → [doc1, doc3, doc7, doc12]
"nosql" → [doc3, doc5, doc7]
"performance" → [doc1, doc5, doc9]
Query: "nosql AND performance"
→ Intersect: [doc3, doc5, doc7] ∩ [doc1, doc5, doc9] = [doc5]
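Building and querying that structure takes only a few lines. A sketch with hypothetical documents (real engines also lowercase, stem, and rank results, all skipped here):

from collections import defaultdict

docs = {
    "doc1": "database performance tuning",
    "doc3": "database nosql comparison",
    "doc5": "nosql performance benchmarks",
}

index: defaultdict[str, set[str]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():         # naive tokenizer
        index[term].add(doc_id)

# AND query = intersection of posting lists
print(index["nosql"] & index["performance"])   # {'doc5'}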
Composite Indexes
A composite index covers multiple columns. Column order matters — the index is sorted by the first column, then second within ties, and so on (leftmost prefix rule):
CREATE INDEX idx_user_status_date ON orders(user_id, status, created_at DESC);
-- ✅ Uses index (matches leftmost prefix):
WHERE user_id = 42
WHERE user_id = 42 AND status = 'shipped'
WHERE user_id = 42 AND status = 'shipped' AND created_at > '2026-01-01'
-- ❌ Cannot use index efficiently (skips user_id):
WHERE status = 'shipped'
WHERE created_at > '2026-01-01'
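The leftmost-prefix rule falls directly out of sort order. In this toy sketch a composite index is just a list of tuples sorted by (user_id, status, created_at): rows sharing a user_id are contiguous and binary-searchable, while rows sharing only a status are scattered (data is hypothetical):

import bisect

idx = sorted([
    (7,  "shipped", "2026-01-10"),
    (42, "pending", "2026-02-01"),
    (42, "shipped", "2026-01-05"),
    (99, "shipped", "2026-03-12"),
])

# Leftmost prefix (user_id = 42): one contiguous slice, found in O(log n).
lo, hi = bisect.bisect_left(idx, (42,)), bisect.bisect_left(idx, (43,))
print(idx[lo:hi])                                  # both rows for user 42

# Skipping user_id (status = 'shipped'): matches are scattered — full scan.
print([row for row in idx if row[1] == "shipped"])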
Index Impact: Read vs Write
| Index Type | Read Boost | Write Cost | Storage Overhead |
|---|---|---|---|
| B-Tree | 10–10,000× | ~10–30% slower writes | ~10–20% of table size |
| Hash | O(1) for exact match | Minimal | ~10% of table size |
| LSM-Tree | Good (with Bloom filters) | Minimal (sequential I/O) | Compaction uses extra space |
| Inverted | Instant full-text search | Significant (tokenize + update) | Can exceed table size |
Polyglot Persistence
Real-world systems almost never use a single database. Polyglot persistence means using multiple database types within one system, each chosen for the workload it handles best.
Example: E-Commerce Platform
Multiple Databases in One System
┌────────────────────────────────────────────────────────────────┐
│                          API Gateway                           │
└──────┬──────────┬───────────┬──────────┬──────────┬───────────┘
       │          │           │          │          │
 ┌─────▼────┐ ┌───▼────┐ ┌────▼────┐ ┌───▼───┐ ┌────▼───────┐
 │   SQL    │ │ Redis  │ │ Elastic │ │ Neo4j │ │ Cassandra  │
 │(Postgres)│ │(Cache) │ │ search  │ │(Recs) │ │(Event Log) │
 └──────────┘ └────────┘ └─────────┘ └───────┘ └────────────┘
      │           │           │          │           │
   Orders &    Shopping    Product   "Users who  Clickstream
   Payments     Cart &     Search   also bought"  & Activity
   Inventory   Sessions  Autocomplete   engine     Tracking
| Service | Database | Why This DB |
|---|---|---|
| Orders & Payments | PostgreSQL | ACID transactions, complex queries, financial audit trails |
| Shopping Cart | Redis | Sub-ms latency, ephemeral data, TTL for abandoned carts |
| Product Search | Elasticsearch | Inverted index, fuzzy matching, faceted search, autocomplete |
| Recommendations | Neo4j | Graph traversals for "users who bought X also bought Y" |
| Activity Log | Cassandra | Append-only, massive write throughput, time-series partitioning |
Data Synchronization Challenges
The hardest part of polyglot persistence is keeping data consistent across databases. Common strategies:
- Change Data Capture (CDC): Use tools like Debezium to stream PostgreSQL WAL changes to Kafka, then consume events to update Elasticsearch, Redis, etc. This provides near-real-time sync with minimal coupling.
- Dual Writes: Write to both databases in application code. Avoid this — if one write fails, your databases are inconsistent with no recovery mechanism.
- Event Sourcing: Publish domain events (OrderPlaced, ItemShipped) to a message broker. Each database subscribes to relevant events and updates itself independently.
- Outbox Pattern: Write the business data AND an event record to the same SQL database in one transaction. A background worker reads the outbox table and publishes to Kafka.
-- Outbox pattern: atomic write + event in one transaction
BEGIN;
INSERT INTO orders (user_id, total_cents, status)
VALUES (42, 15990, 'created');
INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('Order', currval('orders_id_seq'), 'OrderCreated',
'{"user_id":42,"total":15990}');
COMMIT;
-- Background worker polls outbox → publishes to Kafka → deletes outbox row
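That background worker is ordinary application code. A minimal sketch of the relay, assuming psycopg2 and kafka-python — the DSN, the topic name, and an id column on outbox are hypothetical:

import json
import time

import psycopg2
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())
conn = psycopg2.connect("dbname=shop")  # hypothetical DSN

while True:
    with conn, conn.cursor() as cur:
        # Lock a batch so concurrent workers never double-publish a row.
        cur.execute("SELECT id, event_type, payload FROM outbox "
                    "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED")
        rows = cur.fetchall()
        for _, event_type, payload in rows:
            producer.send("order-events", {"type": event_type, "data": payload})
        producer.flush()                # confirm delivery before deleting
        if rows:
            cur.execute("DELETE FROM outbox WHERE id = ANY(%s)",
                        ([r[0] for r in rows],))
    time.sleep(1)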
Database Selection Framework
In a system design interview, choosing a database should be a deliberate, reasoned decision, not a default. Walk through these questions:
Questions to Ask
- What does the data look like? — Structured tables? Nested documents? A graph of relationships? Time-series events?
- What are the primary access patterns? — Key lookups? Range scans? Complex joins? Full-text search? Graph traversals?
- What consistency level is required? — Strong ACID? Eventual consistency? Tunable per query?
- What is the expected scale? — 1K reads/sec (any database works) vs. 1M reads/sec (need horizontal scaling)?
- What's the read:write ratio? — Read-heavy → optimize indexes, caching. Write-heavy → LSM-tree databases, append-only stores.
- What are the latency requirements? — Sub-millisecond (Redis) vs. acceptable 10-50ms (SQL with indexes)?
- What does the team know? — A PostgreSQL expert team shipping with Postgres beats a team struggling with Cassandra's data modeling.
Quick Reference: Database per Use Case
| Use Case | Recommended | Why |
|---|---|---|
| User accounts & auth | PostgreSQL | Strong consistency, constraints, mature auth libraries |
| Session management | Redis | Sub-ms lookups, TTL expiry, in-memory speed |
| Product catalog | MongoDB | Varying attributes per product category, nested specs |
| Full-text search | Elasticsearch | Inverted index, relevance scoring, autocomplete |
| IoT sensor data | Cassandra / TimescaleDB | Append-only writes, time-windowed queries, linear scale |
| Social network | Neo4j | Friend-of-friend, shortest path, community detection |
| Financial ledger | PostgreSQL / CockroachDB | ACID, serializable isolation, audit trails |
| Real-time analytics | ClickHouse / Druid | Columnar storage, fast aggregations on billions of rows |
| Config & feature flags | Redis / etcd | Fast reads, pub/sub for change notification |