Column-Family Stores (Cassandra, HBase)
The Wide-Column Data Model
A column-family store (also called a wide-column store) organises data into rows and columns—but unlike a relational table, every row can have a different set of columns. There is no fixed schema enforced across all rows. Think of it as a two-dimensional key-value store: the first key identifies the row (partition key), the second key identifies the column, and together they point to a value.
The canonical mental model is a sorted map of sorted maps:
Map<PartitionKey, SortedMap<ColumnKey, Value>>
In practice, the inner map is further grouped into column families. A column family is a logical grouping of columns that are frequently accessed together—similar to tables in RDBMS, but far more flexible. Within a column family, data is stored sorted by the column key, enabling extremely efficient range scans within a single partition.
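As a rough illustration, the nested-map model can be sketched in a few lines of Python; the row keys, column names, and helper functions below are purely illustrative, with sorting applied at read time to mimic the clustered column order.

```python
# A wide-column store modelled as Map<PartitionKey, SortedMap<ColumnKey, Value>>.
# Plain dicts are used here; columns are sorted at read time to mimic on-disk order.

store = {}  # partition key -> {column key -> value}

def put(partition_key, column, value):
    store.setdefault(partition_key, {})[column] = value

def range_scan(partition_key, start_col, end_col):
    """Return the columns in [start_col, end_col) for one partition, in sorted order."""
    row = store.get(partition_key, {})
    return [(c, row[c]) for c in sorted(row) if start_col <= c < end_col]

# Rows may carry entirely different columns; missing columns cost nothing.
put("user_1001", "name", "Alice Chen")
put("user_1001", "plan", "premium")
put("user_1002", "name", "Bob Kumar")
put("user_1002", "phone", "+1-555-0199")

print(range_scan("user_1001", "a", "z"))  # [('name', 'Alice Chen'), ('plan', 'premium')]
```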
Column Family vs. Row-Oriented Storage
| Dimension | Row-Oriented (MySQL, PostgreSQL) | Column-Family (Cassandra, HBase) |
|---|---|---|
| Schema | Fixed schema; every row has the same columns | Flexible; each row can have different columns |
| Storage layout | Entire row stored contiguously on disk | Columns within a family stored together; different families on separate files |
| Null handling | NULL values consume space or require bitmap | Missing columns simply don’t exist—zero storage cost |
| Read pattern | Efficient for SELECT * (all columns) | Efficient for reading specific column subsets or ranges |
| Scaling | Typically vertical; sharding is manual | Designed for horizontal scaling from day one |
| Joins | Native JOIN support | No joins; denormalization is the norm |
Conceptual Example
Consider a user profile stored in a column family called users:
Row Key: user_1001
├── name: "Alice Chen"
├── email: "alice@example.com"
├── plan: "premium"
└── last_login: "2026-04-15T08:30:00Z"
Row Key: user_1002
├── name: "Bob Kumar"
├── email: "bob@example.com"
└── phone: "+1-555-0199" ← column that user_1001 doesn't have
(no plan, no last_login for this row)
Notice that user_1002 has a phone column that user_1001 doesn’t, and lacks plan and last_login. Neither scenario causes errors or wastes storage. This sparse data model is one of the core advantages of wide-column stores for datasets with highly variable attributes.
Cassandra Deep Dive: Architecture
Apache Cassandra is a masterclass in distributed systems design. It merges the column-family data model from Bigtable with the peer-to-peer, always-available architecture from Amazon’s Dynamo. The result is a database that provides linear horizontal scalability, tunable consistency, and no single point of failure.
Ring Architecture & Consistent Hashing
Cassandra organises its cluster as a ring (token ring). The ring represents the entire hash space—a continuous range from -2^63 to +2^63 - 1 using the Murmur3Partitioner (the default since Cassandra 1.2). Each node in the cluster is assigned one or more tokens that determine which portion of the hash space it owns.
When a write comes in for a partition key like user_1001, Cassandra computes hash(user_1001) to get a token value, then walks the ring clockwise to find the first node whose token is ≥ that value. That node becomes the primary replica. Depending on the replication factor (RF), additional copies are placed on subsequent nodes around the ring.
# How partition key maps to a node
token = murmur3_hash("user_1001") → e.g., 3,450,892,112,843
# Walk ring clockwise:
# Node A owns tokens [-2^63, 1T]
# Node B owns tokens [1T, 3T]
# Node C owns tokens [3T, 5T] ← token 3.45T lands here (primary)
# Node D owns tokens [5T, 7T] ← replica 2 (RF=3)
# Node E owns tokens [7T, 2^63] ← replica 3 (RF=3)
With vnodes (virtual nodes), each physical node owns many small, non-contiguous token ranges rather than one large range; the default num_tokens was historically 256, lowered to 16 in Cassandra 4.0. This dramatically improves data distribution balance, speeds up bootstrapping new nodes (they stream data from many peers in parallel), and handles heterogeneous hardware gracefully.
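A minimal sketch of token assignment and replica lookup under this scheme, assuming a stand-in hash in place of Murmur3 and made-up node names:

```python
import bisect
import hashlib

# Toy token ring with virtual nodes. Cassandra uses Murmur3; a truncated MD5
# stands in here only so the example has no external dependencies.
def token(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big", signed=True)

class Ring:
    def __init__(self, nodes, vnodes=16, rf=3):
        self.rf = rf
        # Each physical node gets `vnodes` tokens scattered around the ring.
        self.tokens = sorted((token(f"{n}-{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [t for t, _ in self.tokens]

    def replicas(self, partition_key: str):
        """Walk clockwise from the key's token, collecting RF distinct nodes."""
        idx = bisect.bisect_right(self.keys, token(partition_key))
        owners = []
        for i in range(len(self.tokens)):
            node = self.tokens[(idx + i) % len(self.tokens)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.rf:
                break
        return owners

ring = Ring(["nodeA", "nodeB", "nodeC", "nodeD", "nodeE"])
print(ring.replicas("user_1001"))  # e.g. ['nodeC', 'nodeD', 'nodeA']
```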
Gossip Protocol
Cassandra nodes discover and monitor each other using a gossip protocol—a peer-to-peer epidemic protocol inspired by how rumours spread in social networks. Every second, each node selects 1–3 random peers and exchanges state information (heartbeat counters, load data, token ownership, schema version).
How gossip works in detail:
- SYN: Node A picks a random peer (Node B) and sends its digest—a compact summary of what it knows about every node (node ID + heartbeat generation + version number).
- ACK: Node B compares A’s digest with its own state. It sends back (a) any state that A is missing and (b) a digest of state that B is missing from A.
- ACK2: Node A sends the state that B was missing. After this three-way handshake, both nodes have converged their views.
Gossip converges exponentially: with N nodes, every node learns about a state change within O(log N) gossip rounds—meaning a 1,000-node cluster converges in roughly 10 seconds. The failure detector (Phi Accrual Failure Detector) uses gossip heartbeats to calculate a suspicion level for each peer. When the phi value exceeds the threshold (default: 8), the node is marked DOWN.
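As a rough sketch of the failure detector's idea (the accrual paper models heartbeat inter-arrival times statistically; this toy version uses a simple exponential approximation, and all names and defaults besides the threshold of 8 are illustrative):

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Toy phi accrual failure detector over heartbeat inter-arrival times."""

    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold          # cf. phi_convict_threshold (default 8)
        self.intervals = deque(maxlen=window)
        self.last_heartbeat = None

    def heartbeat(self, now: float):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # -log10(P(heartbeat still on its way)) with an exponential model
        #   = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

    def is_down(self, now: float) -> bool:
        return self.phi(now) > self.threshold

# One heartbeat per second for a while, then silence:
d = PhiAccrualDetector()
for t in range(10):
    d.heartbeat(float(t))
print(d.is_down(10.5))   # False: only half a second late
print(d.is_down(30.0))   # True: phi is far above the threshold of 8
```

The key property is that suspicion grows continuously with silence instead of flipping at a fixed timeout, so the threshold can trade detection speed against false positives.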
Seed nodes are listed in cassandra.yaml. Seeds serve as initial contact points for new nodes joining the cluster. They are not special at runtime—they simply help the gossip protocol bootstrap by providing a known starting peer. Best practice: designate 2–3 seeds per data centre.
Cassandra Write Path
Cassandra’s write path is optimised for sequential I/O, making writes extremely fast—typically under 1 ms at the 99th percentile for local writes. Here is the complete journey of a write:
- Client → Coordinator: The client connects to any node in the cluster (the coordinator). The coordinator determines which nodes hold replicas for the partition key using the token ring.
- Coordinator → Replica Nodes: The coordinator sends the write to all replica nodes simultaneously (not sequentially).
- Commit Log (sequential append): Each replica node first appends the mutation to the commit log—an append-only, durable WAL (Write-Ahead Log) on disk. This is sequential I/O, which SSDs and HDDs handle very efficiently. The commit log ensures durability: if the node crashes, the commit log is replayed on restart.
- Memtable (in-memory write): Simultaneously, the mutation is written to the memtable—an in-memory sorted data structure (backed by a skip list or B-tree). Writes to the memtable are essentially at memory speed.
- Acknowledgement: Once the commit log and memtable writes complete, the replica node sends an ACK back to the coordinator. The coordinator waits for enough ACKs to satisfy the requested consistency level (e.g., QUORUM = ⌊RF/2⌋ + 1) and then responds to the client.
- Memtable Flush → SSTable: When the memtable reaches a configurable size threshold (default ~64 MB), it is flushed to disk as an immutable SSTable (Sorted String Table). SSTables are never modified after creation—this is fundamental to the LSM Tree architecture.
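The core of steps 3, 4, and 6 can be sketched in a few lines of Python; the file layout, JSON encoding, and flush threshold below are illustrative, not Cassandra's real on-disk format:

```python
import json
import os

class WritePath:
    """Toy single-node write path: append to a commit log, buffer in a memtable,
    flush to an immutable 'SSTable' file once a threshold is reached."""

    def __init__(self, data_dir, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.data_dir = data_dir
        self.memtable = {}          # partition key -> row dict (sorted at flush time)
        self.sstable_count = 0
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commitlog.jsonl"), "a")

    def write(self, key, row):
        # 1. Durability first: sequential append to the commit log (the WAL).
        self.commit_log.write(json.dumps({"key": key, "row": row}) + "\n")
        self.commit_log.flush()
        # 2. Then the in-memory memtable.
        self.memtable.setdefault(key, {}).update(row)
        # 3. Flush once the memtable grows past the threshold.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_count}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)  # sorted by partition key
        self.sstable_count += 1
        self.memtable = {}          # fresh memtable; the old one is now immutable on disk

wp = WritePath("/tmp/toy_cassandra")
wp.write("user_1001", {"name": "Alice Chen", "plan": "premium"})
```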
Write Path Performance Characteristics
| Stage | Medium | Latency | Durability |
|---|---|---|---|
| Commit Log append | Disk (sequential) | ~0.1–0.5 ms (SSD) | Durable immediately |
| Memtable insert | Memory | ~0.01 ms | Volatile (backed by commit log) |
| SSTable flush | Disk (sequential) | Background; 50–200 ms per flush | Durable, immutable |
| End-to-end client write | Network + above | 0.5–2 ms (p99, local DC) | Per consistency level |
CQL Write Examples
-- Create a keyspace with NetworkTopologyStrategy (production standard)
CREATE KEYSPACE iot_data
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'eu-west': 3
}
AND durable_writes = true;
-- Create a time-series table for sensor data
CREATE TABLE iot_data.sensor_readings (
sensor_id UUID,
reading_date DATE,
reading_time TIMESTAMP,
temperature DOUBLE,
humidity DOUBLE,
pressure DOUBLE,
battery_pct TINYINT,
PRIMARY KEY ((sensor_id, reading_date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC)
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS',
'compaction_window_size': 1
}
AND default_time_to_live = 7776000; -- 90 days TTL
-- Insert a reading (write path is triggered)
INSERT INTO iot_data.sensor_readings
(sensor_id, reading_date, reading_time, temperature, humidity, pressure, battery_pct)
VALUES
(550e8400-e29b-41d4-a716-446655440000, '2026-04-15', '2026-04-15T14:30:00Z', 23.5, 65.2, 1013.25, 87)
USING TTL 7776000;
Cassandra Read Path
The read path in Cassandra is significantly more complex than the write path because data may reside across multiple SSTables and the memtable simultaneously. Cassandra employs several layers of caching and indexing to avoid unnecessary disk I/O.
Complete read path, step by step:
- Memtable check: First, Cassandra checks the in-memory memtable. If the partition exists there, the data is returned immediately (merged with any SSTable data if needed).
- Bloom filter: For each SSTable on disk, Cassandra checks its Bloom filter—a space-efficient probabilistic data structure that answers “Is this partition key possibly in this SSTable?” A negative answer is definitive (the key is definitely not in this SSTable), avoiding a disk read. A positive answer might be a false positive (configurable; default ~1% FP rate with bloom_filter_fp_chance = 0.01).
- Key cache: If the Bloom filter says “maybe,” Cassandra checks the key cache—an in-memory LRU cache that maps partition keys to their byte offset in the SSTable index file. A key cache hit skips the next step entirely.
- Partition index: On a key cache miss, Cassandra consults the partition summary (an in-memory sample of the index) to narrow the search range, then performs a binary search on the partition index (stored in a separate file per SSTable, or in the trie-based index in newer releases). This yields the exact byte offset of the partition in the data file.
- Compression offset map: Cassandra data files are stored compressed in 64 KB chunks (configurable). The compression offset map (in memory) translates the partition’s logical offset to the physical offset of the compressed chunk on disk.
- Disk read: Finally, the compressed chunk is read from disk, decompressed, and the requested columns are extracted.
- Merge: If data for this partition exists in multiple SSTables (due to updates), Cassandra merges the results using timestamp-based conflict resolution—the cell with the highest timestamp wins (Last Write Wins, LWW).
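A toy Bloom filter makes the “definitely absent” versus “maybe present” behaviour concrete; the bit-array size and hash construction below are arbitrary, whereas real SSTable filters size themselves from bloom_filter_fp_chance:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions set bits in an m-bit array.
    'Maybe present' can be a false positive; 'absent' is always correct."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user_1001")
print(bf.might_contain("user_1001"))  # True
print(bf.might_contain("user_9999"))  # almost certainly False -> skip this SSTable
```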
Read Path Optimisation Summary
| Layer | Type | Purpose | Hit = Skip To |
|---|---|---|---|
| Memtable | Memory | Latest unflushed writes | Return data directly |
| Bloom Filter | Memory | Eliminate SSTables that definitely don’t have the key | Skip entire SSTable |
| Key Cache | Memory | Cached partition key → data offset mapping | Jump to compression offset |
| Partition Index | Disk* | Binary search for partition offset | Compression offset |
| Compression Offset Map | Memory | Map logical offset to physical chunk | Disk read |
| SSTable Data File | Disk | Actual data (compressed chunks) | Decompress & return |
*Partition index is memory-mapped; hot portions stay in OS page cache.
The LSM Tree: Engine Behind the Scenes
Both Cassandra and HBase are built on the Log-Structured Merge Tree (LSM Tree), a data structure designed to transform random writes into sequential writes at the cost of more complex reads. Understanding the LSM Tree is fundamental to understanding why these databases behave the way they do.
LSM Tree Architecture
An LSM Tree organises data across multiple levels:
- Level 0 — Memtable (in memory): All writes go here first. The memtable is a sorted, mutable data structure (typically a red-black tree, skip list, or B-tree). When it reaches a size threshold, it becomes immutable and a new memtable is created.
- Level 0 — Immutable memtables flushed to SSTables: The immutable memtable is written to disk as a Sorted String Table (SSTable)—a file containing key-value pairs sorted by key. This flush is a single sequential write—no seeking.
- Level 1–N — Compacted SSTables: Over time, SSTables accumulate at Level 0. Compaction merges overlapping SSTables into larger, non-overlapping SSTables at deeper levels. Each level is typically 10× the size of the previous level.
LSM Tree Structure:
┌─────────────────────────────┐
│ MEMTABLE (mutable, RAM) │ ← All writes land here
│ [sorted by key] │
├─────────────────────────────┤
│ IMMUTABLE MEMTABLE (RAM) │ ← Waiting to flush
├─────────────────────────────┤
│ L0: SST₁ SST₂ SST₃ │ ← Recently flushed, may overlap
├─────────────────────────────┤
│ L1: SST_A SST_B │ ← Compacted, non-overlapping
├─────────────────────────────┤
│ L2: SST_X SST_Y SST_Z │ ← Larger, non-overlapping
├─────────────────────────────┤
│ ...more levels... │
└─────────────────────────────┘
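The merge step at the heart of compaction (and of read-time reconciliation) can be sketched as a sorted merge with last-write-wins; the (key, timestamp, value) tuple format here is illustrative:

```python
import heapq

def compact(*sstables):
    """Merge SSTables (lists of (key, timestamp, value) sorted by key) into one,
    keeping only the newest version of each key: last write wins."""
    merged = {}
    for key, ts, value in heapq.merge(*sstables):   # inputs are already sorted
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return [(k, ts, v) for k, (ts, v) in sorted(merged.items())]

older = [("a", 100, "v1"), ("c", 100, "v1")]
newer = [("a", 200, "v2"), ("b", 150, "v1")]
print(compact(older, newer))
# [('a', 200, 'v2'), ('b', 150, 'v1'), ('c', 100, 'v1')]
```

Because the inputs are already sorted, the merge is a single sequential pass, which is why compaction is sequential rather than random I/O.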
Write, Space, and Read Amplification
The LSM Tree introduces three types of amplification that are critical trade-offs:
| Amplification Type | Definition | LSM Tree Impact | B-Tree Comparison |
|---|---|---|---|
| Write Amplification | Ratio of total bytes written to disk vs. bytes written by the user | 10–30× (data is re-written during each compaction level) | 2–5× (WAL + page writes) |
| Space Amplification | Ratio of disk space used vs. actual data size | 1.1–2× (temporary during compaction) | ~1.5× (fragmentation in pages) |
| Read Amplification | Number of disk reads per logical read | 1–5 (Bloom filters mitigate significantly) | 1–3 (B-tree depth) |
Compaction Strategies
Compaction is the most operationally important aspect of running Cassandra (and HBase). It directly controls read latency, write amplification, disk usage, and I/O patterns. Cassandra offers three compaction strategies, each optimised for different workloads.
Size-Tiered Compaction Strategy (STCS)
STCS is the default strategy. It groups SSTables of similar size into “tiers” and merges them when a tier reaches a threshold (default: 4 SSTables of similar size).
How it works:
- SSTables accumulate at each size tier.
- When 4 SSTables of similar size exist (min_threshold), they are compacted into one larger SSTable.
- The larger SSTable moves to the next tier.
- The process repeats at each tier level.
STCS Compaction Tiers:
Tier 1 (small): [64MB] [64MB] [64MB] [64MB] → compact → [256MB]
Tier 2 (medium): [256MB] [256MB] [256MB] [256MB] → compact → [1GB]
Tier 3 (large): [1GB] [1GB] [1GB] [1GB] → compact → [4GB]
| Pros | Cons |
|---|---|
| Lowest write amplification (~5×) | Highest space amplification (up to 2× during major compaction) |
| Good write throughput | Worst read performance (many overlapping SSTables to check) |
| Simple to understand and tune | Obsolete data lingers in large SSTables |
Best for: Write-heavy, insert-only workloads (logging, event streams) where you rarely update or delete data.
Leveled Compaction Strategy (LCS)
LCS organises SSTables into levels where each level is 10× the size of the previous. Within each level, SSTables have non-overlapping key ranges—meaning a read only needs to check one SSTable per level.
How it works:
- Level 0 contains freshly flushed SSTables (may overlap).
- L0 SSTables are merged into Level 1 (non-overlapping, each SSTable ≤ 160 MB by default).
- When Level 1 exceeds 10 × sstable_size_in_mb, excess SSTables are compacted with overlapping SSTables in Level 2.
- Each level is guaranteed non-overlapping ⇒ at most one SSTable per level is read for a given partition.
LCS Levels:
L0: [SST₁] [SST₂] ← may overlap
L1: [A-D] [E-H] [I-L] [M-P] [Q-T] [U-Z] ← non-overlapping, ≤160MB each
L2: 10× more SSTables, non-overlapping ← ~10× L1 total size
L3: 10× more than L2... ← ~10× L2 total size
| Pros | Cons |
|---|---|
| Best read performance (typically 1–2 SSTables per read) | Highest write amplification (10–30×) |
| Lowest space amplification (~1.1×) | Heavy compaction I/O can starve writes |
| Efficient for read-heavy workloads with updates | Not suitable for pure write workloads |
Best for: Read-heavy workloads with frequent updates and deletes (user profiles, product catalogs, session stores).
Time-Window Compaction Strategy (TWCS)
TWCS is purpose-built for time-series data. It groups SSTables by time window and compacts within each window using STCS. Once a time window closes (all data for that period has been flushed), the resulting SSTable is never compacted again.
How it works:
- Configure a time window (e.g., 1 hour).
- Within the current window, SSTables are compacted using STCS.
- When the window closes, the remaining SSTable(s) for that window are left untouched.
- Old windows are dropped entirely when TTL expires (extremely efficient deletion—just delete the file).
-- TWCS configuration for time-series
ALTER TABLE iot_data.sensor_readings WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS',
'compaction_window_size': 1,
'timestamp_resolution': 'MICROSECONDS',
'max_threshold': 32,
'min_threshold': 4
};
| Pros | Cons |
|---|---|
| Minimal write amplification (compaction only within windows) | Data must be written in approximately time order |
| Efficient TTL expiration (drop entire SSTable) | Terrible for out-of-order or late-arriving data |
| Predictable, bounded compaction I/O | Updates/deletes within past windows are very expensive |
Best for: Time-series, IoT telemetry, metrics, logs—anything with a TTL and naturally time-ordered writes.
Compaction Strategy Decision Matrix
| Workload | Strategy | Reason |
|---|---|---|
| Time-series (IoT, metrics, logs) | TWCS | Time-ordered writes + TTL = efficient window drops |
| Write-heavy, append-only | STCS | Lowest write amplification |
| Read-heavy with updates | LCS | Guaranteed read performance, low space amp |
| Mixed read/write | LCS | Better read latency worth the write amp trade-off |
| Immutable data (CDC events) | STCS | No updates/deletes, so overlapping SSTables are fine |
Tunable Consistency
Cassandra’s killer feature for distributed systems is tunable consistency: you choose the consistency level per query, not per database. This lets you make different trade-offs for different operations within the same application.
Consistency Levels Explained
| Level | Write Behaviour | Read Behaviour | Nodes Required |
|---|---|---|---|
| ONE | Wait for 1 replica ACK | Read from 1 replica | 1 of RF |
| TWO | Wait for 2 replica ACKs | Read from 2 replicas | 2 of RF |
| THREE | Wait for 3 replica ACKs | Read from 3 replicas | 3 of RF |
| QUORUM | Wait for ⌊RF/2⌋+1 ACKs | Read from ⌊RF/2⌋+1 replicas | Majority of RF |
| LOCAL_QUORUM | Quorum within the local data centre only | Quorum read from local DC only | Majority in local DC |
| EACH_QUORUM | Quorum in every data centre | Not supported for reads | Majority in each DC |
| ALL | Wait for all RF replicas | Read from all RF replicas | All RF nodes |
| ANY | Write succeeds if stored anywhere (even a hinted handoff) | Not supported for reads | 1 (including hints) |
Strong Consistency Formula
Cassandra provides strong consistency (linearisable reads) when:
R + W > RF
Where:
R = read consistency level (number of nodes read)
W = write consistency level (number of nodes written)
RF = replication factor
Example (RF = 3):
QUORUM write (2) + QUORUM read (2) = 4 > 3 ✓ Strong consistency
ONE write (1) + ALL read (3) = 4 > 3 ✓ Strong consistency
ONE write (1) + ONE read (1) = 2 ≤ 3 ✗ Eventual consistency
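The same rule expressed as a small helper, using the level definitions from the table above:

```python
def quorum(rf: int) -> int:
    return rf // 2 + 1              # floor(RF/2) + 1

def nodes_for(level: str, rf: int) -> int:
    return {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": quorum(rf), "ALL": rf}[level]

def is_strong(read_cl: str, write_cl: str, rf: int) -> bool:
    """Strong consistency when the read and write replica sets must overlap: R + W > RF."""
    return nodes_for(read_cl, rf) + nodes_for(write_cl, rf) > rf

print(is_strong("QUORUM", "QUORUM", 3))  # True  (2 + 2 > 3)
print(is_strong("ONE", "ONE", 3))        # False (1 + 1 <= 3)
```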
Multi-data-centre best practice: use LOCAL_QUORUM for both reads and writes. This gives strong consistency within a data centre while keeping cross-DC latency off the critical path. With RF=3 per DC, LOCAL_QUORUM requires 2 of 3 local replicas—tolerating 1 node failure per DC without any read/write impact.
-- Per-query consistency level in CQL
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM users WHERE user_id = 'u1001';
-- Or in application code (Java driver):
// Statement stmt = SimpleStatement.newInstance("SELECT ...")
// .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
-- High availability write (accepts hinted handoff)
CONSISTENCY ANY;
INSERT INTO events (id, ts, data) VALUES (uuid(), toTimestamp(now()), 'payload');
-- Strongest possible read
CONSISTENCY ALL;
SELECT * FROM account_balances WHERE account_id = 'acc_5001';
Data Modeling: Query-First Design
Cassandra data modeling is the polar opposite of relational data modeling. In RDBMS, you normalise your data first and figure out queries later. In Cassandra, you start with the queries and design tables to serve each query optimally. This is called query-first design (or query-driven modeling).
The Rules of Cassandra Data Modeling
- No joins. Denormalize everything. If a query needs data from what would be two relational tables, combine them into one Cassandra table.
- No aggregations. Pre-compute aggregates at write time (or use materialised views / Spark for batch aggregation).
- One table per query pattern. If you have 5 different query patterns, you likely need 5 tables, each with a different partition key.
- Minimise partitions read. The ideal query touches exactly one partition. Cross-partition queries (IN on the partition key or ALLOW FILTERING) are anti-patterns.
- Even partition sizes. Aim for partitions ≤ 100 MB and ≤ 100,000 rows. Huge partitions (“wide rows”) cause hotspots and GC pressure.
Partition Keys vs. Clustering Keys
The PRIMARY KEY in CQL has two parts:
PRIMARY KEY ((partition_key_columns), clustering_key_columns)
↑ double parens = composite partition key
-- Example:
PRIMARY KEY ((sensor_id, reading_date), reading_time)
↑ partition key (2 cols) ↑ clustering key
| Key Type | Purpose | Determines | Example |
|---|---|---|---|
| Partition Key | Determines which node stores the data | Data distribution (which node), data locality (what’s stored together) | (sensor_id, reading_date) |
| Clustering Key | Determines sort order within a partition | On-disk sort order, range query capability | reading_time DESC |
Data Modeling Example: E-Commerce
Consider an e-commerce platform with these query patterns:
-- Q1: Get all orders for a user, most recent first
CREATE TABLE orders_by_user (
user_id UUID,
order_date TIMESTAMP,
order_id UUID,
total_amount DECIMAL,
status TEXT,
items FROZEN<LIST<FROZEN<map<TEXT, TEXT>>>>,
PRIMARY KEY ((user_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);
-- Query: SELECT * FROM orders_by_user WHERE user_id = ?;
-- Reads a single partition, ordered by date DESC. Perfect.
-- Q2: Get order details by order_id (for order status page)
CREATE TABLE orders_by_id (
order_id UUID,
user_id UUID,
order_date TIMESTAMP,
total_amount DECIMAL,
status TEXT,
shipping_addr FROZEN<map<TEXT, TEXT>>,
items FROZEN<LIST<FROZEN<map<TEXT, TEXT>>>>,
PRIMARY KEY ((order_id))
);
-- Query: SELECT * FROM orders_by_id WHERE order_id = ?;
-- Single partition lookup. O(1).
-- Q3: Get all orders with a specific status (for admin dashboard)
CREATE TABLE orders_by_status (
status TEXT,
order_date TIMESTAMP,
order_id UUID,
user_id UUID,
total_amount DECIMAL,
PRIMARY KEY ((status), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);
-- Query: SELECT * FROM orders_by_status WHERE status = 'shipped'
-- AND order_date > '2026-04-01';
-- Partition by status + range on date. Efficient.
-- ANTI-PATTERN (do NOT do this):
-- SELECT * FROM orders_by_user WHERE status = 'shipped';
-- This requires scanning ALL partitions → ALLOW FILTERING → full table scan!
Time-Series Data Modeling Pattern
-- Time-series with bucketed partitions
-- Problem: If partition key is just sensor_id, the partition grows unbounded
-- Solution: Add a time bucket to the partition key
CREATE TABLE metrics (
metric_name TEXT,
bucket TEXT, -- '2026-04-15-14' (hourly bucket)
ts TIMESTAMP,
value DOUBLE,
tags MAP<TEXT, TEXT>,
PRIMARY KEY ((metric_name, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS',
'compaction_window_size': 1}
AND default_time_to_live = 2592000; -- 30 day retention
-- Query recent metrics:
-- SELECT * FROM metrics
-- WHERE metric_name = 'cpu_usage' AND bucket = '2026-04-15-14'
-- LIMIT 100;
-- Application code generates bucket from timestamp:
-- bucket = timestamp.strftime('%Y-%m-%d-%H')
HBase: The Hadoop Column Store
Apache HBase is the other major column-family store. Where Cassandra chose availability and partition tolerance (AP in CAP theorem), HBase chose consistency and partition tolerance (CP). HBase provides strong consistency (linearisable reads/writes) for every operation—no tunable consistency levels, no eventual consistency.
🎯 Cassandra (AP)
- Peer-to-peer, no master node
- Tunable consistency (ONE to ALL)
- Always available (even during partitions)
- Multi-data centre replication built in
- CQL query language (SQL-like)
- Runs standalone (no dependencies)
📚 HBase (CP)
- Master-slave architecture (HMaster + RegionServers)
- Strong consistency (always linearisable)
- Unavailable during master failover
- Tight HDFS integration (Hadoop ecosystem)
- Java API (no SQL-like language natively)
- Requires HDFS + ZooKeeper
HBase Architecture
Key components:
- HMaster: The master server responsible for region assignment, DDL operations (create/delete tables), load balancing of regions across RegionServers, and monitoring. It does not handle data reads/writes—clients go directly to RegionServers.
- RegionServer: The workhorse. Each RegionServer hosts multiple regions. A region is a contiguous range of row keys for a table. Clients read/write directly to the RegionServer that hosts their region. The write path is nearly identical to Cassandra’s: WAL → MemStore → HFile (SSTable equivalent).
- ZooKeeper: Provides distributed coordination. It tracks which RegionServer is alive, which RegionServer hosts which region, and manages HMaster election. ZooKeeper is the control plane; it never handles data.
- HDFS: The underlying distributed file system. HFiles (SSTables), WALs, and all persistent data are stored on HDFS. This gives HBase automatic 3× replication at the storage layer—data durability is delegated to HDFS.
HBase Architecture:
┌───────────────┐
│ Client │
└──────┬────────┘
│ (lookup region location via ZooKeeper + META table)
▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ RegionServer 1│ │ RegionServer 2│ │ RegionServer 3│
│ Region A │ │ Region C │ │ Region E │
│ Region B │ │ Region D │ │ Region F │
├───────────────┤ ├───────────────┤ ├───────────────┤
│ WAL + MemStore│ │ WAL + MemStore│ │ WAL + MemStore│
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└────────────────────┼────────────────────┘
▼
┌────────────────┐
│ HDFS │ (HFiles, WALs)
│ (3x replicated) │
└────────┬───────┘
│
┌────────────────┐
│ ZooKeeper │ (coordination, META location)
└────────────────┘
│
┌────────────────┐
│ HMaster │ (DDL, region assignment, balancing)
└────────────────┘
HBase Write Path
- Client looks up the hbase:meta table (cached locally after first lookup) to find which RegionServer hosts the target region.
- Client sends the write directly to the RegionServer.
- RegionServer appends to the WAL (Write-Ahead Log) on HDFS for durability.
- RegionServer writes to the MemStore (in-memory, sorted by row key).
- ACK returned to client. The write is durable (WAL on HDFS) and visible (MemStore).
- When MemStore reaches threshold (default 128 MB), it flushes to an HFile on HDFS.
- Compaction merges HFiles (minor: merge small HFiles; major: merge all HFiles for a region).
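A client-side sketch of this flow, assuming the third-party happybase Python library (which talks to HBase through its Thrift gateway); the host, table, and column names here are made up:

```python
import happybase  # Thrift-based HBase client; requires the HBase Thrift server

# Connect to the Thrift gateway; the RegionServer lookup happens behind this API.
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("sensor_readings")

# A single put: the hosting RegionServer appends to its WAL on HDFS, then
# updates the MemStore, and acknowledges the write.
table.put(
    b"sensor-550e8400#2026-04-15T14:30:00Z",
    {
        b"metrics:temperature": b"23.5",
        b"metrics:humidity": b"65.2",
        b"meta:battery_pct": b"87",
    },
)

# Reads go to the same RegionServer and are strongly consistent.
print(table.row(b"sensor-550e8400#2026-04-15T14:30:00Z"))
```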
HBase vs. Cassandra Comparison
| Dimension | Cassandra | HBase |
|---|---|---|
| Consistency | Tunable (eventual to strong) | Always strong (CP) |
| Architecture | Peer-to-peer (masterless) | Master-slave (HMaster) |
| Dependencies | None (standalone JVM) | HDFS + ZooKeeper + HMaster |
| Write throughput | Higher (no WAL replication overhead) | Good but lower (HDFS WAL 3× replication) |
| Read latency | Tunable (faster with CL=ONE) | Consistent (single RegionServer) |
| Multi-DC | First-class support | Via replication (HBase Replication, limited) |
| Hadoop integration | Via Spark connector | Native (sits on HDFS, MapReduce, Spark) |
| Row-level transactions | Lightweight transactions (LWT, Paxos-based) | Native atomic row operations (checkAndPut) |
| Query language | CQL (SQL-like, familiar) | Java API / Apache Phoenix (SQL layer) |
| Ideal for | High-availability, multi-DC, write-heavy | Strong consistency, Hadoop ecosystem, batch analytics |
When to Use Column-Family Stores
Ideal Use Cases
| Use Case | Why Column-Family Excels | Real-World Example |
|---|---|---|
| Time-series data | Sequential writes, natural time-bucketed partitions, TTL for retention, TWCS compaction | Netflix: 2M+ writes/sec of viewing activity data in Cassandra |
| IoT telemetry | Millions of devices, each producing small frequent writes; reads are by device + time range | John Deere: billions of sensor readings from tractors and farm equipment |
| Write-heavy workloads | LSM Tree architecture makes writes O(1) sequential I/O; 100K+ writes/sec per node | Discord: billions of messages stored in Cassandra (later migrated to ScyllaDB) |
| Event sourcing / CDC | Immutable append-only events, natural fit for STCS compaction | Uber: trip events, driver location updates, pricing events |
| User activity / personalisation | Sparse data (each user has different activity patterns), fast lookups by user ID | Spotify: user listening history, recommendations, playlist metadata |
| Messaging / chat | Messages partitioned by conversation, clustered by timestamp for efficient scrollback | Apple iMessage: global scale messaging infrastructure |
When NOT to Use Column-Family Stores
| Scenario | Why Not | Better Alternative |
|---|---|---|
| Complex joins and ad-hoc queries | No join support, query-first design limits flexibility | PostgreSQL, MySQL |
| ACID transactions across rows/tables | Only row-level atomicity; LWT is slow (Paxos round-trips) | PostgreSQL, CockroachDB, Spanner |
| Small datasets (< 10 GB) | Operational overhead of a Cassandra cluster is unjustified for small data | PostgreSQL, SQLite |
| Strong consistency everywhere (finance, banking) | Tunable consistency adds complexity; HBase better if you need Hadoop; Spanner if you need SQL | Spanner, CockroachDB, HBase |
| Graph traversals | No relationship model; each hop requires a separate query | Neo4j, Amazon Neptune |
| Full-text search | No indexing for free-text queries | Elasticsearch, OpenSearch |
Common Interview Questions
How does Cassandra achieve zero single points of failure?
Answer: Cassandra uses a peer-to-peer (masterless) architecture where every node is identical. There is no “master” or “leader” node. Any node can serve any request for any partition (acting as a coordinator). The gossip protocol disseminates cluster topology to every node, so no central registry is needed. Data is replicated to RF nodes, so even if RF−1 nodes go down, all data remains accessible (at appropriate consistency levels). Compare this to HBase, where the HMaster is a potential SPOF (mitigated by standby HMasters, but failover takes 10–30 seconds).
Explain the difference between partition key and clustering key. Why does it matter?
The partition key determines where data lives (which node, via consistent hashing). All rows with the same partition key are stored on the same node(s) and in the same SSTable partition. The clustering key determines the sort order within the partition on disk.
This matters because: (1) queries on the partition key are O(1) lookups, (2) range queries on the clustering key are efficient sequential reads within a single partition, and (3) a poorly chosen partition key creates hotspots (all traffic to one node) or unbounded partitions (single partition grows to gigabytes).
Rule of thumb: Partition key = equality predicate in your WHERE clause. Clustering key = range/ordering predicate.
You need to store 1 billion events/day with 90-day retention. Design the schema.
Calculation: 1B events/day = ~11,500 events/sec average, ~50K peak. Each event ~500 bytes = 500 GB/day raw, ~1.5 TB/day with RF=3.
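The same arithmetic, spelled out (the 500-byte event size and RF=3 come from the estimate above):

```python
# Back-of-the-envelope sizing for 1 billion events/day with 90-day TTL and RF=3.
events_per_day = 1_000_000_000
event_size_bytes = 500
rf = 3
retention_days = 90

avg_writes_per_sec = events_per_day / 86_400
raw_gb_per_day = events_per_day * event_size_bytes / 1e9
replicated_tb_per_day = raw_gb_per_day * rf / 1_000
total_tb_on_disk = replicated_tb_per_day * retention_days

print(f"{avg_writes_per_sec:,.0f} writes/sec average")    # ~11,574
print(f"{raw_gb_per_day:,.0f} GB/day raw")                 # 500
print(f"{replicated_tb_per_day:.1f} TB/day with RF=3")     # 1.5
print(f"{total_tb_on_disk:.0f} TB on disk over 90 days")   # 135 (before compression)
```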
Schema:
CREATE TABLE events (
event_type TEXT,
day DATE,
hour TINYINT,
event_time TIMESTAMP,
event_id TIMEUUID,
payload BLOB,
PRIMARY KEY ((event_type, day, hour), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC)
AND compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS',
'compaction_window_size': 1}
AND default_time_to_live = 7776000;
Why this design: Partition key includes hour bucket to cap partition size (~21M events per event_type per hour). TWCS for time-window compaction. TTL of 90 days = automatic deletion. Cluster size: ~15–20 nodes with 2 TB NVMe SSDs each.
When would you choose HBase over Cassandra?
Choose HBase when:
- You need strong consistency for every operation (e.g., financial ledger where you can’t tolerate stale reads).
- You’re already invested in the Hadoop ecosystem (HDFS, MapReduce, Hive, Spark) and want tight integration for batch analytics.
- You need atomic row-level operations (HBase’s checkAndPut/checkAndDelete are native, not Paxos-based like Cassandra’s LWT).
- You don’t need multi-data centre replication (HBase’s cross-DC replication is limited compared to Cassandra).
Choose Cassandra when: multi-DC is required, availability > consistency, or you want a simpler operational model (no HDFS/ZK dependency).
What is a tombstone in Cassandra and why are they dangerous?
Since SSTables are immutable, Cassandra cannot delete data in place. Instead, a delete writes a special marker called a tombstone—a record that says “this cell/row was deleted at timestamp T.” During reads, the tombstone suppresses the older value. Tombstones are removed during compaction after gc_grace_seconds (default: 10 days).
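A tiny sketch of how a tombstone suppresses older values during the read-time merge (the marker object and helper are illustrative):

```python
TOMBSTONE = object()  # deletion marker standing in for a real tombstone cell

def merge_cell(versions):
    """versions: list of (timestamp, value). Returns the live value, or None if deleted."""
    ts, value = max(versions, key=lambda v: v[0])   # newest timestamp wins
    return None if value is TOMBSTONE else value

print(merge_cell([(100, "premium"), (200, TOMBSTONE)]))  # None: the delete wins
print(merge_cell([(100, "premium"), (50, TOMBSTONE)]))   # 'premium': stale delete loses
```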
Why dangerous: If you delete frequently (e.g., delete + reinsert pattern, or TTL expiring many cells), tombstones accumulate between compactions. Reads that scan many tombstones waste CPU and I/O, and if a single read encounters > 100,000 tombstones (configurable), Cassandra aborts the query with TombstoneOverwhelmingException. Mitigation: choose the right compaction strategy (TWCS for TTL data), avoid deletes when possible, and monitor nodetool tablestats for tombstone counts.