High Level Design Series · Data Storage · Part 4 of 70

Column-Family Stores (Cassandra, HBase)

The Wide-Column Data Model

A column-family store (also called a wide-column store) organises data into rows and columns—but unlike a relational table, every row can have a different set of columns. There is no fixed schema enforced across all rows. Think of it as a two-dimensional key-value store: the first key identifies the row (partition key), the second key identifies the column, and together they point to a value.

The canonical mental model is a sorted map of sorted maps:

Map<PartitionKey, SortedMap<ColumnKey, Value>>

In practice, the inner map is further grouped into column families. A column family is a logical grouping of columns that are frequently accessed together—similar to tables in RDBMS, but far more flexible. Within a column family, data is stored sorted by the column key, enabling extremely efficient range scans within a single partition.
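As a mental model, the map-of-maps can be sketched in a few lines of Python. Plain dicts stand in for the sorted maps, and the function names are illustrative, not a real client API:

```python
# Toy model of a column-family store: Map<PartitionKey, SortedMap<ColumnKey, Value>>.
# Plain dicts stand in for the maps; range scans sort the column keys on read.

store = {}

def put(partition_key, column, value):
    store.setdefault(partition_key, {})[column] = value

def range_scan(partition_key, start, end):
    """Return (column, value) pairs with start <= column < end, in column order."""
    row = store.get(partition_key, {})
    return [(c, row[c]) for c in sorted(row) if start <= c < end]

put("user_1001", "email", "alice@example.com")
put("user_1001", "name", "Alice Chen")
put("user_1002", "phone", "+1-555-0199")   # sparse: no 'email' column exists here

print(range_scan("user_1001", "a", "n"))   # → [('email', 'alice@example.com')]
```

Note how the sparse row for `user_1002` costs nothing: absent columns are simply absent keys.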

Column Family vs. Row-Oriented Storage

| Dimension | Row-Oriented (MySQL, PostgreSQL) | Column-Family (Cassandra, HBase) |
|---|---|---|
| Schema | Fixed schema; every row has the same columns | Flexible; each row can have different columns |
| Storage layout | Entire row stored contiguously on disk | Columns within a family stored together; different families in separate files |
| Null handling | NULL values consume space or require bitmaps | Missing columns simply don’t exist—zero storage cost |
| Read pattern | Efficient for SELECT * (all columns) | Efficient for reading specific column subsets or ranges |
| Scaling | Typically vertical; sharding is manual | Designed for horizontal scaling from day one |
| Joins | Native JOIN support | No joins; denormalization is the norm |

Conceptual Example

Consider a user profile stored in a column family called users:

Row Key: user_1001
  ├── name: "Alice Chen"
  ├── email: "alice@example.com"
  ├── plan: "premium"
  └── last_login: "2026-04-15T08:30:00Z"

Row Key: user_1002
  ├── name: "Bob Kumar"
  ├── email: "bob@example.com"
  └── phone: "+1-555-0199"        ← column that user_1001 doesn't have
                                     (no plan, no last_login for this row)

Notice that user_1002 has a phone column that user_1001 doesn’t, and lacks plan and last_login. Neither scenario causes errors or wastes storage. This sparse data model is one of the core advantages of wide-column stores for datasets with highly variable attributes.

Historical note: The wide-column model was introduced by Google’s Bigtable paper (2006). The key insight was that real-world data at Google scale—web crawl data, Maps tiles, Gmail—was inherently sparse. Forcing it into rigid relational schemas was both wasteful and operationally painful. Bigtable directly inspired both HBase (open-source Bigtable clone on HDFS) and Cassandra (which combined Bigtable’s data model with Amazon Dynamo’s distribution architecture).

Cassandra Deep Dive: Architecture

Apache Cassandra is a masterclass in distributed systems design. It merges the column-family data model from Bigtable with the peer-to-peer, always-available architecture from Amazon’s Dynamo. The result is a database that provides linear horizontal scalability, tunable consistency, and no single point of failure.

  • 2M+ writes/sec on a single cluster (Netflix)
  • 0 single points of failure
  • 1,000+ nodes in the largest clusters (Apple)
  • Petabytes of data per cluster

Ring Architecture & Consistent Hashing

Cassandra organises its cluster as a ring (token ring). The ring represents the entire hash space—a contiguous range from -2^63 to +2^63 - 1 under the Murmur3Partitioner (the default since Cassandra 1.2). Each node in the cluster is assigned one or more tokens that determine which portion of the hash space it owns.

When a write comes in for a partition key like user_1001, Cassandra computes hash(user_1001) to get a token value, then walks the ring clockwise to find the first node whose token is ≥ that value. That node becomes the primary replica. Depending on the replication factor (RF), additional copies are placed on subsequent nodes around the ring.

# How partition key maps to a node
token = murmur3_hash("user_1001")  → e.g., 3,450,892,112,843
# Walk ring clockwise:
#   Node A owns tokens [-2^63, 1T]
#   Node B owns tokens [1T, 3T]
#   Node C owns tokens [3T, 5T]     ← token 3.45T lands here (primary)
#   Node D owns tokens [5T, 7T]     ← replica 2 (RF=3)
#   Node E owns tokens [7T, 2^63]   ← replica 3 (RF=3)

With vnodes (virtual nodes), each physical node owns many small, non-contiguous token ranges rather than one large range. The default was 256 vnodes per physical node for years; Cassandra 4.0 lowered the default to 16. Vnodes dramatically improve data distribution balance, speed up bootstrapping new nodes (they stream data from many peers in parallel), and handle heterogeneous hardware gracefully.
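The clockwise walk and replica placement can be sketched as a toy ring. The hash is a stand-in (MD5 here, not Murmur3) and the node layout is illustrative:

```python
# Toy token ring: walk clockwise from the key's token to pick RF replicas.
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in for Murmur3: map the key into a signed 64-bit token space.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h - 2**63

def replicas(key: str, ring: dict, rf: int = 3):
    """ring maps token -> node name; pick rf consecutive nodes clockwise."""
    tokens = sorted(ring)
    # First node whose token is >= the key's token, wrapping past the end.
    i = bisect.bisect_left(tokens, token(key)) % len(tokens)
    return [ring[tokens[(i + k) % len(tokens)]] for k in range(rf)]

ring = {-4 * 10**18: "A", -10**18: "B", 10**18: "C", 4 * 10**18: "D", 7 * 10**18: "E"}
print(replicas("user_1001", ring))  # three distinct nodes, clockwise from the key's token
```

With vnodes, the same lookup applies unchanged; the ring simply contains many more tokens, several per physical node.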

Gossip Protocol

Cassandra nodes discover and monitor each other using a gossip protocol—a peer-to-peer epidemic protocol inspired by how rumours spread in social networks. Every second, each node selects 1–3 random peers and exchanges state information (heartbeat counters, load data, token ownership, schema version).

How gossip works in detail:

  1. SYN: Node A picks a random peer (Node B) and sends its digest—a compact summary of what it knows about every node (node ID + heartbeat generation + version number).
  2. ACK: Node B compares A’s digest with its own state. It sends back (a) any state that A is missing and (b) a digest of state that B is missing from A.
  3. ACK2: Node A sends the state that B was missing. After this three-way handshake, both nodes have converged their views.
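The three-way exchange can be simulated with two version-stamped state maps. This is a toy model: each node's view is `{node_id: (heartbeat_version, payload)}`, and real gossip digests carry generation numbers and richer state.

```python
# Toy SYN/ACK/ACK2 gossip exchange: after one round, both views converge.

def exchange(a_state, b_state):
    # SYN: A sends a digest (node -> version) to B.
    a_digest = {n: v for n, (v, _) in a_state.items()}
    # ACK: B replies with entries where it is newer, plus what it is missing.
    b_newer = {n: s for n, s in b_state.items() if s[0] > a_digest.get(n, -1)}
    b_wants = {n for n, v in a_digest.items() if v > b_state.get(n, (-1, None))[0]}
    a_state.update(b_newer)
    # ACK2: A sends the entries B was missing or had stale versions of.
    b_state.update({n: a_state[n] for n in b_wants})

a = {"A": (5, "up"), "C": (1, "up")}
b = {"B": (3, "up"), "C": (2, "down")}
exchange(a, b)
assert a == b == {"A": (5, "up"), "B": (3, "up"), "C": (2, "down")}
```

Per-node version counters are what make the exchange cheap: only deltas cross the wire, never full state.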

Gossip converges exponentially: with N nodes, every node learns about a state change within O(log N) gossip rounds—meaning a 1,000-node cluster converges in roughly 10 seconds. The failure detector (Phi Accrual Failure Detector) uses gossip heartbeats to calculate a suspicion level for each peer. When the phi value exceeds the threshold (default: 8), the node is marked DOWN.
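The suspicion calculation can be sketched by modelling heartbeat inter-arrival times as exponential with the observed mean. Cassandra's detector maintains a sliding window of observed intervals; this only shows the shape of the curve.

```python
# Back-of-envelope Phi Accrual: phi = -log10(P(silence this long | history)).
import math

def phi(time_since_last: float, mean_interval: float) -> float:
    # For an exponential model, P(interval > t) = exp(-t / mean).
    p_later = math.exp(-time_since_last / mean_interval)
    return -math.log10(p_later)

# Heartbeats normally ~1s apart; suspicion grows smoothly with silence.
print(round(phi(1.0, 1.0), 2))    # ≈ 0.43 — just on time, low suspicion
print(round(phi(20.0, 1.0), 2))   # ≈ 8.69 — past the default threshold of 8
```

Unlike a fixed timeout, phi is continuous: operators tune how much suspicion (not how many seconds) marks a node DOWN, which adapts to variable network conditions.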

Seed nodes: Cassandra requires at least one seed node configured in cassandra.yaml. Seeds serve as initial contact points for new nodes joining the cluster. They are not special at runtime—they simply help the gossip protocol bootstrap by providing a known starting peer. Best practice: designate 2–3 seeds per data centre.

Cassandra Write Path

Cassandra’s write path is optimised for sequential I/O, making writes extremely fast—typically under 1 ms at the 99th percentile for local writes. Here is the complete journey of a write:

  1. Client → Coordinator: The client connects to any node in the cluster (the coordinator). The coordinator determines which nodes hold replicas for the partition key using the token ring.
  2. Coordinator → Replica Nodes: The coordinator sends the write to all replica nodes simultaneously (not sequentially).
  3. Commit Log (sequential append): Each replica node first appends the mutation to the commit log—an append-only, durable WAL (Write-Ahead Log) on disk. This is sequential I/O, which SSDs and HDDs handle very efficiently. The commit log ensures durability: if the node crashes, the log is replayed on restart.
  4. Memtable (in-memory write): Simultaneously, the mutation is written to the memtable—an in-memory sorted data structure (backed by a skip list or B-tree). Writes to the memtable are essentially at memory speed.
  5. Acknowledgement: Once the commit log and memtable writes complete, the replica node sends an ACK back to the coordinator. The coordinator waits for enough ACKs to satisfy the requested consistency level (e.g., QUORUM = ⌊RF/2⌋ + 1) and then responds to the client.
  6. Memtable Flush → SSTable: When the memtable reaches a configurable size threshold (default ~64 MB), it is flushed to disk as an immutable SSTable (Sorted String Table). SSTables are never modified after creation—this is fundamental to the LSM Tree architecture.
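Steps 3–6 can be condensed into a toy sketch. The flush trigger here is a row count purely for illustration; the real trigger is a size threshold in MB.

```python
# Minimal write path: append to a commit log, insert into a memtable,
# flush to an immutable "SSTable" once a threshold is hit.
import json

commit_log = []          # stands in for the append-only WAL file
memtable = {}            # in-memory mutable structure, sorted on flush
sstables = []            # immutable sorted runs "on disk"
FLUSH_THRESHOLD = 3      # illustrative; Cassandra flushes on size, not row count

def write(key, value):
    commit_log.append(json.dumps({"k": key, "v": value}))  # durability first
    memtable[key] = value                                   # then memory-speed insert
    if len(memtable) >= FLUSH_THRESHOLD:
        flush()

def flush():
    global memtable
    sstables.append(sorted(memtable.items()))  # immutable, sorted by key
    memtable = {}

for i in range(5):
    write(f"user_{i}", {"n": i})

print(len(sstables), len(memtable), len(commit_log))  # → 1 2 5
```

Every operation here is an append or an in-memory insert; nothing on this path performs a random disk seek, which is the whole point of the design.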


Write Path Performance Characteristics

| Stage | Medium | Latency | Durability |
|---|---|---|---|
| Commit log append | Disk (sequential) | ~0.1–0.5 ms (SSD) | Durable immediately |
| Memtable insert | Memory | ~0.01 ms | Volatile (backed by commit log) |
| SSTable flush | Disk (sequential) | Background; 50–200 ms per flush | Durable, immutable |
| End-to-end client write | Network + above | 0.5–2 ms (p99, local DC) | Per consistency level |

Why writes are fast: Both the commit log and SSTable flush are sequential writes. There are zero random disk seeks on the write path. Compare this to a B-tree database (PostgreSQL, MySQL) where an INSERT may trigger random I/O to update index pages. This is why Cassandra consistently outperforms relational databases on write throughput by 5–10× at scale.

CQL Write Examples

-- Create a keyspace with NetworkTopologyStrategy (production standard)
CREATE KEYSPACE iot_data
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,
  'eu-west': 3
}
AND durable_writes = true;

-- Create a time-series table for sensor data
CREATE TABLE iot_data.sensor_readings (
  sensor_id    UUID,
  reading_date DATE,
  reading_time TIMESTAMP,
  temperature  DOUBLE,
  humidity     DOUBLE,
  pressure     DOUBLE,
  battery_pct  TINYINT,
  PRIMARY KEY ((sensor_id, reading_date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC)
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
  }
  AND default_time_to_live = 7776000;  -- 90 days TTL

-- Insert a reading (write path is triggered)
INSERT INTO iot_data.sensor_readings
  (sensor_id, reading_date, reading_time, temperature, humidity, pressure, battery_pct)
VALUES
  (550e8400-e29b-41d4-a716-446655440000, '2026-04-15', '2026-04-15T14:30:00Z', 23.5, 65.2, 1013.25, 87)
USING TTL 7776000;

Cassandra Read Path

The read path in Cassandra is significantly more complex than the write path because data may reside across multiple SSTables and the memtable simultaneously. Cassandra employs several layers of caching and indexing to avoid unnecessary disk I/O.

Complete read path, step by step:

  1. Memtable check: First, Cassandra checks the in-memory memtable. If the partition exists there, the data is returned immediately (merged with any SSTable data if needed).
  2. Bloom filter: For each SSTable on disk, Cassandra checks its Bloom filter—a space-efficient probabilistic data structure that answers “Is this partition key possibly in this SSTable?” A negative answer is definitive (the key is definitely not in this SSTable), avoiding a disk read. A positive answer might be a false positive (configurable; default ~1% FP rate with bloom_filter_fp_chance = 0.01).
  3. Key cache: If the Bloom filter says “maybe,” Cassandra checks the key cache—an in-memory LRU cache that maps partition keys to their byte offset in the SSTable index file. A key cache hit skips the next step entirely.
  4. Partition index: On a key cache miss, Cassandra consults the partition summary (an in-memory sample of the index) to narrow the search range, then binary-searches the partition index (stored in a separate file per SSTable) to find the exact byte offset in the data file. Later versions introduce a trie-based index format that replaces this two-step lookup.
  5. Compression offset map: Cassandra data files are stored compressed in 64 KB chunks (configurable). The compression offset map (in memory) translates the partition’s logical offset to the physical offset of the compressed chunk on disk.
  6. Disk read: Finally, the compressed chunk is read from disk, decompressed, and the requested columns are extracted.
  7. Merge: If data for this partition exists in multiple SSTables (due to updates), Cassandra merges the results using timestamp-based conflict resolution—the cell with the highest timestamp wins (Last Write Wins, LWW).
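The timestamp-based merge in step 7 can be sketched like this, with cells as illustrative (value, timestamp) pairs:

```python
# Last-write-wins merge of one partition's cells from several sources
# (memtable + SSTables): per column, the highest timestamp wins.

def merge_lww(*versions):
    result = {}
    for cells in versions:
        for col, (value, ts) in cells.items():
            if col not in result or ts > result[col][1]:
                result[col] = (value, ts)
    return {col: value for col, (value, ts) in result.items()}

old_sstable = {"plan": ("free", 100), "email": ("a@x.io", 100)}
new_sstable = {"plan": ("premium", 250)}          # later update to one column
memtable    = {"last_login": ("2026-04-15", 300)}

merged = merge_lww(old_sstable, new_sstable, memtable)
print(merged)  # → {'plan': 'premium', 'email': 'a@x.io', 'last_login': '2026-04-15'}
```

Because the merge is per cell, a newer SSTable only shadows the columns it actually rewrote; untouched columns survive from older SSTables.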


Read Path Optimisation Summary

| Layer | Type | Purpose | Hit = Skip To |
|---|---|---|---|
| Memtable | Memory | Latest unflushed writes | Return data directly |
| Bloom filter | Memory | Eliminate SSTables that definitely don’t have the key | Skip entire SSTable |
| Key cache | Memory | Cached partition key → data offset mapping | Jump to compression offset |
| Partition index | Disk* | Binary search for partition offset | Compression offset |
| Compression offset map | Memory | Map logical offset to physical chunk | Disk read |
| SSTable data file | Disk | Actual data (compressed chunks) | Decompress & return |

*Partition index is memory-mapped; hot portions stay in OS page cache.

Read amplification: In the worst case, reading a single partition might require checking Bloom filters on every SSTable, falling through to disk reads on multiple SSTables, and merging the results. This is read amplification—and it’s the primary reason compaction exists. Compaction merges SSTables, reducing the number a read must consult.
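The definitive-no / probabilistic-yes behaviour of the Bloom filter in step 2 is easy to demonstrate with a toy implementation (sizes and hash choices here are illustrative):

```python
# Toy Bloom filter: "no" is definitive, "maybe" can be a false positive.
import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # bit array packed into a single int

    def _positions(self, key):
        # Derive k positions by salting a cryptographic hash (illustrative).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array |= 1 << pos

    def might_contain(self, key):
        return all(self.array >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("user_1001")
print(bf.might_contain("user_1001"))   # True — always; no false negatives
print(bf.might_contain("user_9999"))   # almost certainly False at this load
```

Production filters size `bits` and `hashes` from the target false-positive rate (the `bloom_filter_fp_chance` knob mentioned above); a few bits per key buys a ~1% FP rate.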

The LSM Tree: Engine Behind the Scenes

Both Cassandra and HBase are built on the Log-Structured Merge Tree (LSM Tree), a data structure designed to transform random writes into sequential writes at the cost of more complex reads. Understanding the LSM Tree is fundamental to understanding why these databases behave the way they do.

LSM Tree Architecture

An LSM Tree organises data across multiple levels:

  1. Level 0 — Memtable (in memory): All writes go here first. The memtable is a sorted, mutable data structure (typically a red-black tree, skip list, or B-tree). When it reaches a size threshold, it becomes immutable and a new memtable is created.
  2. Level 0 — Immutable memtables flushed to SSTables: The immutable memtable is written to disk as a Sorted String Table (SSTable)—a file containing key-value pairs sorted by key. This flush is a single sequential write—no seeking.
  3. Level 1–N — Compacted SSTables: Over time, SSTables accumulate at Level 0. Compaction merges overlapping SSTables into larger, non-overlapping SSTables at deeper levels. Each level is typically 10× the size of the previous level.
LSM Tree Structure:
┌─────────────────────────────┐
│  MEMTABLE (mutable, RAM)    │  ← All writes land here
│  [sorted by key]            │
├─────────────────────────────┤
│  IMMUTABLE MEMTABLE (RAM)   │  ← Waiting to flush
├─────────────────────────────┤
│  L0: SST₁  SST₂  SST₃     │  ← Recently flushed, may overlap
├─────────────────────────────┤
│  L1: SST_A   SST_B         │  ← Compacted, non-overlapping
├─────────────────────────────┤
│  L2: SST_X  SST_Y  SST_Z   │  ← Larger, non-overlapping
├─────────────────────────────┤
│  ...more levels...          │
└─────────────────────────────┘
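A compaction step, merging overlapping sorted runs into a single non-overlapping run, can be sketched as:

```python
# Merge sorted SSTables with heapq.merge, keeping only the newest cell per key.
import heapq

def compact(*sstables):
    """Each SSTable is a list of (key, timestamp, value) sorted by key.
    Later timestamps shadow earlier ones for the same key."""
    merged = []
    for key, ts, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            if ts > merged[-1][1]:
                merged[-1] = (key, ts, value)  # newer cell wins
        else:
            merged.append((key, ts, value))
    return merged

sst1 = [("a", 1, "old"), ("c", 1, "x")]
sst2 = [("a", 2, "new"), ("b", 1, "y")]
print(compact(sst1, sst2))  # → [('a', 2, 'new'), ('b', 1, 'y'), ('c', 1, 'x')]
```

Because the inputs are already sorted, the merge is a single sequential pass; this is why compaction, like flushing, is sequential I/O.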

Write, Space, and Read Amplification

The LSM Tree introduces three types of amplification that are critical trade-offs:

| Amplification Type | Definition | LSM Tree Impact | B-Tree Comparison |
|---|---|---|---|
| Write amplification | Ratio of total bytes written to disk vs. bytes written by the user | 10–30× (data is re-written during each compaction level) | 2–5× (WAL + page writes) |
| Space amplification | Ratio of disk space used vs. actual data size | 1.1–2× (temporary during compaction) | ~1.5× (fragmentation in pages) |
| Read amplification | Number of disk reads per logical read | 1–5 (Bloom filters mitigate significantly) | 1–3 (B-tree depth) |

The fundamental trade-off: LSM Trees accept more complex reads and higher write amplification in exchange for dramatically better write throughput. For workloads that are 80%+ writes (which describes time-series, logging, and IoT perfectly), this trade-off is overwhelmingly favourable. A single Cassandra node can sustain 100,000+ writes/sec where a comparable PostgreSQL instance might top out at 10,000–20,000.

Compaction Strategies

Compaction is the most operationally important aspect of running Cassandra (and HBase). It directly controls read latency, write amplification, disk usage, and I/O patterns. Cassandra offers three compaction strategies, each optimised for different workloads.

Size-Tiered Compaction Strategy (STCS)

STCS is the default strategy. It groups SSTables of similar size into “tiers” and merges them when a tier reaches a threshold (default: 4 SSTables of similar size).

How it works:

  1. SSTables accumulate at each size tier.
  2. When 4 SSTables of similar size exist (min_threshold), they are compacted into one larger SSTable.
  3. The larger SSTable moves to the next tier.
  4. The process repeats at each tier level.
STCS Compaction Tiers:
  Tier 1 (small):   [64MB] [64MB] [64MB] [64MB]  → compact → [256MB]
  Tier 2 (medium):  [256MB] [256MB] [256MB] [256MB] → compact → [1GB]
  Tier 3 (large):   [1GB] [1GB] [1GB] [1GB] → compact → [4GB]
| Pros | Cons |
|---|---|
| Lowest write amplification (~5×) | Highest space amplification (up to 2× during major compaction) |
| Good write throughput | Worst read performance (many overlapping SSTables to check) |
| Simple to understand and tune | Obsolete data lingers in large SSTables |

Best for: Write-heavy, insert-only workloads (logging, event streams) where you rarely update or delete data.

Leveled Compaction Strategy (LCS)

LCS organises SSTables into levels where each level is 10× the size of the previous. Within each level, SSTables have non-overlapping key ranges—meaning a read only needs to check one SSTable per level.

How it works:

  1. Level 0 contains freshly flushed SSTables (may overlap).
  2. L0 SSTables are merged into Level 1 (non-overlapping, each SSTable ≤ 160 MB by default).
  3. When Level 1 exceeds 10 × sstable_size_in_mb, excess SSTables are compacted with overlapping SSTables in Level 2.
  4. Each level is guaranteed non-overlapping ⇒ at most one SSTable per level is read for a given partition.
LCS Levels:
  L0: [SST₁] [SST₂]                              ← may overlap
  L1: [A-D] [E-H] [I-L] [M-P] [Q-T] [U-Z]       ← non-overlapping, ≤160MB each
  L2: 10× more SSTables, non-overlapping           ← ~10× L1 total size
  L3: 10× more than L2...                          ← ~10× L2 total size
| Pros | Cons |
|---|---|
| Best read performance (typically 1–2 SSTables per read) | Highest write amplification (10–30×) |
| Lowest space amplification (~1.1×) | Heavy compaction I/O can starve writes |
| Efficient for read-heavy workloads with updates | Not suitable for pure write workloads |

Best for: Read-heavy workloads with frequent updates and deletes (user profiles, product catalogs, session stores).

Time-Window Compaction Strategy (TWCS)

TWCS is purpose-built for time-series data. It groups SSTables by time window and compacts within each window using STCS. Once a time window closes (all data for that period has been flushed), the resulting SSTable is never compacted again.

How it works:

  1. Configure a time window (e.g., 1 hour).
  2. Within the current window, SSTables are compacted using STCS.
  3. When the window closes, the remaining SSTable(s) for that window are left untouched.
  4. Old windows are dropped entirely when TTL expires (extremely efficient deletion—just delete the file).
-- TWCS configuration for time-series
ALTER TABLE iot_data.sensor_readings WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': 1,
  'timestamp_resolution': 'MICROSECONDS',
  'max_threshold': 32,
  'min_threshold': 4
};
| Pros | Cons |
|---|---|
| Minimal write amplification (compaction only within windows) | Data must be written in approximately time order |
| Efficient TTL expiration (drop entire SSTable) | Terrible for out-of-order or late-arriving data |
| Predictable, bounded compaction I/O | Updates/deletes within past windows are very expensive |

Best for: Time-series, IoT telemetry, metrics, logs—anything with a TTL and naturally time-ordered writes.

Compaction Strategy Decision Matrix

| Workload | Strategy | Reason |
|---|---|---|
| Time-series (IoT, metrics, logs) | TWCS | Time-ordered writes + TTL = efficient window drops |
| Write-heavy, append-only | STCS | Lowest write amplification |
| Read-heavy with updates | LCS | Guaranteed read performance, low space amplification |
| Mixed read/write | LCS | Better read latency worth the write amplification trade-off |
| Immutable data (CDC events) | STCS | No updates/deletes, so overlapping SSTables are fine |

Tunable Consistency

Cassandra’s killer feature for distributed systems is tunable consistency: you choose the consistency level per query, not per database. This lets you make different trade-offs for different operations within the same application.

Consistency Levels Explained

| Level | Write Behaviour | Read Behaviour | Nodes Required |
|---|---|---|---|
| ONE | Wait for 1 replica ACK | Read from 1 replica | 1 of RF |
| TWO | Wait for 2 replica ACKs | Read from 2 replicas | 2 of RF |
| THREE | Wait for 3 replica ACKs | Read from 3 replicas | 3 of RF |
| QUORUM | Wait for ⌊RF/2⌋+1 ACKs | Read from ⌊RF/2⌋+1 replicas | Majority of RF |
| LOCAL_QUORUM | Quorum within the local data centre only | Quorum read from local DC only | Majority in local DC |
| EACH_QUORUM | Quorum in every data centre | Not supported for reads | Majority in each DC |
| ALL | Wait for all RF replicas | Read from all RF replicas | All RF nodes |
| ANY | Write succeeds if stored anywhere (even as a hinted handoff) | Not supported for reads | 1 (including hints) |

Strong Consistency Formula

Cassandra provides strong consistency (linearisable reads) when:

  R + W > RF

  Where:
    R  = read consistency level (number of nodes read)
    W  = write consistency level (number of nodes written)
    RF = replication factor

  Example (RF = 3):
    QUORUM write (2) + QUORUM read (2) = 4 > 3  ✓ Strong consistency
    ONE write (1) + ALL read (3) = 4 > 3         ✓ Strong consistency
    ONE write (1) + ONE read (1) = 2 ≤ 3         ✗ Eventual consistency
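The rule can be checked mechanically; `quorum` below matches the ⌊RF/2⌋ + 1 definition used throughout:

```python
# Quick check of the R + W > RF rule for common consistency-level pairs.

def quorum(rf: int) -> int:
    return rf // 2 + 1  # floor(RF/2) + 1

def strongly_consistent(r: int, w: int, rf: int) -> bool:
    # Read and write sets must overlap in at least one replica.
    return r + w > rf

RF = 3
print(strongly_consistent(quorum(RF), quorum(RF), RF))  # → True  (QUORUM + QUORUM)
print(strongly_consistent(1, RF, RF))                   # → True  (ALL write, ONE read)
print(strongly_consistent(1, 1, RF))                    # → False (ONE + ONE)
```

The intuition: when R + W > RF, any read quorum must intersect any write quorum, so at least one replica in the read set has seen the latest write.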
Production pattern: Most Cassandra deployments use LOCAL_QUORUM for both reads and writes. This gives strong consistency within a data centre while keeping cross-DC latency off the critical path. With RF=3 per DC, LOCAL_QUORUM requires 2 of 3 local replicas—tolerating 1 node failure per DC without any read/write impact.
-- Per-query consistency level in CQL
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM users WHERE user_id = 'u1001';

-- Or in application code (Java driver):
// Statement stmt = SimpleStatement.newInstance("SELECT ...")
//     .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);

-- High availability write (accepts hinted handoff)
CONSISTENCY ANY;
INSERT INTO events (id, ts, data) VALUES (uuid(), toTimestamp(now()), 'payload');

-- Strongest possible read
CONSISTENCY ALL;
SELECT * FROM account_balances WHERE account_id = 'acc_5001';

Data Modeling: Query-First Design

Cassandra data modeling is the polar opposite of relational data modeling. In RDBMS, you normalise your data first and figure out queries later. In Cassandra, you start with the queries and design tables to serve each query optimally. This is called query-first design (or query-driven modeling).

The Rules of Cassandra Data Modeling

  1. No joins. Denormalize everything. If a query needs data from what would be two relational tables, combine them into one Cassandra table.
  2. No aggregations. Pre-compute aggregates at write time (or use materialised views / Spark for batch aggregation).
  3. One table per query pattern. If you have 5 different query patterns, you likely need 5 tables, each with a different partition key.
  4. Minimise partitions read. The ideal query touches exactly one partition. Cross-partition queries (IN on partition key or ALLOW FILTERING) are anti-patterns.
  5. Even partition sizes. Aim for partitions ≤ 100 MB and ≤ 100,000 rows. Huge partitions (“wide rows”) cause hotspots and GC pressure.

Partition Keys vs. Clustering Keys

The PRIMARY KEY in CQL has two parts:

PRIMARY KEY ((partition_key_columns), clustering_key_columns)
              ↑ double parens = composite partition key

-- Example:
PRIMARY KEY ((sensor_id, reading_date), reading_time)
             ↑ partition key (2 cols)   ↑ clustering key
| Key Type | Purpose | Determines | Example |
|---|---|---|---|
| Partition key | Determines which node stores the data | Data distribution (which node), data locality (what’s stored together) | (sensor_id, reading_date) |
| Clustering key | Determines sort order within a partition | On-disk sort order, range query capability | reading_time DESC |

Data Modeling Example: E-Commerce

Consider an e-commerce platform with these query patterns:

-- Q1: Get all orders for a user, most recent first
CREATE TABLE orders_by_user (
  user_id       UUID,
  order_date    TIMESTAMP,
  order_id      UUID,
  total_amount  DECIMAL,
  status        TEXT,
  items         FROZEN<LIST<FROZEN<MAP<TEXT, TEXT>>>>,
  PRIMARY KEY ((user_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);

-- Query: SELECT * FROM orders_by_user WHERE user_id = ?;
-- Reads a single partition, ordered by date DESC. Perfect.

-- Q2: Get order details by order_id (for order status page)
CREATE TABLE orders_by_id (
  order_id      UUID,
  user_id       UUID,
  order_date    TIMESTAMP,
  total_amount  DECIMAL,
  status        TEXT,
  shipping_addr FROZEN<MAP<TEXT, TEXT>>,
  items         FROZEN<LIST<FROZEN<MAP<TEXT, TEXT>>>>,
  PRIMARY KEY ((order_id))
);

-- Query: SELECT * FROM orders_by_id WHERE order_id = ?;
-- Single partition lookup. O(1).

-- Q3: Get all orders with a specific status (for admin dashboard)
CREATE TABLE orders_by_status (
  status        TEXT,
  order_date    TIMESTAMP,
  order_id      UUID,
  user_id       UUID,
  total_amount  DECIMAL,
  PRIMARY KEY ((status), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);

-- Query: SELECT * FROM orders_by_status WHERE status = 'shipped'
--        AND order_date > '2026-04-01';
-- Partition by status + range on date. Efficient.

-- ANTI-PATTERN (do NOT do this):
-- SELECT * FROM orders_by_user WHERE status = 'shipped';
-- This requires scanning ALL partitions → ALLOW FILTERING → full table scan!
Denormalization cost: Yes, this means storing order data in 3 tables. You write each order 3 times. With RF=3, that’s 9 physical writes per order. This feels wasteful coming from an RDBMS mindset, but disk is cheap and Cassandra writes are essentially free (sub-millisecond). The trade-off is guaranteed sub-5ms reads at any scale. At companies like Netflix and Uber, this is an easy trade-off.

Time-Series Data Modeling Pattern

-- Time-series with bucketed partitions
-- Problem: If partition key is just sensor_id, the partition grows unbounded
-- Solution: Add a time bucket to the partition key

CREATE TABLE metrics (
  metric_name  TEXT,
  bucket       TEXT,           -- '2026-04-15-14' (hourly bucket)
  ts           TIMESTAMP,
  value        DOUBLE,
  tags         MAP<TEXT, TEXT>,
  PRIMARY KEY ((metric_name, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'HOURS',
                    'compaction_window_size': 1}
  AND default_time_to_live = 2592000;  -- 30 day retention

-- Query recent metrics:
-- SELECT * FROM metrics
--   WHERE metric_name = 'cpu_usage' AND bucket = '2026-04-15-14'
--   LIMIT 100;

-- Application code generates bucket from timestamp:
-- bucket = timestamp.strftime('%Y-%m-%d-%H')
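A runnable version of that bucketing helper might look like this (function names are illustrative):

```python
# Derive the hourly bucket string used as part of the partition key.
from datetime import datetime, timedelta, timezone

def hourly_bucket(ts: datetime) -> str:
    # Matches the '%Y-%m-%d-%H' convention in the schema comments above.
    return ts.strftime("%Y-%m-%d-%H")

def buckets_for_range(start: datetime, end: datetime) -> list[str]:
    # A time-range query fans out to one partition per hourly bucket.
    out, cur = [], start.replace(minute=0, second=0, microsecond=0)
    while cur <= end:
        out.append(hourly_bucket(cur))
        cur += timedelta(hours=1)
    return out

print(hourly_bucket(datetime(2026, 4, 15, 14, 30, tzinfo=timezone.utc)))  # → 2026-04-15-14
```

The bucket width is a tuning knob: narrower buckets cap partition size but fan a range query out across more partitions.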

HBase: The Hadoop Column Store

Apache HBase is the other major column-family store. Where Cassandra chose availability and partition tolerance (AP in CAP theorem), HBase chose consistency and partition tolerance (CP). HBase provides strong consistency (linearisable reads/writes) for every operation—no tunable consistency levels, no eventual consistency.

🎯 Cassandra (AP)

  • Peer-to-peer, no master node
  • Tunable consistency (ONE to ALL)
  • Always available (even during partitions)
  • Multi-data centre replication built in
  • CQL query language (SQL-like)
  • Runs standalone (no dependencies)

📚 HBase (CP)

  • Master-slave architecture (HMaster + RegionServers)
  • Strong consistency (always linearisable)
  • Unavailable during master failover
  • Tight HDFS integration (Hadoop ecosystem)
  • Java API (no SQL-like language natively)
  • Requires HDFS + ZooKeeper

HBase Architecture

Key components:

HBase Architecture:
┌───────────────┐
│    Client      │
└──────┬────────┘
       │ (lookup region location via ZooKeeper + META table)
       ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ RegionServer 1│    │ RegionServer 2│    │ RegionServer 3│
│  Region A     │    │  Region C     │    │  Region E     │
│  Region B     │    │  Region D     │    │  Region F     │
├───────────────┤    ├───────────────┤    ├───────────────┤
│ WAL + MemStore│    │ WAL + MemStore│    │ WAL + MemStore│
└───────┬───────┘    └───────┬───────┘    └───────┬───────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                    ┌────────────────┐
                    │     HDFS       │ (HFiles, WALs)
                    │ (3× replicated)│
                    └────────┬───────┘
                             │
                    ┌────────────────┐
                    │   ZooKeeper    │ (coordination, META location)
                    └────────────────┘
                             │
                    ┌────────────────┐
                    │    HMaster     │ (DDL, region assignment, balancing)
                    └────────────────┘

HBase Write Path

  1. Client looks up the hbase:meta table (cached locally after first lookup) to find which RegionServer hosts the target region.
  2. Client sends the write directly to the RegionServer.
  3. RegionServer appends to the WAL (Write-Ahead Log) on HDFS for durability.
  4. RegionServer writes to the MemStore (in-memory, sorted by row key).
  5. ACK returned to client. The write is durable (WAL on HDFS) and visible (MemStore).
  6. When MemStore reaches threshold (default 128 MB), it flushes to an HFile on HDFS.
  7. Compaction merges HFiles (minor: merge small HFiles; major: merge all HFiles for a region).

HBase vs. Cassandra Comparison

| Dimension | Cassandra | HBase |
|---|---|---|
| Consistency | Tunable (eventual to strong) | Always strong (CP) |
| Architecture | Peer-to-peer (masterless) | Master-slave (HMaster) |
| Dependencies | None (standalone JVM) | HDFS + ZooKeeper + HMaster |
| Write throughput | Higher (no WAL replication overhead) | Good but lower (HDFS WAL 3× replication) |
| Read latency | Tunable (faster with CL=ONE) | Consistent (single RegionServer) |
| Multi-DC | First-class support | Via replication (HBase Replication, limited) |
| Hadoop integration | Via Spark connector | Native (sits on HDFS, MapReduce, Spark) |
| Row-level transactions | Lightweight transactions (LWT, Paxos-based) | Native atomic row operations (checkAndPut) |
| Query language | CQL (SQL-like, familiar) | Java API / Apache Phoenix (SQL layer) |
| Ideal for | High-availability, multi-DC, write-heavy | Strong consistency, Hadoop ecosystem, batch analytics |

When to Use Column-Family Stores

Ideal Use Cases

| Use Case | Why Column-Family Excels | Real-World Example |
|---|---|---|
| Time-series data | Sequential writes, natural time-bucketed partitions, TTL for retention, TWCS compaction | Netflix: 2M+ writes/sec of viewing activity data in Cassandra |
| IoT telemetry | Millions of devices, each producing small frequent writes; reads are by device + time range | John Deere: billions of sensor readings from tractors and farm equipment |
| Write-heavy workloads | LSM Tree architecture makes writes sequential I/O; 100K+ writes/sec per node | Discord: billions of messages stored in Cassandra (later migrated to ScyllaDB) |
| Event sourcing / CDC | Immutable append-only events, natural fit for STCS compaction | Uber: trip events, driver location updates, pricing events |
| User activity / personalisation | Sparse data (each user has different activity patterns), fast lookups by user ID | Spotify: user listening history, recommendations, playlist metadata |
| Messaging / chat | Messages partitioned by conversation, clustered by timestamp for efficient scrollback | Apple iMessage: global-scale messaging infrastructure |

When NOT to Use Column-Family Stores

| Scenario | Why Not | Better Alternative |
|---|---|---|
| Complex joins and ad-hoc queries | No join support; query-first design limits flexibility | PostgreSQL, MySQL |
| ACID transactions across rows/tables | Only row-level atomicity; LWT is slow (Paxos round-trips) | PostgreSQL, CockroachDB, Spanner |
| Small datasets (< 10 GB) | Operational overhead of a Cassandra cluster is unjustified for small data | PostgreSQL, SQLite |
| Strong consistency everywhere (finance, banking) | Tunable consistency adds complexity; HBase better if you need Hadoop; Spanner if you need SQL | Spanner, CockroachDB, HBase |
| Graph traversals | No relationship model; each hop requires a separate query | Neo4j, Amazon Neptune |
| Full-text search | No indexing for free-text queries | Elasticsearch, OpenSearch |

Common Interview Questions

How does Cassandra achieve zero single points of failure?

Answer: Cassandra uses a peer-to-peer (masterless) architecture where every node is identical. There is no “master” or “leader” node. Any node can serve any request for any partition (acting as a coordinator). The gossip protocol disseminates cluster topology to every node, so no central registry is needed. Data is replicated to RF nodes, so even if RF−1 nodes go down, all data remains accessible (at appropriate consistency levels). Compare this to HBase, where the HMaster is a potential SPOF (mitigated by standby HMasters, but failover takes 10–30 seconds).
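The masterless routing described above can be sketched as a hash ring. This is a toy model (MD5 instead of Cassandra's Murmur3, hypothetical node names, one token per node, no vnodes): every node runs the same pure function over the gossiped topology and derives the same replica set, which is why any node can coordinate any request.

```python
import hashlib
from bisect import bisect_right

# Hypothetical five-node cluster; tokens derived by hashing node names.
RING = sorted(
    (int(hashlib.md5(n.encode()).hexdigest(), 16), n)
    for n in ["node-a", "node-b", "node-c", "node-d", "node-e"]
)
TOKENS = [t for t, _ in RING]

def replicas(partition_key: str, rf: int = 3) -> list[str]:
    """Walk clockwise from the key's token and take the next RF nodes.
    Deterministic: every node computes the same answer from gossip state."""
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    start = bisect_right(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

# Any node acting as coordinator resolves the same RF=3 replica set:
print(replicas("user_1001"))
```

Because the function is deterministic and the topology is fully replicated via gossip, there is no lookup service to fail over.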

Explain the difference between partition key and clustering key. Why does it matter?

The partition key determines where data lives (which node, via consistent hashing). All rows with the same partition key are stored on the same node(s) and in the same SSTable partition. The clustering key determines the sort order within the partition on disk.

This matters because: (1) queries on the partition key are O(1) lookups, (2) range queries on the clustering key are efficient sequential reads within a single partition, and (3) a poorly chosen partition key creates hotspots (all traffic to one node) or unbounded partitions (single partition grows to gigabytes).

Rule of thumb: Partition key = equality predicate in your WHERE clause. Clustering key = range/ordering predicate.
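The "map of sorted maps" behaviour behind this rule of thumb can be shown in a few lines of Python (illustrative names; a real partition is an on-disk sorted structure, not an in-memory list): equality on the partition key is one hash lookup, and a clustering-key range is a contiguous slice of already-sorted data.

```python
from bisect import bisect_left, bisect_right

# partition_key -> sorted list of (clustering_key, value) pairs
table: dict[str, list[tuple[str, str]]] = {}

def upsert(pk: str, ck: str, value: str) -> None:
    """Insert keeps each partition sorted by clustering key; writing an
    existing clustering key overwrites, like Cassandra's upsert semantics."""
    rows = table.setdefault(pk, [])
    i = bisect_left(rows, (ck,))
    if i < len(rows) and rows[i][0] == ck:
        rows[i] = (ck, value)
    else:
        rows.insert(i, (ck, value))

def range_query(pk: str, lo: str, hi: str) -> list[tuple[str, str]]:
    """Equality on the partition key + range on the clustering key:
    one hash lookup, then a contiguous slice of sorted rows."""
    rows = table.get(pk, [])
    return rows[bisect_left(rows, (lo,)):bisect_right(rows, (hi, "\uffff"))]

upsert("sensor-1", "2026-04-15T08:00", "20.1C")
upsert("sensor-1", "2026-04-15T09:00", "21.4C")
upsert("sensor-1", "2026-04-15T10:00", "22.0C")
print(range_query("sensor-1", "2026-04-15T08:30", "2026-04-15T09:30"))
# -> [('2026-04-15T09:00', '21.4C')]
```

A query that filtered only on the clustering key would have to touch every partition on every node, which is exactly why Cassandra rejects it without `ALLOW FILTERING`.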

You need to store 1 billion events/day with 90-day retention. Design the schema.

Calculation: 1B events/day = ~11,500 events/sec average, ~50K peak. Each event ~500 bytes = 500 GB/day raw, ~1.5 TB/day with RF=3.
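These numbers are quick to sanity-check (inputs assumed from the problem statement: 1B events/day, ~500 bytes/event, RF=3):

```python
# Back-of-envelope capacity check for the event-store design.
events_per_day = 1_000_000_000
avg_per_sec = events_per_day / 86_400              # seconds in a day
raw_bytes_per_day = events_per_day * 500           # ~500 bytes/event
replicated_tb_per_day = raw_bytes_per_day * 3 / 1e12  # RF=3

print(f"{avg_per_sec:,.0f} events/sec average")        # 11,574
print(f"{raw_bytes_per_day / 1e9:.0f} GB/day raw")     # 500
print(f"{replicated_tb_per_day:.1f} TB/day with RF=3") # 1.5
```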

Schema:

CREATE TABLE events (
  event_type TEXT,
  day        DATE,
  hour       TINYINT,
  event_time TIMESTAMP,
  event_id   TIMEUUID,
  payload    BLOB,
  PRIMARY KEY ((event_type, day, hour), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'HOURS',
                    'compaction_window_size': 1}
  AND default_time_to_live = 7776000;

Why this design: The partition key includes an hour bucket to bound partition growth—roughly 42M events/hour in total, split across event types. If a single event_type dominates, add a shard component to the partition key, since a multi-gigabyte partition far exceeds the commonly recommended ~100 MB. TWCS with one-hour windows matches the append-only write pattern, and the 90-day TTL (7,776,000 seconds) deletes data automatically. Storage: 1.5 TB/day × 90 days ≈ 135 TB with RF=3 before compression, so size the cluster from that figure—~20 nodes only works if each node comfortably holds several TB of compressed data; otherwise plan for more nodes.
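One caveat worth checking is whether the hour bucket alone keeps partitions small enough. A quick estimate (assuming ~500 bytes/event and an even split across event types) shows per-partition sizes well above the commonly recommended ~100 MB unless events spread across many types—which is why a skewed workload would want an extra shard component in the partition key:

```python
# Partition-size estimate for one (event_type, day, hour) bucket.
events_per_hour_total = 1_000_000_000 / 24   # ~42M events/hour overall

for n_types in (2, 10, 50):
    per_partition = events_per_hour_total / n_types
    size_gb = per_partition * 500 / 1e9       # ~500 bytes/event
    print(f"{n_types:>3} event types -> {per_partition / 1e6:5.1f}M events, "
          f"~{size_gb:.2f} GB per partition")
```

Even with 50 evenly loaded event types, an hourly partition is ~0.4 GB—workable in modern Cassandra but worth monitoring; with 2 types it is ~10 GB, which demands sharding.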

When would you choose HBase over Cassandra?

Choose HBase when:

  1. You need strong consistency for every operation (e.g., financial ledger where you can’t tolerate stale reads).
  2. You’re already invested in the Hadoop ecosystem (HDFS, MapReduce, Hive, Spark) and want tight integration for batch analytics.
  3. You need atomic row-level operations (HBase’s checkAndPut / checkAndDelete are native, not Paxos-based like Cassandra’s LWT).
  4. You don’t need multi-data centre replication (HBase’s cross-DC replication is limited compared to Cassandra).

Choose Cassandra when: multi-DC is required, availability > consistency, or you want a simpler operational model (no HDFS/ZK dependency).

What is a tombstone in Cassandra and why are they dangerous?

Since SSTables are immutable, Cassandra cannot delete data in place. Instead, a delete writes a special marker called a tombstone—a record that says “this cell/row was deleted at timestamp T.” During reads, the tombstone suppresses the older value. Tombstones are removed during compaction after gc_grace_seconds (default: 10 days).

Why dangerous: If you delete frequently (e.g., delete + reinsert pattern, or TTL expiring many cells), tombstones accumulate between compactions. Reads that scan many tombstones waste CPU and I/O, and if a single read encounters > 100,000 tombstones (configurable), Cassandra aborts the query with TombstoneOverwhelmingException. Mitigation: choose the right compaction strategy (TWCS for TTL data), avoid deletes when possible, and monitor nodetool tablestats for tombstone counts.
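The read-path effect of a tombstone can be sketched with a toy LSM model (simplified: newest-table-wins instead of real per-cell timestamp merging; names are illustrative). A delete appends a marker rather than erasing anything, and reads must still encounter and honour it until compaction removes it.

```python
# Toy LSM read path: immutable "SSTables", newest first, with tombstones.
TOMBSTONE = object()  # stand-in for a deletion marker with timestamp T

sstables: list[dict[str, object]] = []  # oldest first; immutable once flushed

def flush(memtable: dict) -> None:
    """Flushing writes a new immutable SSTable; nothing is ever edited in place."""
    sstables.append(dict(memtable))

def read(key: str):
    """Scan SSTables newest-to-oldest; a tombstone suppresses older values."""
    for table in reversed(sstables):
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None

flush({"user_1001": "premium"})
flush({"user_1001": TOMBSTONE})  # DELETE writes a marker, not an erase
print(read("user_1001"))         # None: the tombstone hides the old value
```

The older value still occupies disk and must still be skipped on every read until compaction (after gc_grace_seconds) drops both the value and the marker—multiply that by millions of deletes and the read amplification becomes the problem described above.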