Time-Series Databases
What Is Time-Series Data?
Time-series data is a sequence of data points indexed by time. Every measurement — a temperature reading, a stock price, a CPU utilization metric — is a point on a timeline. What makes time-series data fundamentally different from relational or document data isn't the presence of a timestamp column; it's the access pattern and write behavior that accompany it.
Core Characteristics
Time-series workloads share four properties that make general-purpose databases struggle:
- Append-only writes. Data is almost never updated or deleted. A temperature reading of 23.4°C at 14:05:03 UTC is an immutable fact. You don't go back and "edit" yesterday's CPU utilization. This means the database can optimize heavily for sequential writes and skip the complexity of in-place updates.
- Timestamp-ordered ingestion. Data arrives roughly in chronological order. A sensor sends readings at 1-second intervals; the next reading always has a later timestamp. This natural ordering enables block-based compression and range-scan optimizations that random-access workloads can't exploit.
- Extreme write throughput. A single IoT deployment with 100,000 sensors reporting every second generates 100,000 writes/second — 8.64 billion data points per day. A metrics monitoring platform for a 10,000-node Kubernetes cluster easily generates 1 million metrics/second. Traditional RDBMS row-based inserts collapse under this load.
- Time-range queries dominate. Users almost always query by time window: "Show me CPU usage for the last 6 hours," "What was the average temperature between 2 PM and 4 PM yesterday?" Range scans on the time axis account for 90%+ of all queries.
Why Not Just Use PostgreSQL?
A fair question. PostgreSQL is excellent — and TimescaleDB proves you can build a competitive TSDB on top of it. But a vanilla PostgreSQL instance struggles with time-series workloads for specific reasons:
| Challenge | PostgreSQL Behavior | TSDB Optimization |
|---|---|---|
| Write amplification | Each INSERT updates the heap, WAL, and all indexes. B-tree index maintenance is O(log n) per row. | LSM-tree or append-only storage: O(1) amortized per write. Batch inserts into sorted runs. |
| Table bloat | MVCC creates dead tuples that require VACUUM. At 100K writes/sec, autovacuum can't keep up. | Immutable segments eliminate dead tuples entirely. No vacuum needed. |
| Index size | B-tree on timestamp column grows unbounded. A 1-year index on 1-second data = 31.5M entries. | Block-level min/max indexes. Skip entire blocks that don't match the time range. |
| Compression | TOAST compresses individual values, but row-level storage limits column-level compression ratios. | Columnar layout + domain-specific codecs (Gorilla, delta-of-delta) achieve 10–20× compression. |
| Retention | DELETE FROM ... WHERE ts < '2024-01-01' generates massive WAL, locks rows, triggers vacuum. | DROP CHUNK or drop entire partition file in O(1). No row-level delete overhead. |
The Universal Time-Series Data Model
Despite differences in query languages, all TSDBs share a conceptual data model:
| Metric Name | Tags / Labels | Timestamp | Value(s) |
|---|---|---|---|
| cpu_usage | host=web-01, region=us-east | 2026-04-15T14:05:03 | 72.4 |
| cpu_usage | host=web-02, region=us-east | 2026-04-15T14:05:03 | 58.1 |
| temperature | sensor=A7, building=HQ | 2026-04-15T14:05:03 | 23.4 |
| stock_price | symbol=AAPL, exchange=NASDAQ | 2026-04-15T14:05:03 | 178.52 |
- Metric name: What you're measuring (cpu_usage, temperature, request_latency).
- Tags / labels: Key-value pairs that identify the source. These are indexed and used for filtering and grouping. High-cardinality tags (like user_id) are a common pitfall: a schema with 1,000 base series that picks up a user_id tag with 1M unique values explodes to 1 billion series. High cardinality destroys TSDB performance because each series needs its own index entry and storage block (the sketch below makes the math concrete).
- Timestamp: Typically nanosecond or millisecond precision. Stored as an int64 epoch for efficient compression.
- Field values: The actual measurements. Can be float64, int64, string, or boolean. Fields are not indexed; they can only be filtered via full scans within a time range.
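Before settling on a tag schema, it's worth doing the cardinality arithmetic explicitly. A minimal Python sketch (the counts are illustrative, not from any particular deployment):

```python
from math import prod

def series_cardinality(unique_tag_values: dict[str, int]) -> int:
    """Worst-case number of distinct series: the product of the
    unique-value counts of every tag key."""
    return prod(unique_tag_values.values())

# Bounded tags keep the series count manageable.
ok = series_cardinality({"host": 1_000, "region": 5})
# One unbounded tag multiplies everything it touches.
bad = series_cardinality({"host": 1_000, "region": 5, "user_id": 1_000_000})

print(f"{ok:,}")    # 5,000
print(f"{bad:,}")   # 5,000,000,000
```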
InfluxDB
InfluxDB is the most widely deployed purpose-built time-series database. Written in Go, it's designed from the ground up for time-series workloads with no dependencies on an external storage engine. The core innovation is the Time-Structured Merge (TSM) tree — a variant of LSM-trees optimized for time-ordered data.
Architecture Overview
┌─────────────────────────────────────────────────────┐
│ InfluxDB Instance │
│ │
│ ┌─────────┐ ┌─────────────────┐ ┌──────────┐ │
│ │ HTTP │──▶│ Write Path │──▶│ WAL │ │
│ │ API │ │ (Line Protocol)│ │ (append) │ │
│ └─────────┘ └────────┬────────┘ └──────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ In-Memory │ │
│ │ Cache │ │
│ │ (sorted map) │ │
│ └───────┬───────┘ │
│ │ flush (when full or timed)│
│ ▼ │
│ ┌───────────────────────────────┐ │
│ │ TSM Files (on disk) │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │ TSM1 │ │ TSM2 │ │ TSM3 │ │ │
│ │ └──────┘ └──────┘ └──────┘ │ │
│ └──────────────┬────────────────┘ │
│ │ compaction │
│ ▼ │
│ ┌───────────────────────────────┐ │
│ │ Compacted TSM Files │ │
│ │ ┌─────────────┐ ┌──────────┐ │ │
│ │ │ TSM-C1 │ │ TSM-C2 │ │ │
│ │ └─────────────┘ └──────────┘ │ │
│ └───────────────────────────────┘ │
│ │
│ ┌─────────┐ ┌─────────────────┐ │
│ │ Query │──▶│ Read Path │ │
│ │ Engine │ │ (InfluxQL/Flux)│ │
│ └─────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────┘
Line Protocol: The Wire Format
InfluxDB uses a text-based line protocol for writes. Each line represents a single data point:
measurement,tag_key=tag_val,tag_key2=tag_val2 field_key=field_val,field_key2=field_val2 timestamp
Concrete examples — let's say you're monitoring a web server fleet:
# CPU usage for two hosts
cpu,host=web-01,region=us-east usage_idle=72.4,usage_system=12.1,usage_user=15.5 1713189903000000000
cpu,host=web-02,region=us-east usage_idle=58.1,usage_system=18.7,usage_user=23.2 1713189903000000000
# Memory metrics
mem,host=web-01,region=us-east total=16856023040i,used=12340174848i,used_percent=73.2 1713189903000000000
# HTTP request latency (histogram bucket)
http_request,method=GET,endpoint=/api/users,status=200 latency_ms=45.2,count=1i 1713189903000000000
http_request,method=POST,endpoint=/api/orders,status=201 latency_ms=128.7,count=1i 1713189903000000000
# IoT sensor reading
temperature,sensor_id=A7,building=HQ,floor=3 value=23.4,humidity=67.8 1713189903000000000
# Stock price data
stock_price,symbol=AAPL,exchange=NASDAQ open=177.30,high=179.12,low=176.88,close=178.52,volume=52340000i 1713189903000000000
Protocol details:
- Measurement: Equivalent to a table name. No schema definition needed — schema-on-write.
- Tags: Comma-separated after the measurement. Always string values. Indexed for fast lookups. Separated from fields by a single space.
- Fields: After the space. Support float64 (default), int64 (suffix i), string (quoted), boolean (true/false). Not indexed.
- Timestamp: Nanosecond-precision Unix epoch. Optional; the server assigns the current time if omitted.
Batch write example using curl:
# Write 3 points in a single HTTP POST (batching is critical for throughput)
curl -i -XPOST 'http://localhost:8086/write?db=metrics&precision=ns' \
--data-binary '
cpu,host=web-01,region=us-east usage_idle=72.4 1713189903000000000
cpu,host=web-01,region=us-east usage_idle=71.8 1713189904000000000
cpu,host=web-01,region=us-east usage_idle=73.1 1713189905000000000
'
# Response: HTTP/1.1 204 No Content (success, no body)
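The same batched write from application code. This sketch uses Python's requests library against the v1 /write endpoint from the curl example; the URL, database name, and measurement are carried over from above, so adjust them for a real deployment:

```python
import time

import requests

WRITE_URL = "http://localhost:8086/write"   # v1 endpoint, as in the curl example
PARAMS = {"db": "metrics", "precision": "ns"}

def to_line(measurement, tags, fields, ts_ns):
    """Serialize one point into line protocol."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# Buffer points locally and flush them in one POST -- batching,
# not per-point requests, is what sustains high ingest rates.
base = time.time_ns()
batch = [
    to_line("cpu", {"host": "web-01", "region": "us-east"},
            {"usage_idle": 72.4 - 0.2 * i}, base + i * 1_000_000_000)
    for i in range(3)
]
resp = requests.post(WRITE_URL, params=PARAMS, data="\n".join(batch))
resp.raise_for_status()   # expect HTTP 204 No Content
```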
Querying: InfluxQL and Flux
InfluxDB supports two query languages. InfluxQL is SQL-like and familiar; Flux is a functional, pipe-based language with richer transformations.
InfluxQL examples:
-- Average CPU usage per host over the last hour, grouped into 5-minute windows
SELECT MEAN("usage_idle") FROM "cpu"
WHERE time > now() - 1h
GROUP BY time(5m), "host"
FILL(previous)
-- 95th percentile HTTP latency by endpoint over the last 24 hours
SELECT PERCENTILE("latency_ms", 95) FROM "http_request"
WHERE time > now() - 24h AND "method" = 'GET'
GROUP BY time(1h), "endpoint"
-- Count distinct sensors reporting in the last 10 minutes
-- (DISTINCT() only works on fields, so this assumes sensor_id is stored
-- as a field; tags can't be passed to DISTINCT directly)
SELECT COUNT(DISTINCT("sensor_id")) FROM "temperature"
WHERE time > now() - 10m
-- Continuous query: auto-downsample CPU data to 1-hour averages
CREATE CONTINUOUS QUERY "cq_cpu_1h" ON "metrics"
BEGIN
SELECT MEAN("usage_idle") AS "mean_idle",
MAX("usage_idle") AS "max_idle",
MIN("usage_idle") AS "min_idle"
INTO "metrics_downsampled"."autogen"."cpu_1h"
FROM "cpu"
GROUP BY time(1h), "host"
END
Flux examples:
// Same query in Flux — CPU average per host, last hour, 5-min windows
from(bucket: "metrics")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
|> aggregateWindow(every: 5m, fn: mean, createEmpty: false)
|> group(columns: ["host"])
|> yield(name: "cpu_by_host")
// Alert: find hosts where CPU usage exceeds 90% for 5+ consecutive minutes
from(bucket: "metrics")
|> range(start: -30m)
|> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
|> aggregateWindow(every: 1m, fn: mean)
|> map(fn: (r) => ({r with _value: 100.0 - r._value})) // Convert idle → usage
|> filter(fn: (r) => r._value > 90.0)
|> stateCount(fn: (r) => r._value > 90.0, column: "consecutive_high")
|> filter(fn: (r) => r.consecutive_high >= 5)
|> yield(name: "alert_high_cpu")
The TSM Storage Engine
TSM (Time-Structured Merge) is InfluxDB's core storage engine. It's conceptually similar to an LSM-tree but tailored for time-series access patterns:
- Write-Ahead Log (WAL): Every incoming write is first appended to the WAL — a sequential, append-only file. This guarantees durability: if the process crashes, the WAL replays on restart. WAL writes are O(1) — no random I/O.
- In-Memory Cache: Data is simultaneously inserted into an in-memory sorted map (the cache). The cache serves recent reads instantly. It's organized by series key → sorted timestamp list.
- Flush to TSM: When the cache exceeds a size threshold (default 25 MB) or a time interval elapses (default 10 minutes), it's flushed to disk as an immutable TSM file. Each TSM file contains compressed blocks of data for one or more series, sorted by timestamp.
- Compaction: Background compaction merges multiple small TSM files into fewer, larger ones. This reduces file count (lowering file-descriptor usage) and improves read performance by reducing the number of files a query must scan. Compaction proceeds through several levels, from freshly snapshotted files up to fully compacted ones.
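The write path is easier to internalize in code. Here's a toy Python sketch of the WAL → cache → flush cycle; it mirrors the shape of the TSM pipeline (append-only log, sorted in-memory map, immutable flushed segments), but none of the names or thresholds come from InfluxDB itself:

```python
import json
from collections import defaultdict

class ToyShard:
    def __init__(self, wal_path, cache_limit=4):
        self.wal = open(wal_path, "a")        # sequential, append-only log
        self.cache = defaultdict(list)        # series key -> [(ts, value)]
        self.points_in_cache = 0
        self.cache_limit = cache_limit        # flush threshold, in points
        self.segments = []                    # immutable "TSM files"

    def write(self, series_key, ts, value):
        # 1. Durability first: append to the WAL (O(1) sequential I/O).
        self.wal.write(json.dumps([series_key, ts, value]) + "\n")
        self.wal.flush()
        # 2. Mirror the point into the in-memory cache for recent reads.
        self.cache[series_key].append((ts, value))
        self.points_in_cache += 1
        # 3. When the cache fills, flush it as an immutable segment.
        if self.points_in_cache >= self.cache_limit:
            self._flush()

    def _flush(self):
        # Sort each series by timestamp; a real engine compresses here.
        self.segments.append({k: sorted(v) for k, v in self.cache.items()})
        self.cache.clear()
        self.points_in_cache = 0

shard = ToyShard("/tmp/toy_shard.wal")
for i in range(8):
    shard.write("cpu,host=web-01#usage_idle", 1713189903 + i, 72.4 - 0.1 * i)
print(len(shard.segments), "segments flushed")   # 2
```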
TSM file internal structure:
┌──────────────────────────────────────────────┐
│ TSM File Layout │
├──────────────────────────────────────────────┤
│ Data Blocks (compressed) │
│ ┌────────────────────────────────────────┐ │
│ │ Block 1: cpu,host=web-01#usage_idle │ │
│ │ Timestamps: [t1, t2, ..., t1000] │ │
│ │ Values: [72.4, 71.8, ..., 73.1] │ │
│ │ Encoding: delta-of-delta + Gorilla │ │
│ ├────────────────────────────────────────┤ │
│ │ Block 2: cpu,host=web-02#usage_idle │ │
│ │ Timestamps: [t1, t2, ..., t1000] │ │
│ │ Values: [58.1, 59.3, ..., 57.8] │ │
│ ├────────────────────────────────────────┤ │
│ │ Block N: ... │ │
│ └────────────────────────────────────────┘ │
├──────────────────────────────────────────────┤
│ Index (series key → block offset + range) │
│ ┌────────────────────────────────────────┐ │
│ │ "cpu,host=web-01#usage_idle" │ │
│ │ Block 1: offset=0, min_t=t1, max_t=t1000│
│ │ "cpu,host=web-02#usage_idle" │ │
│ │ Block 2: offset=4096, min_t=t1, ... │ │
│ └────────────────────────────────────────┘ │
├──────────────────────────────────────────────┤
│ Footer: index offset, version, checksum │
└──────────────────────────────────────────────┘
TimescaleDB
TimescaleDB takes the opposite approach from InfluxDB: instead of building a new database, it extends PostgreSQL. It's implemented as a PostgreSQL extension that automatically partitions data by time into hypertables and chunks, while preserving full SQL compatibility, joins, triggers, and the entire PostgreSQL ecosystem.
Hypertables and Chunks
A hypertable is TimescaleDB's core abstraction. It looks and behaves like a regular PostgreSQL table, but under the hood, it's automatically partitioned into chunks — each chunk is a standard PostgreSQL table covering a specific time range.
Hypertable: "sensor_data"
┌───────────────────────────────────────────────────────┐
│ │
│ User sees: one table, standard SQL │
│ │
│ Under the hood: │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Chunk 1 │ │ Chunk 2 │ │ Chunk 3 │ ... │
│ │ Apr 1–7 │ │ Apr 8–14 │ │ Apr 15–21 │ │
│ │ (PG table) │ │ (PG table) │ │ (PG table) │ │
│ │ 12M rows │ │ 12M rows │ │ 8M rows │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Each chunk has its own indexes, TOAST, statistics │
│ Chunk size chosen so active chunk fits in memory │
└───────────────────────────────────────────────────────┘
Creating a hypertable:
-- Step 1: Create a regular PostgreSQL table
CREATE TABLE sensor_data (
time TIMESTAMPTZ NOT NULL,
sensor_id INTEGER NOT NULL,
building TEXT NOT NULL,
floor SMALLINT NOT NULL,
temperature DOUBLE PRECISION,
humidity DOUBLE PRECISION,
pressure DOUBLE PRECISION,
battery_pct SMALLINT
);
-- Step 2: Convert to hypertable — TimescaleDB takes over partitioning
-- chunk_time_interval: each chunk covers 7 days of data
SELECT create_hypertable('sensor_data', 'time',
chunk_time_interval => INTERVAL '7 days'
);
-- Step 3: Add indexes optimized for common queries
CREATE INDEX idx_sensor_time ON sensor_data (sensor_id, time DESC);
CREATE INDEX idx_building ON sensor_data (building, time DESC);
-- Step 4: Enable compression on chunks older than 7 days
ALTER TABLE sensor_data SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'sensor_id',
timescaledb.compress_orderby = 'time DESC'
);
-- Automatically compress chunks older than 7 days
SELECT add_compression_policy('sensor_data', INTERVAL '7 days');
Inserting data — standard SQL INSERT:
-- Single row insert
INSERT INTO sensor_data (time, sensor_id, building, floor, temperature, humidity, pressure, battery_pct)
VALUES ('2026-04-15 14:05:03+00', 7, 'HQ', 3, 23.4, 67.8, 1013.25, 94);
-- Batch insert (COPY is fastest for bulk loads)
COPY sensor_data FROM STDIN WITH (FORMAT csv);
2026-04-15 14:05:03+00,7,HQ,3,23.4,67.8,1013.25,94
2026-04-15 14:05:03+00,12,HQ,5,22.1,71.2,1013.20,88
2026-04-15 14:05:03+00,19,Warehouse-B,1,18.7,55.3,1012.98,76
\.
-- Multi-row VALUES (good for application-level batching)
INSERT INTO sensor_data VALUES
('2026-04-15 14:05:04+00', 7, 'HQ', 3, 23.5, 67.6, 1013.25, 94),
('2026-04-15 14:05:04+00', 12, 'HQ', 5, 22.0, 71.4, 1013.20, 88),
('2026-04-15 14:05:04+00', 19, 'Warehouse-B', 1, 18.8, 55.1, 1012.98, 76);
Continuous Aggregates
Continuous aggregates are TimescaleDB's killer feature for dashboards. They're materialized views that automatically refresh as new data arrives — giving you pre-computed rollups without manual cron jobs:
-- Create a continuous aggregate: hourly temperature stats per building
CREATE MATERIALIZED VIEW hourly_building_temp
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS bucket,
building,
AVG(temperature) AS avg_temp,
MIN(temperature) AS min_temp,
MAX(temperature) AS max_temp,
COUNT(*) AS sample_count,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY temperature) AS p95_temp
FROM sensor_data
GROUP BY bucket, building
WITH NO DATA; -- Don't backfill historical data yet
-- Add a policy to auto-refresh (refresh lag = 1 hour, window = 2 days)
SELECT add_continuous_aggregate_policy('hourly_building_temp',
start_offset => INTERVAL '2 days',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour'
);
-- Query the continuous aggregate — instant response, no scanning raw data
SELECT bucket, building, avg_temp, max_temp, sample_count
FROM hourly_building_temp
WHERE bucket > now() - INTERVAL '7 days'
AND building = 'HQ'
ORDER BY bucket DESC;
-- Create a daily aggregate ON TOP of the hourly aggregate (cascading)
CREATE MATERIALIZED VIEW daily_building_temp
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', bucket) AS day,
building,
AVG(avg_temp) AS avg_temp,
MIN(min_temp) AS min_temp,
MAX(max_temp) AS max_temp,
SUM(sample_count) AS total_samples
FROM hourly_building_temp
GROUP BY day, building
WITH NO DATA;
Retention and Tiered Storage
-- Automatically drop raw data chunks older than 90 days
SELECT add_retention_policy('sensor_data', INTERVAL '90 days');
-- But keep the hourly aggregates for 2 years
SELECT add_retention_policy('hourly_building_temp', INTERVAL '2 years');
-- And daily aggregates forever (no retention policy)
-- This gives a tiered data lifecycle:
-- Raw data: 0–90 days (full resolution, ~500 GB)
-- Hourly: 0–2 years (compressed, ~5 GB)
-- Daily: forever (tiny, ~50 MB)
Advanced Queries
-- Time-weighted average (handles irregular sampling intervals)
SELECT
time_bucket('1 hour', time) AS hour,
sensor_id,
    average(time_weight('Linear', time, temperature)) AS tw_avg_temp  -- requires timescaledb_toolkit
FROM sensor_data
WHERE time > now() - INTERVAL '24 hours'
GROUP BY hour, sensor_id;
-- Gap filling: fill in missing data points with interpolation
SELECT
time_bucket_gapfill('15 minutes', time) AS period,
sensor_id,
locf(AVG(temperature)) AS temp_filled -- Last Observation Carried Forward
FROM sensor_data
WHERE time > now() - INTERVAL '6 hours'
AND sensor_id = 7
GROUP BY period, sensor_id
ORDER BY period;
-- Joins with relational data — the PostgreSQL advantage
SELECT
s.time,
s.temperature,
s.humidity,
d.sensor_name,
d.calibration_date,
b.building_name,
b.address
FROM sensor_data s
JOIN sensor_directory d ON s.sensor_id = d.id
JOIN buildings b ON s.building = b.code
WHERE s.time > now() - INTERVAL '1 hour'
AND b.city = 'San Francisco'
ORDER BY s.time DESC;
Prometheus
Prometheus is not a general-purpose TSDB — it's a metrics monitoring system with a built-in time-series storage engine. Designed at SoundCloud and now a CNCF graduated project, it's the de facto standard for Kubernetes and cloud-native observability.
Pull-Based Model
Unlike InfluxDB (push-based), Prometheus pulls metrics from targets at configurable intervals:
┌────────────────┐ ┌────────────────┐
│ Application │ scrape │ Prometheus │
│ /metrics │◀────────│ Server │
│ endpoint │ (HTTP) │ │
└────────────────┘ │ ┌──────────┐ │
│ │ TSDB │ │
┌────────────────┐ │ │ (local) │ │
│ Node Exporter │◀────────│ └──────────┘ │
│ :9100/metrics │ every │ │
└────────────────┘ 15s │ ┌──────────┐ │
│ │ PromQL │ │
┌────────────────┐ │ │ Engine │ │
│ kube-state │◀────────│ └──────────┘ │
│ -metrics │ └───────┬────────┘
└────────────────┘ │
┌──────▼──────┐
│ Grafana │
│ Dashboards │
└─────────────┘
Exposition format (what /metrics returns):
# HELP http_requests_total Total HTTP requests processed.
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 145832
http_requests_total{method="POST",endpoint="/api/orders",status="201"} 8847
http_requests_total{method="GET",endpoint="/api/users",status="500"} 23
# HELP http_request_duration_seconds HTTP request latency in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",le="0.005"} 98232
http_request_duration_seconds_bucket{method="GET",le="0.01"} 112845
http_request_duration_seconds_bucket{method="GET",le="0.025"} 130921
http_request_duration_seconds_bucket{method="GET",le="0.05"} 139442
http_request_duration_seconds_bucket{method="GET",le="0.1"} 143200
http_request_duration_seconds_bucket{method="GET",le="+Inf"} 145832
http_request_duration_seconds_sum{method="GET"} 2917.32
http_request_duration_seconds_count{method="GET"} 145832
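Producing this format by hand is rarely necessary; official client libraries handle registration, label sets, and the /metrics endpoint. A minimal sketch with the Python prometheus_client package (the metric names mirror the example above; the port and the simulated work are arbitrary):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests processed.",
    ["method", "endpoint", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds.",
    ["method"], buckets=(0.005, 0.01, 0.025, 0.05, 0.1),
)

def handle_request():
    with LATENCY.labels(method="GET").time():    # records duration on exit
        time.sleep(random.uniform(0.001, 0.05))  # stand-in for real work
    REQUESTS.labels(method="GET", endpoint="/api/users", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```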
PromQL query examples:
# Request rate over the last 5 minutes (per-second)
rate(http_requests_total{status="200"}[5m])
# 99th percentile latency from histogram
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
)
# CPU usage percentage per instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Alert rule: high error rate
# fires when >5% of requests return 5xx for 10+ minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
Prometheus TSDB Storage
Prometheus v2 uses a custom TSDB engine that writes data in 2-hour blocks:
- Head block: In-memory, accepts all current writes. Backed by a WAL for durability.
- Persistent blocks: Every 2 hours, the head block is cut and written to disk as an immutable block. Each block contains chunks (compressed samples), an index (label → series → chunk mapping), and tombstones (deletion markers).
- Compaction: Older blocks are merged into larger blocks (up to 31 days). This reduces file count and improves query performance for long-range queries.
- Retention: Configurable by time (--storage.tsdb.retention.time=90d) or size (--storage.tsdb.retention.size=50GB). Old blocks are deleted once they fall outside the limit.
Compression Techniques
Time-series data compresses exceptionally well because of its inherent patterns: timestamps are sequential, values change slowly, and entire series share the same data type. Purpose-built TSDBs exploit these patterns with domain-specific compression codecs that achieve 10–20× compression ratios compared to raw storage.
Gorilla Encoding (Facebook, 2015)
The seminal paper "Gorilla: A Fast, Scalable, In-Memory Time Series Database" by Facebook introduced two compression techniques that have been adopted by virtually every TSDB since:
Delta-of-Delta for Timestamps
When data arrives at regular intervals (e.g., every 60 seconds), timestamps are highly predictable. Instead of storing absolute timestamps, we store the delta of deltas:
Raw timestamps (Unix epoch seconds):
t0 = 1713189900 (base timestamp)
t1 = 1713189960 (t0 + 60)
t2 = 1713190020 (t1 + 60)
t3 = 1713190080 (t2 + 60)
t4 = 1713190141 (t3 + 61, slight jitter)
Step 1: Compute deltas (Δ)
Δ1 = t1 - t0 = 60
Δ2 = t2 - t1 = 60
Δ3 = t3 - t2 = 60
Δ4 = t4 - t3 = 61
Step 2: Compute delta-of-deltas (ΔΔ)
ΔΔ2 = Δ2 - Δ1 = 0 → encode as single '0' bit
ΔΔ3 = Δ3 - Δ2 = 0 → encode as single '0' bit
ΔΔ4 = Δ4 - Δ3 = 1 → encode as '10' + 7-bit value = 9 bits
Storage comparison for 1000 timestamps at 60s intervals:
Raw: 1000 × 64 bits = 64,000 bits (8,000 bytes)
Delta: 1 × 64 + 999 × 10 bits ≈ 10,054 bits (1,257 bytes)
Delta-of-delta: 1 × 64 + 1 × 10 + 998 × 1 bit ≈ 1,072 bits (134 bytes)
Compression ratio: 8,000 / 134 ≈ 59.7×
The variable-length encoding for delta-of-delta values uses these bit patterns:
| ΔΔ value | Encoding | Total bits |
|---|---|---|
| 0 | 0 | 1 bit |
| -63 to 64 | 10 + 7-bit value | 9 bits |
| -255 to 256 | 110 + 9-bit value | 12 bits |
| -2047 to 2048 | 1110 + 12-bit value | 16 bits |
| Anything else | 1111 + 32-bit value | 36 bits |
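The bucket table translates directly into code. A small Python sketch that computes the encoded size of a timestamp stream; the header costs (64-bit first timestamp, 10-bit first delta) follow the worked example above rather than any specific implementation:

```python
def delta_of_delta_bits(timestamps):
    """Total bits to encode `timestamps` using the bucket table above."""
    bits = 64                        # first timestamp stored raw
    if len(timestamps) < 2:
        return bits
    bits += 10                       # first delta
    prev_delta = timestamps[1] - timestamps[0]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        dod = (cur - prev) - prev_delta
        prev_delta = cur - prev
        if dod == 0:
            bits += 1                # the common case: perfectly regular
        elif -63 <= dod <= 64:
            bits += 9
        elif -255 <= dod <= 256:
            bits += 12
        elif -2047 <= dod <= 2048:
            bits += 16
        else:
            bits += 36
    return bits

# 1,000 readings at a regular 60 s interval: 1,072 bits vs 64,000 raw.
ts = [1713189900 + 60 * i for i in range(1000)]
print(delta_of_delta_bits(ts))       # 1072
```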
XOR Encoding for Float Values
Consecutive sensor readings tend to be very similar. If temperature is 23.4°C now, it's probably 23.3°C or 23.5°C next. XOR encoding exploits this:
Values: 23.40, 23.42, 23.38, 23.41
IEEE 754 double-precision representation (64 bits each):
v0 = 0x4037666666666666 (23.40)
v1 = 0x40376B851EB851EC (23.42)
v2 = 0x40376147AE147AE1 (23.38)
v3 = 0x403768F5C28F5C29 (23.41)
XOR with previous value:
v0: store full 64 bits (first value)
v1 XOR v0 = 0x00000DE378DE378A (20 leading zero bits)
v2 XOR v1 = 0x00000AC2B0AC2B0D (20 leading zero bits)
v3 XOR v2 = 0x000009B26C9B26C8 (20 leading, 3 trailing zero bits)
(These decimal values have long binary expansions, so ~40 meaningful bits
remain; samples that repeat exactly compress to a single bit.)
Encoding rules:
- If XOR = 0 (same value): store single '0' bit
- If XOR != 0:
- Store '1' bit
- If leading/trailing zeros same as previous: '0' + meaningful bits only
- Otherwise: '1' + 5-bit leading zeros count + 6-bit block size + meaningful bits
For slowly changing values, most XOR results have 50+ leading zeros,
requiring only 10–15 bits per value instead of 64.
Facebook's published Gorilla result: data points (timestamp + value)
compressed to an average of 1.37 bytes each, down from 16 bytes raw,
a ~12× reduction. Roughly half of all values were identical to the
previous sample and cost a single '0' bit.
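A Python sketch of the XOR step, using struct to view doubles as raw bits. It reports the leading/trailing zero counts that the encoder exploits; the full bit-packing bookkeeping is omitted:

```python
import struct

def float_bits(x):
    """IEEE 754 double reinterpreted as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_deltas(values):
    """Yield (xor, leading_zeros, trailing_zeros) for consecutive samples."""
    prev = float_bits(values[0])
    for v in values[1:]:
        cur = float_bits(v)
        x = prev ^ cur
        prev = cur
        if x == 0:
            yield x, 64, 64                    # identical sample: one '0' bit
        else:
            lead = 64 - x.bit_length()         # zeros before the first set bit
            trail = (x & -x).bit_length() - 1  # zeros after the last set bit
            yield x, lead, trail

for x, lead, trail in xor_deltas([23.40, 23.42, 23.38, 23.41]):
    meaningful = 0 if x == 0 else 64 - lead - trail
    print(f"xor=0x{x:016X}  lead={lead}  trail={trail}  meaningful={meaningful}")
```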
End-to-End Compression Ratios
| Data Type | Raw Size/point | Compressed | Ratio | Technique |
|---|---|---|---|---|
| Timestamps (regular) | 8 bytes | ~0.13 bytes | ~60× | Delta-of-delta + VLE |
| Timestamps (irregular) | 8 bytes | ~1.5 bytes | ~5× | Delta-of-delta + VLE |
| Float values (smooth) | 8 bytes | ~0.17 bytes | ~47× | XOR + bit packing |
| Float values (volatile) | 8 bytes | ~2.0 bytes | ~4× | XOR + bit packing |
| Integer counters | 8 bytes | ~0.5 bytes | ~16× | Delta + varint |
| Boolean flags | 1 byte | ~0.13 bytes | ~8× | Run-length encoding |
Worked example — 1 year of server metrics:
Scenario: 500 servers, 30 metrics each, sampled every 15 seconds
Data points per day: 500 × 30 × (86,400 / 15) = 86,400,000 points/day
Data points per year: 86,400,000 × 365 = 31,536,000,000 points (~31.5 billion)
Raw storage (timestamp + value per point):
31.5B × 16 bytes = 504 GB
With Gorilla compression (timestamps regular, values smooth):
Timestamps: 31.5B × 0.13 bytes = 4.1 GB
Values: 31.5B × 0.17 bytes = 5.4 GB
Total: ~9.5 GB
With additional block-level LZ4/Snappy:
~6.5 GB (additional 1.5× from general-purpose compression)
Final compression ratio: 504 GB / 6.5 GB ≈ 77.5×
This means 1 year of metrics for 500 servers fits in under 10 GB on a single SSD.
Downsampling & Retention Policies
No organization needs second-level granularity for data that's 3 years old. Downsampling is the practice of replacing high-resolution data with lower-resolution aggregates as data ages. Combined with retention policies, this creates a tiered data lifecycle that balances storage cost with query usefulness.
Typical Downsampling Strategy
| Data Age | Resolution | Aggregations Kept | Storage (500 servers, 30 metrics) |
|---|---|---|---|
| 0–7 days | Raw (15s) | All individual data points | ~1.2 GB |
| 7–30 days | 1 minute | min, max, avg, count, sum | ~400 MB |
| 30–365 days | 1 hour | min, max, avg, count, p95, p99 | ~150 MB |
| 1–5 years | 1 day | min, max, avg, count | ~10 MB |
Total storage for 5 years: ~1.76 GB instead of 2.52 TB raw — a 1,431× reduction.
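A quick Python sanity check of that claim; raw storage assumes 16 bytes per point as in the compression example, and the per-tier sizes come straight from the table above:

```python
SERVERS, METRICS, INTERVAL_S = 500, 30, 15
points_per_day = SERVERS * METRICS * 86_400 // INTERVAL_S   # 86,400,000

# Raw full-resolution storage for 5 years at 16 bytes/point.
raw_5y_tb = points_per_day * 365 * 5 * 16 / 1e12
print(f"raw, 5 years: {raw_5y_tb:.2f} TB")                  # ~2.52 TB

# Tiered sizes from the table, in GB.
tiers_gb = {"raw 0-7d": 1.2, "1-min 7-30d": 0.4,
            "1-hour 30-365d": 0.15, "1-day 1-5y": 0.01}
total_gb = sum(tiers_gb.values())
print(f"tiered, 5 years: {total_gb:.2f} GB")                # ~1.76 GB
print(f"reduction: {raw_5y_tb * 1000 / total_gb:,.0f}x")    # ~1,430x, modulo rounding
```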
InfluxDB Retention & Downsampling
-- Create retention policies with different durations
CREATE RETENTION POLICY "raw" ON "metrics" DURATION 7d REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "rollup_1m" ON "metrics" DURATION 30d REPLICATION 1
CREATE RETENTION POLICY "rollup_1h" ON "metrics" DURATION 365d REPLICATION 1
CREATE RETENTION POLICY "rollup_1d" ON "metrics" DURATION INF REPLICATION 1
-- Continuous queries to auto-downsample
-- Raw (15s) → 1-minute rollup. MEAN(*) etc. auto-name their output fields
-- (mean_<field>, min_<field>, max_<field>, count_<field>).
CREATE CONTINUOUS QUERY "cq_1m" ON "metrics"
BEGIN
  SELECT MEAN(*), MIN(*), MAX(*), COUNT(*)
  INTO "metrics"."rollup_1m".:MEASUREMENT
  FROM /.*/ GROUP BY time(1m), *
END
-- 1-minute → 1-hour rollup (the already-prefixed fields gain another
-- prefix, e.g. mean_mean_usage_idle)
CREATE CONTINUOUS QUERY "cq_1h" ON "metrics"
BEGIN
  SELECT MEAN(*), MIN(*), MAX(*)
  INTO "metrics"."rollup_1h".:MEASUREMENT
  FROM "metrics"."rollup_1m"./.*/ GROUP BY time(1h), *
END
Write-Optimized Storage
Time-series databases must sustain millions of writes per second while maintaining reasonable read performance. The key data structure that enables this is the LSM tree (Log-Structured Merge-tree) and its time-series variants.
LSM-Tree Recap for Time-Series
Write Path (optimized for sequential I/O):
┌───────────────────────────────────────────────────────────┐
│ │
│ 1. Write arrives │
│ ↓ │
│ 2. Append to WAL (sequential write, O(1)) │
│ ↓ │
│ 3. Insert into MemTable (in-memory sorted structure) │
│ ↓ (when MemTable full) │
│ 4. Flush to SSTable on disk (sorted, immutable) │
│ ↓ (background) │
│ 5. Compaction merges SSTables → fewer, larger files │
│ │
└───────────────────────────────────────────────────────────┘
Read Path:
┌───────────────────────────────────────────────────────────┐
│ 1. Check MemTable (most recent data, in-memory) │
│ 2. Check Block Bloom filters (skip irrelevant SSTables) │
│ 3. Check block min/max time index (skip out-of-range) │
│ 4. Read + decompress matching blocks │
│ 5. Merge results from multiple SSTables │
└───────────────────────────────────────────────────────────┘
Time-Series LSM Optimizations
TSDBs enhance the basic LSM approach with time-aware optimizations:
- Time-based partitioning: Data is first partitioned by time window (e.g., 2-hour blocks in Prometheus, 1-day shards in InfluxDB). This means compaction only merges files within the same time window, keeping the merge scope small.
- Series-key sorting within blocks: Within each time partition, data is sorted by series key (metric + tags), then by timestamp. This enables efficient range scans for a single series across a time window.
- Block-level min/max indexes: Each block stores the minimum and maximum timestamp it contains. A query for "last 5 minutes" can skip every block whose max_timestamp is older than 5 minutes ago; no Bloom filter is needed (see the sketch after this list).
- Columnar storage within blocks: Timestamps and values are stored in separate columns within each block. This enables type-specific compression (delta-of-delta for timestamps, XOR for floats) and allows reading only the columns a query needs.
- Write batching: Incoming data points are grouped by series key in the MemTable. When flushing, all points for a series are written contiguously, maximizing compression ratios and sequential read performance.
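To see why the block-level min/max index in the list above is so effective, here's a hypothetical Python sketch; block boundaries, sizes, and names are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    min_ts: int                                  # smallest timestamp in block
    max_ts: int                                  # largest timestamp in block
    points: list = field(default_factory=list)   # [(ts, value)] pairs

def scan_range(blocks, start, end):
    """Yield points in [start, end], skipping whole blocks whose
    [min_ts, max_ts] range cannot overlap the query window."""
    for b in blocks:
        if b.max_ts < start or b.min_ts > end:
            continue                             # pruned: never decompressed
        for ts, val in b.points:
            if start <= ts <= end:
                yield ts, val

# 24 one-hour blocks of minutely samples; a "last 5 minutes" query
# only ever touches the newest block.
blocks = [Block(h * 3600, h * 3600 + 3599,
                [(h * 3600 + s, 72.4) for s in range(0, 3600, 60)])
          for h in range(24)]
hits = list(scan_range(blocks, 24 * 3600 - 300, 24 * 3600))
print(f"{len(hits)} points read; 23 of 24 blocks skipped")
```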
Write Performance Comparison
| Database | Storage Engine | Write Throughput (single node) | Compression Ratio |
|---|---|---|---|
| InfluxDB OSS | TSM (LSM variant) | ~500K points/sec | ~10–15× |
| TimescaleDB | PostgreSQL heap + chunks | ~150K rows/sec (batched COPY) | ~5–10× (with compression) |
| Prometheus | Custom TSDB blocks | ~1M samples/sec (scrape) | ~12–16× |
| QuestDB | Column-based append-only | ~1.5M rows/sec | ~8–12× |
| ClickHouse | MergeTree (LSM variant) | ~1M+ rows/sec | ~10–20× |
| PostgreSQL (vanilla) | B-tree heap | ~30K rows/sec | ~2–3× (TOAST) |
Use Cases
IoT Sensor Monitoring
The canonical time-series use case. A smart building with 10,000 sensors (temperature, humidity, light, motion, CO2) reporting every 5 seconds generates 172.8 million data points per day. Requirements:
- Ingestion: Sustained 2,000 points/sec across the fleet (10,000 sensors reporting every 5 seconds), with gateways batching writes in 1-second windows.
- Queries: "Current temperature on floor 3." "Average humidity trend over the past week." "Which sensors have anomalous readings?"
- Retention: Raw data for 30 days (forensic analysis), hourly aggregates for 2 years (trend analysis), daily aggregates forever (compliance).
- Best fit: InfluxDB or TimescaleDB. TimescaleDB if you need joins with sensor metadata; InfluxDB for simpler deployments.
Application & Infrastructure Metrics
Monitoring a 5,000-pod Kubernetes cluster where each pod exports 200 metrics at 15-second intervals: 1 million active series and roughly 4 million samples per minute (about 67,000 samples/second).
- Metrics types: Counters (request_count), gauges (memory_usage), histograms (request_latency), summaries (gc_duration).
- Alerting: "Fire if error rate > 1% for 5 minutes." "Page if p99 latency > 500ms."
- Dashboards: Real-time Grafana dashboards showing request rate, error rate, latency percentiles, resource utilization.
- Best fit: Prometheus + Thanos/Mimir for long-term storage. Prometheus handles the scraping and short-term storage; Thanos/Mimir provides global querying and durable storage in object stores (S3/GCS).
Financial Market Data
A trading platform ingesting tick data for 10,000 instruments across multiple exchanges — potentially 10 million ticks per second during peak hours. Requirements are extreme:
- Latency: Sub-millisecond write latency. Queries must return in < 10ms for real-time trading screens.
- Precision: Nanosecond timestamps. Prices must preserve decimal precision (no floating-point approximation).
- Historical analysis: Backtest trading strategies over 10+ years of tick data (tens of terabytes).
- Best fit: QuestDB (optimized for financial data with nanosecond precision), KDB+/q (industry standard for quantitative finance), or TimescaleDB (if SQL and joins with order/position tables are needed).
Log Analytics
While full-text log search is typically handled by Elasticsearch/OpenSearch, structured log metrics are a perfect time-series use case:
- Structured metrics from logs: Request count, error count, bytes transferred, response time — extracted from access logs and stored as time-series data.
- Volume: A 1,000-server web tier generating 50,000 requests/sec produces 4.3 billion log entries per day. Storing full logs is expensive; extracting metrics and storing as time-series data is 100–1000× cheaper.
- Queries: "What was the error rate between 2 PM and 3 PM?" "Show request volume by endpoint over the last 24 hours."
- Best fit: ClickHouse (excellent for log analytics with SQL), InfluxDB (for pure metrics extracted from logs), or a combination of Loki (log aggregation) + Prometheus (metrics).
Choosing a Time-Series Database
| Criteria | InfluxDB | TimescaleDB | Prometheus |
|---|---|---|---|
| Query language | InfluxQL, Flux | Full SQL (PostgreSQL) | PromQL |
| Joins | Limited (Flux) | Full SQL joins | Label matching only |
| Write model | Push (HTTP/UDP) | Push (SQL INSERT) | Pull (scrape) |
| Schema | Schema-on-write | Schema-on-write (DDL) | Schema-free labels |
| Compression | Gorilla + TSM | PG native + columnar | Gorilla variant |
| Ecosystem | Telegraf, Kapacitor | PostgreSQL extensions | Grafana, Alertmanager |
| Clustering | Enterprise only | Multi-node (paid) | Thanos / Mimir / Cortex |
| Best for | IoT, general TSDB | SQL users, relational joins | Kubernetes metrics |
Summary
Time-series databases exist because general-purpose databases buckle under the unique demands of timestamp-ordered, append-only, high-throughput data. The key takeaways:
- Append-only, time-ordered data enables write-optimized storage (LSM variants) and domain-specific compression (Gorilla encoding) that achieve 10–80× compression ratios.
- InfluxDB is the go-to purpose-built TSDB with the TSM engine, line protocol, and InfluxQL/Flux query languages. Best for IoT and general time-series workloads.
- TimescaleDB extends PostgreSQL with hypertables and chunks, giving you full SQL power (joins, continuous aggregates, stored procedures) with time-series performance.
- Prometheus is the standard for Kubernetes metrics monitoring — pull-based, PromQL, and paired with Grafana. Not a general-purpose TSDB.
- Downsampling and retention policies are essential for managing storage costs. Raw data for days, hourly aggregates for months, daily aggregates for years.
- High cardinality (too many unique tag/label combinations) is the #1 performance killer across all TSDBs. Design your labeling schema carefully.
In the next post, we'll explore Search Engines — how Elasticsearch and Solr handle full-text search, inverted indexes, and distributed document retrieval at scale.