Logging, Monitoring & Observability
You can build the most elegant microservice architecture in the world, but if you can't see what's happening inside it, you're flying blind. Observability is what turns a black-box production system into a transparent, debuggable, and trustworthy one. In this post we'll cover the three pillars of observability—logs, metrics, and traces—then dive deep into the tooling, patterns, and practices that make modern systems observable.
1. Three Pillars of Observability
Observability is the ability to understand the internal state of a system by examining its external outputs. Those outputs fall into three categories:
| Pillar | What It Is | Example |
|---|---|---|
| Logs | Discrete, timestamped events | ERROR: Payment failed for order #1234 |
| Metrics | Numeric measurements aggregated over time | http_requests_total{status="500"} = 42 |
| Traces | End-to-end request path across services | API Gateway → Auth → User → DB (total: 230ms) |
Why You Need All Three
Each pillar answers different questions:
- Metrics tell you something is wrong — "p99 latency spiked to 3 seconds."
- Logs tell you what went wrong — "Connection pool exhausted on db-replica-3."
- Traces tell you where it went wrong — "The User Service call took 2.8s of the 3s total."
Metrics detect. Logs explain. Traces pinpoint. An incident investigation typically moves through all three: a metric alert fires, you search logs for the timeframe, then pull the trace for a specific slow request.
2. Structured Logging
JSON Logs vs Text Logs
Traditional text logs are human-readable but machine-hostile:
2026-04-15 10:23:45 ERROR PaymentService - Payment failed for order 1234, user 5678, amount $99.99
Structured JSON logs are both human-readable and queryable:
{
"timestamp": "2026-04-15T10:23:45.123Z",
"level": "ERROR",
"service": "payment-service",
"message": "Payment failed",
"order_id": "1234",
"user_id": "5678",
"amount": 99.99,
"currency": "USD",
"error": "card_declined",
"trace_id": "abc123def456",
"span_id": "span-789",
"correlation_id": "req-00ff-11aa"
}
With structured logs you can query: level:ERROR AND service:payment-service AND amount:>50 in Elasticsearch/Kibana. With text logs, you're stuck with regex.
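Emitting logs in this shape doesn't require heavy tooling. Below is a minimal sketch using only Python's standard logging module; in practice you'd likely reach for structlog or python-json-logger, and the service name and fields here are purely illustrative.
# Minimal structured-logging sketch (standard library only; illustrative fields)
import json, logging, sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",           # illustrative
            "message": record.getMessage(),
            **getattr(record, "fields", {}),        # structured key/value pairs
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Pass fields via extra= so they stay queryable instead of being baked into the message
logger.error("Payment failed", extra={"fields": {"order_id": "1234", "error": "card_declined"}})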
Log Levels
| Level | When to Use | Production Default |
|---|---|---|
| DEBUG | Detailed diagnostic info. Variable values, SQL queries, cache hits. | Off |
| INFO | Normal operations. "Server started on port 8080", "Order created". | On |
| WARN | Unexpected but recoverable. Retry succeeded, fallback used. | On |
| ERROR | Operation failed. Unhandled exception, service unavailable. | On |
| FATAL | System cannot continue. Out of memory, config missing. | On |
Correlation IDs
A correlation ID (or request ID) is a unique identifier that follows a request across every service it touches. When a user reports "my checkout failed," you search for their correlation ID and instantly see every log line from every service involved in that request.
// Express.js middleware to generate/propagate correlation IDs
const { v4: uuid } = require('uuid');
app.use((req, res, next) => {
req.correlationId = req.headers['x-correlation-id'] || uuid();
res.setHeader('x-correlation-id', req.correlationId);
next();
});
// Attach to every log
logger.info('Processing order', {
correlation_id: req.correlationId,
order_id: order.id
});
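For comparison, here is a rough Python/Flask equivalent that also forwards the ID on outgoing calls so downstream services log the same correlation ID. The route and downstream URL are illustrative.
# Hypothetical Flask service: accept or mint a correlation ID, echo it back, propagate it downstream
import uuid
import requests
from flask import Flask, Response, g, jsonify, request

app = Flask(__name__)

@app.before_request
def assign_correlation_id():
    g.correlation_id = request.headers.get("x-correlation-id", str(uuid.uuid4()))

@app.after_request
def echo_correlation_id(response: Response) -> Response:
    response.headers["x-correlation-id"] = g.correlation_id
    return response

@app.route("/api/checkout", methods=["POST"])             # illustrative route
def checkout():
    # Forward the same header so the downstream service's logs share the ID
    stock = requests.get(
        "http://inventory-service/api/stock",              # illustrative URL
        headers={"x-correlation-id": g.correlation_id},
        timeout=2,
    )
    return jsonify({"correlation_id": g.correlation_id, "in_stock": stock.json()})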
Log Aggregation Pipeline
In a distributed system with hundreds of containers, you need a centralized logging pipeline. The industry-standard architecture:
The pipeline stages:
- Application — writes structured JSON logs to stdout/file.
- Log Agent (Fluentd, Filebeat, Fluent Bit) — tails log files, enriches with metadata (pod name, node, timestamp), and forwards.
- Transport (Kafka, Redis) — buffers logs to handle traffic spikes. Decouples producers from consumers. Provides durability and replay.
- Storage & Indexing (Elasticsearch, Loki, ClickHouse) — indexes logs for full-text search and filtering.
- Visualization (Kibana, Grafana) — dashboards, saved queries, alerting on log patterns.
The ELK Stack Deep Dive
ELK stands for Elasticsearch + Logstash + Kibana. In modern setups, a lightweight shipper such as Filebeat or Fluentd handles collection, with Logstash retained (if at all) for parsing and enrichment, as in the pipeline below:
# Filebeat configuration (filebeat.yml)
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log

processors:
  - add_kubernetes_metadata: ~
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true

output.kafka:
  hosts: ["kafka-1:9092", "kafka-2:9092"]
  topic: "application-logs"
  partition.round_robin:
    reachable_only: true
  required_acks: 1
# Logstash pipeline (consuming from Kafka, writing to Elasticsearch)
input {
kafka {
bootstrap_servers => "kafka-1:9092,kafka-2:9092"
topics => ["application-logs"]
group_id => "logstash-consumers"
codec => json
}
}
filter {
if [level] == "ERROR" {
mutate { add_tag => ["error_alert"] }
}
date {
match => [ "timestamp", "ISO8601" ]
target => "@timestamp"
}
geoip {
source => "client_ip"
}
}
output {
elasticsearch {
hosts => ["https://es-node-1:9200"]
index => "app-logs-%{+YYYY.MM.dd}"
user => "elastic"
password => "${ES_PASSWORD}"
}
}
3. Metrics & Monitoring
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value. Only goes up (or resets to 0). | http_requests_total |
| Gauge | Value that goes up and down. Snapshot of current state. | temperature_celsius, queue_depth |
| Histogram | Samples observations into configurable buckets. Calculates quantiles server-side. | http_request_duration_seconds |
| Summary | Like histogram but calculates quantiles client-side. Cannot be aggregated across instances. | rpc_duration_seconds{quantile="0.99"} |
Prometheus: Pull-Based Monitoring
Prometheus is the de facto standard for metrics in cloud-native systems. Unlike push-based systems (StatsD, Graphite), Prometheus pulls (scrapes) metrics from your services over HTTP.
# prometheus.yml - scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'api-gateway'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  - job_name: 'payment-service'
    static_configs:
      - targets: ['payment-svc:8080']
    metrics_path: /metrics
Instrumenting Your Application
# Python - Prometheus client library
# pip install prometheus_client flask
from flask import Flask, request, jsonify
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)

# Counter: total requests
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram: request latency with custom buckets
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Gauge: active connections
ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

# Usage in request handler (process_order is application code defined elsewhere)
@app.route('/api/orders', methods=['POST'])
def create_order():
    ACTIVE_CONNECTIONS.inc()
    with REQUEST_LATENCY.labels(method='POST', endpoint='/api/orders').time():
        try:
            result = process_order(request.json)
            REQUEST_COUNT.labels(method='POST', endpoint='/api/orders', status='200').inc()
            return jsonify(result), 200
        except Exception:
            REQUEST_COUNT.labels(method='POST', endpoint='/api/orders', status='500').inc()
            raise
        finally:
            ACTIVE_CONNECTIONS.dec()

# Expose /metrics endpoint on port 8000
start_http_server(8000)
PromQL: Prometheus Query Language
PromQL is what makes Prometheus powerful. Here are essential queries every engineer should know:
# Request rate (requests per second) over the last 5 minutes
rate(http_requests_total[5m])
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# p99 latency from histogram
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# p99 latency per endpoint
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
# Requests per second by status code
sum by (status) (rate(http_requests_total[5m]))
# Top 5 endpoints by error count
topk(5, sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m])))
# Predict disk full in 4 hours (linear regression)
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
# CPU usage percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Grafana Dashboard JSON Snippet
{
"dashboard": {
"title": "API Service Overview",
"panels": [
{
"title": "Request Rate (req/s)",
"type": "timeseries",
"targets": [{
"expr": "sum(rate(http_requests_total{service=\"api-gateway\"}[5m]))",
"legendFormat": "Total RPS"
}],
"fieldConfig": {
"defaults": { "unit": "reqps", "color": { "mode": "palette-classic" } }
}
},
{
"title": "Error Rate (%)",
"type": "stat",
"targets": [{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
"legendFormat": "Error %"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]
}
}
}
},
{
"title": "p50 / p95 / p99 Latency",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "p99"
}
],
"fieldConfig": {
"defaults": { "unit": "s" }
}
}
]
}
}
The RED Method
For request-driven services (APIs, web servers), monitor these three signals:
- Rate — requests per second: sum(rate(http_requests_total[5m]))
- Errors — failed requests per second: sum(rate(http_requests_total{status=~"5.."}[5m]))
- Duration — latency distribution: histogram_quantile(0.99, ...)
The USE Method
For resource-oriented systems (CPU, memory, disk, network), monitor:
- Utilization — how busy is the resource? (CPU%, memory%, disk I/O%)
- Saturation — how much extra work is queued? (run queue length, swap usage)
- Errors — how many errors are occurring? (disk errors, network packet drops)
The Four Golden Signals (Google SRE)
Google's SRE book defines four golden signals for any user-facing system:
- Latency — time to serve a request (distinguish successful vs failed requests)
- Traffic — demand on the system (HTTP requests/sec, transactions/sec)
- Errors — rate of failed requests (explicit 5xx, implicit timeouts, wrong results)
- Saturation — how "full" the system is (memory, CPU, I/O, queue depth)
4. Distributed Tracing
In a monolith, a stack trace tells you everything. In microservices, a single user request might touch 10+ services. Distributed tracing follows that request across every hop.
Core Concepts
- Trace — the entire journey of a request. A trace is a directed acyclic graph of spans.
- Span — a single unit of work (e.g., one service call, one DB query). Has a start time, duration, and metadata (tags, logs).
- Parent-child relationship — if Service A calls Service B, the span in A is the parent of the span in B.
- Trace ID — a globally unique ID shared by all spans in a trace.
- Span ID — a unique identifier for each span. Each span also records its parent's span ID, which is how the trace tree is reconstructed.
For example, a single request might flow through API Gateway → Auth Service → User Service → Database, with each hop contributing a span (and its latency) to the trace tree.
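To make these relationships concrete, here is a minimal sketch using the OpenTelemetry Python SDK (introduced properly below): the parent and child spans share one trace ID, and the child records its parent's span ID.
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("trace-concepts-demo")

with tracer.start_as_current_span("parent") as parent:
    with tracer.start_as_current_span("child") as child:
        p_ctx, c_ctx = parent.get_span_context(), child.get_span_context()
        print(format(p_ctx.trace_id, "032x"))            # shared 128-bit trace ID
        print(format(p_ctx.span_id, "016x"))             # parent's span ID
        print(format(c_ctx.span_id, "016x"))             # child's own span ID
        print(p_ctx.trace_id == c_ctx.trace_id)          # True: same trace
        print(child.parent.span_id == p_ctx.span_id)     # True: child points at its parent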
W3C Trace Context Headers
The W3C Trace Context standard defines how trace information propagates across service boundaries via HTTP headers:
# Incoming request headers
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
# Format: version-traceId-parentSpanId-traceFlags
# 00 = version
# 0af765... = 128-bit trace ID
# b7ad6b... = 64-bit parent span ID
# 01 = trace flags (01 = sampled)
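In real services the OpenTelemetry propagators parse and inject these headers for you, but a hand-rolled sketch makes the format concrete. The helper names below are hypothetical.
# Parse an incoming traceparent and build the one to send on the next outgoing call
import re
import secrets

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None                                 # malformed: start a new trace
    version, trace_id, parent_span_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_span_id,
            "sampled": int(flags, 16) & 0x01 == 1}

def outgoing_traceparent(incoming: dict | None) -> str:
    trace_id = incoming["trace_id"] if incoming else secrets.token_hex(16)
    span_id = secrets.token_hex(8)                  # this hop's span becomes the next parent
    flags = "01" if (incoming is None or incoming["sampled"]) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(outgoing_traceparent(ctx))                    # same trace ID, fresh span ID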
OpenTelemetry Instrumentation
OpenTelemetry (OTel) is the CNCF standard that unifies metrics, logs, and traces. Here's how to instrument a Python service:
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
# pip install opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests
import requests
from flask import Flask, request, jsonify

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure the tracer
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "1.2.0",
    "deployment.environment": "production"
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Auto-instrument Flask and outgoing HTTP requests
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# validate and save_order are application code defined elsewhere
@app.route('/api/orders', methods=['POST'])
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", request.json['customer_id'])

        # Child span for validation
        with tracer.start_as_current_span("validate_order"):
            validate(request.json)

        # Child span for the database write
        with tracer.start_as_current_span("write_to_database") as db_span:
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.statement", "INSERT INTO orders ...")
            order = save_order(request.json)

        # Child span for notification (calls another service)
        with tracer.start_as_current_span("send_notification"):
            requests.post("http://notification-service/api/notify",
                          json={"order_id": order.id})

        span.set_attribute("order.id", order.id)
        return jsonify(order.to_dict()), 201
Go Instrumentation with OpenTelemetry
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("user-service"),
            semconv.ServiceVersionKey.String("2.0.1"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

// GetUser assumes db.QueryUser and the User type are defined elsewhere in the application.
func GetUser(ctx context.Context, userID string) (*User, error) {
    tracer := otel.Tracer("user-service")
    ctx, span := tracer.Start(ctx, "GetUser")
    defer span.End()
    span.SetAttributes(attribute.String("user.id", userID))

    // The DB call creates a child span (when the driver is instrumented)
    user, err := db.QueryUser(ctx, userID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }
    return user, nil
}
Tracing Tools: Jaeger vs Zipkin
| Feature | Jaeger | Zipkin |
|---|---|---|
| Origin | Uber (CNCF graduated) | Twitter |
| Storage backends | Cassandra, Elasticsearch, Kafka, Badger | Cassandra, Elasticsearch, MySQL |
| Adaptive sampling | Yes (built-in) | Limited |
| Architecture | Agent + Collector + Query + UI | Single binary or distributed |
| OTel compatibility | Native (recommended backend) | Via OTLP exporter |
Sampling Strategies
Tracing every request at high scale is expensive. Sampling reduces volume while maintaining visibility:
| Strategy | How It Works | Pros/Cons |
|---|---|---|
| Head-based | Decision made at the start of the request. "Sample 10% of all requests." | Simple, but interesting (slow/error) traces may be dropped. |
| Tail-based | Collect all spans, decide after the trace completes. Keep slow/error traces, drop normal ones. | Captures interesting traces. Requires buffering all spans temporarily. |
| Adaptive/Dynamic | Adjusts sample rate based on traffic. Low traffic → 100%, high traffic → 1%. | Best of both worlds. Built into Jaeger. |
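Head-based sampling is usually configured in the SDK at the trace root; the sketch below shows one way to do it with the OpenTelemetry Python SDK (the 10% ratio is illustrative). Tail-based sampling, by contrast, lives in the collector, as in the configuration that follows.
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of traces, decided once at the root span; ParentBased makes every
# child span follow the root's decision so sampled traces stay complete.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))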
# OpenTelemetry Collector configuration for tail-based sampling
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Always keep slow traces (>2s)
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 2000 }
      # Sample 10% of remaining traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp:
    endpoint: "jaeger-collector:4317"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
5. Alerting
Alert Fatigue: The Silent Killer
The number one problem with alerting isn't missing alerts—it's too many. Alert fatigue occurs when engineers receive so many alerts that they start ignoring them. This is how real incidents get missed.
Rules for healthy alerting:
- Every alert must be actionable. If an engineer can't do anything about it, it's not an alert—it's a log line.
- Alert on symptoms, not causes. Alert on "p99 latency > 2s" (user-facing impact), not "CPU > 80%" (may be fine).
- Set appropriate thresholds. Use burn-rate alerts (see SLO section) instead of static thresholds.
- Classify severity. P1 (page immediately), P2 (page during business hours), P3 (ticket, no page).
Prometheus Alerting Rules
# alerts.yml
groups:
  - name: api-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      # High p99 latency
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p99 latency above 2s for 10 minutes"
          description: "Current p99: {{ $value | humanizeDuration }}"

      # Disk space prediction
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "Disk predicted to fill within 4 hours"

      # Pod crash looping
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 15m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
Alertmanager Routing
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
        team: backend
      receiver: 'slack-backend'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          dashboard: 'https://grafana.internal/d/api-overview'
  - name: 'slack-backend'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#backend-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'default-slack'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#ops-alerts'
Runbooks and Escalation
Every alert should have a runbook—a step-by-step guide for the on-call engineer:
- What is this alert? Explain the metric, threshold, and what it means.
- What is the impact? Which users/features are affected?
- How to diagnose? Links to dashboards, log queries, relevant traces.
- How to mitigate? Immediate steps (restart, rollback, scale up, failover).
- How to resolve? Root cause fix (deploy patch, increase capacity, fix config).
- Escalation path. Who to page if you can't resolve it in 30 minutes.
Escalation policies typically follow a chain: on-call engineer → team lead → engineering manager → VP Engineering. Tools like PagerDuty and OpsGenie automate this with configurable timeouts at each level.
6. SLIs, SLOs, and SLAs
Definitions
| Concept | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service quality. The metric you actually measure. | Proportion of requests completing under 300ms. Availability = successful requests / total requests. |
| SLO (Service Level Objective) | A target value for an SLI. The internal goal. | 99.9% of requests should complete under 300ms. 99.95% availability over 30 days. |
| SLA (Service Level Agreement) | A contractual commitment to customers. Includes consequences for violation (refunds, credits). | "99.9% uptime guaranteed. Below that, customers get 10% credit." |
Error Budgets
An error budget is the inverse of an SLO. If your SLO is 99.9% availability, your error budget is 0.1%—you're "allowed" to be unavailable for 0.1% of the time.
# Error budget calculation for 99.9% availability SLO
# Over a 30-day window:
Total minutes in 30 days = 30 × 24 × 60 = 43,200 minutes
Error budget = 0.1% × 43,200 = 43.2 minutes of allowed downtime
# If you've used 30 minutes of downtime this month:
Remaining budget = 43.2 - 30 = 13.2 minutes
Budget consumed = 30 / 43.2 = 69.4%
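The same arithmetic as a small helper, sticking to the numbers above (the function name is just for illustration):
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

budget = error_budget_minutes(0.999)                 # 43.2 minutes for 99.9% over 30 days
used = 30.0                                          # minutes of downtime so far
print(f"remaining: {budget - used:.1f} min")         # 13.2 min
print(f"consumed:  {used / budget * 100:.1f}%")      # 69.4%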
How teams use error budgets:
- Budget remaining? Ship features, experiment, deploy frequently.
- Budget nearly exhausted? Freeze feature launches. Focus on reliability work.
- Budget burned? Stop all deployments. Fix the root causes that consumed the budget.
SLO Burn-Rate Alerts
Instead of alerting on "error rate > 5%," alert on how fast you're consuming your error budget. This produces fewer, more meaningful alerts:
# Multi-window, multi-burn-rate SLO alerts
# SLO: 99.9% availability (error budget: 0.1%)
groups:
  - name: slo-alerts
    rules:
      # Fast burn: consuming budget 14.4x faster than allowed
      # Will exhaust the 30-day budget in ~2 days
      - alert: SLO_HighBurnRate_Critical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "High burn rate: error budget will exhaust in ~2 days"

      # Slow burn: consuming budget 3x faster than allowed
      # Will exhaust the 30-day budget in ~10 days
      - alert: SLO_SlowBurnRate_Warning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (3 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "Slow burn rate: error budget will exhaust in ~10 days"
How Google and Netflix Use SLOs
Google: Every service has an SLO. Teams with remaining error budget can deploy freely. Teams with exhausted budgets must improve reliability before shipping features. This creates a natural tension between velocity and reliability.
Netflix: Uses a concept called "Perform" for SLO-based alerting. They track SLIs per device type (iOS, Android, Smart TV, browser) because latency expectations differ dramatically across platforms. A 200ms response is fine for a browser but slow for an API call between microservices.
7. Observability in System Design Interviews
When to Mention Observability
In a system design interview, mentioning observability demonstrates operational maturity. Bring it up:
- After the core design — once you've laid out the architecture, say "Let me talk about how we'd monitor this system."
- When discussing reliability — SLOs and error budgets show you think about failure as a budget, not just a boolean.
- When discussing microservices — "With 10 services, we need distributed tracing to debug cross-service issues."
- When asked about debugging — "Correlation IDs let us trace a user's request through every service."
How to Incorporate Monitoring in Architecture
When drawing your architecture diagram, add these components:
- Metrics sidecars or libraries in each service box, exposing /metrics.
- A Prometheus/Grafana stack for metrics collection and dashboards.
- A centralized logging pipeline (ELK or Loki) with log agents on each node.
- An OpenTelemetry collector for trace aggregation.
- Alertmanager + PagerDuty for incident notification.
Example Interview Snippet
"For observability, I'd instrument each service with OpenTelemetry
to emit traces, metrics, and structured JSON logs.
Metrics: Prometheus scrapes /metrics endpoints every 15s.
I'd track the RED method — rate, errors, duration — per service.
Grafana dashboards show real-time health.
Logs: Each pod runs a Fluent Bit sidecar that ships logs to
Kafka, then into Elasticsearch. We query via Kibana.
All logs include a correlation_id for request tracing.
Traces: OpenTelemetry SDK with W3C Trace Context propagation.
Tail-based sampling at the OTel Collector — keep 100% of
error/slow traces, sample 5% of normal traffic.
Jaeger for trace visualization.
SLOs: 99.9% availability, p99 latency under 500ms.
That gives us a 43-minute error budget per month.
Burn-rate alerts notify PagerDuty if we're consuming
the budget faster than sustainable."
Summary
- Three pillars: Logs (events), Metrics (numbers), Traces (request paths). You need all three.
- Structured logging: JSON format, log levels, correlation IDs, ELK stack for aggregation.
- Metrics: Counters, gauges, histograms. Prometheus pull model, PromQL, Grafana dashboards.
- Monitoring methods: RED (request-driven), USE (resource-driven), Golden Signals (both).
- Distributed tracing: Spans, traces, OpenTelemetry instrumentation, W3C Trace Context headers.
- Sampling: Head-based (simple), tail-based (smart), adaptive (best of both).
- Alerting: Alert on symptoms, not causes. Runbooks for every alert. Avoid alert fatigue.
- SLIs/SLOs/SLAs: Measure (SLI), set targets (SLO), promise customers (SLA). Error budgets balance velocity and reliability.
- Interviews: Mention observability after the core design. Use SLOs, correlation IDs, and tracing to show operational maturity.