High Level Design Series · Building Blocks · Part 10

Logging, Monitoring & Observability

You can build the most elegant microservice architecture in the world, but if you can't see what's happening inside it, you're flying blind. Observability is what turns a black-box production system into a transparent, debuggable, and trustworthy one. In this post we'll cover the three pillars of observability—logs, metrics, and traces—then dive deep into the tooling, patterns, and practices that make modern systems observable.

1. Three Pillars of Observability

Observability is the ability to understand the internal state of a system by examining its external outputs. Those outputs fall into three categories:

Pillar | What It Is | Example
Logs | Discrete, timestamped events | ERROR: Payment failed for order #1234
Metrics | Numeric measurements aggregated over time | http_requests_total{status="500"} = 42
Traces | End-to-end request path across services | API Gateway → Auth → User → DB (total: 230ms)

Why You Need All Three

Each pillar answers a different question: metrics detect that something is wrong, logs explain what happened, and traces pinpoint where it happened. An incident investigation typically moves through all three: a metric alert fires, you search the logs for that timeframe, then pull the trace for a specific slow request.

Observability ≠ Monitoring. Monitoring is asking pre-defined questions ("Is CPU above 80%?"). Observability is the ability to ask arbitrary new questions about your system without deploying new code. High-cardinality, high-dimensionality data is the key differentiator.

2. Structured Logging

JSON Logs vs Text Logs

Traditional text logs are human-readable but machine-hostile:

2026-04-15 10:23:45 ERROR PaymentService - Payment failed for order 1234, user 5678, amount $99.99

Structured JSON logs are both human-readable and queryable:

{
  "timestamp": "2026-04-15T10:23:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment failed",
  "order_id": "1234",
  "user_id": "5678",
  "amount": 99.99,
  "currency": "USD",
  "error": "card_declined",
  "trace_id": "abc123def456",
  "span_id": "span-789",
  "correlation_id": "req-00ff-11aa"
}

With structured logs you can query: level:ERROR AND service:payment-service AND amount:>50 in Elasticsearch/Kibana. With text logs, you're stuck with regex.
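
How do you produce logs like this? Most stacks have a library for it (structlog, python-json-logger, Logback JSON encoders). A minimal sketch using only Python's standard logging module; the field names simply mirror the example above and are not a fixed schema:

# Python - emit one JSON object per log line
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",          # normally injected from config/env
            "message": record.getMessage(),
        }
        # merge structured fields passed via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment failed",
             extra={"fields": {"order_id": "1234", "error": "card_declined"}})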

Log Levels

Level | When to Use | Production Default
DEBUG | Detailed diagnostic info: variable values, SQL queries, cache hits. | Off
INFO | Normal operations: "Server started on port 8080", "Order created". | On
WARN | Unexpected but recoverable: retry succeeded, fallback used. | On
ERROR | Operation failed: unhandled exception, service unavailable. | On
FATAL | System cannot continue: out of memory, config missing. | On

Tip: Enable dynamic log-level changes in production. Tools like Spring Boot Actuator or custom feature flags let you switch from INFO to DEBUG for a specific service without redeployment—invaluable for debugging live issues.
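
In Python the same trick takes a few lines: a small admin endpoint (hypothetical, and something you would put behind auth) that flips the logger level at runtime:

# Python/Flask - hypothetical admin endpoint for dynamic log levels
import logging
from flask import Flask, request

app = Flask(__name__)
logger = logging.getLogger("payment-service")

@app.route('/admin/log-level', methods=['PUT'])
def set_log_level():
    level = request.json.get('level', 'INFO').upper()     # e.g. {"level": "DEBUG"}
    logger.setLevel(getattr(logging, level, logging.INFO))
    return {'service': 'payment-service', 'level': level}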

Correlation IDs

A correlation ID (or request ID) is a unique identifier that follows a request across every service it touches. When a user reports "my checkout failed," you search for their correlation ID and instantly see every log line from every service involved in that request.

// Express.js middleware to generate/propagate correlation IDs
const { v4: uuid } = require('uuid');

app.use((req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || uuid();
  res.setHeader('x-correlation-id', req.correlationId);
  next();
});

// Attach to every log
logger.info('Processing order', {
  correlation_id: req.correlationId,
  order_id: order.id
});
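
The ID is only useful if it is also forwarded on every outbound call. The same pattern sketched in Python (header name kept as above; the downstream URL is a placeholder):

# Python/Flask - accept, store, and propagate the correlation ID
import uuid
import requests
from flask import Flask, request, g

app = Flask(__name__)

@app.before_request
def assign_correlation_id():
    # reuse the caller's ID if present, otherwise mint a new one
    g.correlation_id = request.headers.get('x-correlation-id', str(uuid.uuid4()))

def call_downstream(payload):
    # forward the same header so downstream services log the same ID
    return requests.post("http://inventory-service/api/reserve",   # placeholder URL
                         json=payload,
                         headers={'x-correlation-id': g.correlation_id})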

Log Aggregation Pipeline

In a distributed system with hundreds of containers, you need a centralized logging pipeline. The industry-standard architecture:

The pipeline stages:

  1. Application — writes structured JSON logs to stdout/file.
  2. Log Agent (Fluentd, Filebeat, Fluent Bit) — tails log files, enriches with metadata (pod name, node, timestamp), and forwards.
  3. Transport (Kafka, Redis) — buffers logs to handle traffic spikes. Decouples producers from consumers. Provides durability and replay.
  4. Storage & Indexing (Elasticsearch, Loki, ClickHouse) — indexes logs for full-text search and filtering.
  5. Visualization (Kibana, Grafana) — dashboards, saved queries, alerting on log patterns.

The ELK Stack Deep Dive

ELK stands for Elasticsearch + Logstash + Kibana. In modern setups a lightweight shipper such as Filebeat or Fluent Bit handles collection, with Logstash reserved for parsing and enrichment:

# Filebeat configuration (filebeat.yml)
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_kubernetes_metadata: ~
      - decode_json_fields:
          fields: ["message"]
          target: ""
          overwrite_keys: true

output.kafka:
  hosts: ["kafka-1:9092", "kafka-2:9092"]
  topic: "application-logs"
  partition.round_robin:
    reachable_only: true
  required_acks: 1

# Logstash pipeline (consuming from Kafka, writing to Elasticsearch)
input {
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092"
    topics => ["application-logs"]
    group_id => "logstash-consumers"
    codec => json
  }
}

filter {
  if [level] == "ERROR" {
    mutate { add_tag => ["error_alert"] }
  }
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }
  geoip {
    source => "client_ip"
  }
}

output {
  elasticsearch {
    hosts => ["https://es-node-1:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ES_PASSWORD}"
  }
}

Log retention strategy: Hot-warm-cold architecture in Elasticsearch. Keep 7 days on fast SSDs (hot), 30 days on HDDs (warm), archive to S3/GCS (cold). Use Index Lifecycle Management (ILM) policies to automate rollovers.
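
An ILM policy is just a JSON document you PUT to Elasticsearch. A rough sketch via the REST API; the phase ages, policy name, and credentials are illustrative (and a real cold phase would snapshot to S3/GCS rather than delete):

# Python - create an ILM policy through the Elasticsearch REST API
import requests

ilm_policy = {
    "policy": {
        "phases": {
            "hot":    {"actions": {"rollover": {"max_age": "1d", "max_size": "50gb"}}},
            "warm":   {"min_age": "7d",  "actions": {"allocate": {"require": {"data": "warm"}}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put("https://es-node-1:9200/_ilm/policy/app-logs-policy",
                    json=ilm_policy,
                    auth=("elastic", "changeme"))      # use real secrets handling
resp.raise_for_status()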

3. Metrics & Monitoring

Metric Types

Type | Description | Example
Counter | Monotonically increasing value; only goes up (or resets to 0). | http_requests_total
Gauge | Value that goes up and down; a snapshot of current state. | temperature_celsius, queue_depth
Histogram | Samples observations into configurable buckets; quantiles are calculated server-side at query time. | http_request_duration_seconds
Summary | Like a histogram but calculates quantiles client-side; cannot be aggregated across instances. | rpc_duration_seconds{quantile="0.99"}

Prometheus: Pull-Based Monitoring

Prometheus is the de facto standard for metrics in cloud-native systems. Unlike push-based systems (StatsD, Graphite), Prometheus pulls (scrapes) metrics from your services over HTTP.

# prometheus.yml - scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'api-gateway'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  - job_name: 'payment-service'
    static_configs:
      - targets: ['payment-svc:8080']
    metrics_path: /metrics

Instrumenting Your Application

# Python - Prometheus client library (Flask used for the example handler)
from flask import Flask, request, jsonify
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)

# Counter: total requests
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram: request latency with custom buckets
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Gauge: active connections
ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

# Usage in request handler
@app.route('/api/orders', methods=['POST'])
def create_order():
    ACTIVE_CONNECTIONS.inc()
    with REQUEST_LATENCY.labels(method='POST', endpoint='/api/orders').time():
        try:
            result = process_order(request.json)
            REQUEST_COUNT.labels(method='POST', endpoint='/api/orders', status='200').inc()
            return jsonify(result), 200
        except Exception as e:
            REQUEST_COUNT.labels(method='POST', endpoint='/api/orders', status='500').inc()
            raise
        finally:
            ACTIVE_CONNECTIONS.dec()

# Expose /metrics endpoint on port 8000
start_http_server(8000)

PromQL: Prometheus Query Language

PromQL is what makes Prometheus powerful. Here are essential queries every engineer should know:

# Request rate (requests per second) over the last 5 minutes
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# p99 latency from histogram
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# p99 latency per endpoint
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

# Requests per second by status code
sum by (status) (rate(http_requests_total[5m]))

# Top 5 endpoints by error count
topk(5, sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m])))

# Predict disk full in 4 hours (linear regression)
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0

# CPU usage percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
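
The same expressions can be evaluated programmatically through Prometheus's HTTP API (GET /api/v1/query), which is handy for scripts and runbook automation. A small sketch; the Prometheus address is a placeholder:

# Python - run an instant PromQL query against the Prometheus HTTP API
import requests

PROMETHEUS = "http://prometheus:9090"    # placeholder address

def instant_query(expr: str):
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# current error-rate percentage across all services
for series in instant_query(
        'sum(rate(http_requests_total{status=~"5.."}[5m])) '
        '/ sum(rate(http_requests_total[5m])) * 100'):
    ts, value = series["value"]
    print(series["metric"], f"{float(value):.2f}%")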

Grafana Dashboard JSON Snippet

{
  "dashboard": {
    "title": "API Service Overview",
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "timeseries",
        "targets": [{
          "expr": "sum(rate(http_requests_total{service=\"api-gateway\"}[5m]))",
          "legendFormat": "Total RPS"
        }],
        "fieldConfig": {
          "defaults": { "unit": "reqps", "color": { "mode": "palette-classic" } }
        }
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
          "legendFormat": "Error %"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            }
          }
        }
      },
      {
        "title": "p50 / p95 / p99 Latency",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "p99"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "s" }
        }
      }
    ]
  }
}

The RED Method

For request-driven services (APIs, web servers), monitor these three signals: Rate (requests per second), Errors (the number or fraction of requests that fail), and Duration (the distribution of request latency, not just the average).

The USE Method

For resource-oriented systems (CPU, memory, disk, network), monitor: Utilization (how busy the resource is), Saturation (how much extra work is queued or waiting), and Errors (the count of error events, such as disk I/O failures or dropped packets).
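
In practice node_exporter collects these for you; the sketch below just makes the categories concrete by exporting host-level gauges with psutil and the Prometheus client (metric names are illustrative):

# Python - USE-style host gauges (utilization and a saturation proxy; errors come from counters)
import time
import psutil
from prometheus_client import Gauge, start_http_server

CPU_UTIL = Gauge('host_cpu_utilization_percent', 'CPU utilization')        # Utilization
MEM_UTIL = Gauge('host_memory_utilization_percent', 'Memory utilization')  # Utilization
LOAD_1M  = Gauge('host_load_1m', '1-minute load average')                  # Saturation proxy

if __name__ == '__main__':
    start_http_server(9101)          # scrape target for Prometheus
    while True:
        CPU_UTIL.set(psutil.cpu_percent(interval=None))
        MEM_UTIL.set(psutil.virtual_memory().percent)
        LOAD_1M.set(psutil.getloadavg()[0])
        time.sleep(15)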

The Four Golden Signals (Google SRE)

Google's SRE book defines four golden signals for any user-facing system:

  1. Latency — time to serve a request (distinguish successful vs failed requests)
  2. Traffic — demand on the system (HTTP requests/sec, transactions/sec)
  3. Errors — rate of failed requests (explicit 5xx, implicit timeouts, wrong results)
  4. Saturation — how "full" the system is (memory, CPU, I/O, queue depth)

RED vs USE vs Golden Signals: Use RED for microservices (request-driven). Use USE for infrastructure (resource-driven). Golden Signals are a superset that covers both. In practice, instrument your services with RED and your infrastructure with USE.

4. Distributed Tracing

In a monolith, a stack trace tells you everything. In microservices, a single user request might touch 10+ services. Distributed tracing follows that request across every hop.

Core Concepts

A trace represents a single request end to end and is identified by a trace ID. It is built from spans: each span is one named, timed operation (an HTTP handler, a downstream call, a database query) with attributes, and it records the ID of its parent span, so the spans form a tree. Because every span carries the same trace ID, a request that flows through API Gateway → Auth Service → User Service → Database shows up as one connected trace with per-hop latencies.

W3C Trace Context Headers

The W3C Trace Context standard defines how trace information propagates across service boundaries via HTTP headers:

# Incoming request headers
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE

# Format: version-traceId-parentSpanId-traceFlags
# 00          = version
# 0af765...   = 128-bit trace ID
# b7ad6b...   = 64-bit parent span ID
# 01          = trace flags (01 = sampled)
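
A toy sketch of how a service might read the incoming header and mint one for the next hop; in practice the OpenTelemetry SDK handles this for you:

# Python - parse an incoming traceparent and build one for the outgoing call
import secrets

def parse_traceparent(header: str):
    version, trace_id, parent_span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id,
            "sampled": flags == "01"}

def next_traceparent(trace_id: str, sampled: bool) -> str:
    span_id = secrets.token_hex(8)           # fresh 64-bit span ID for this hop
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
outgoing = next_traceparent(ctx["trace_id"], ctx["sampled"])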

OpenTelemetry Instrumentation

OpenTelemetry (OTel) is the CNCF standard that unifies metrics, logs, and traces. Here's how to instrument a Python service:

# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
# pip install opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests

from flask import Flask, request, jsonify
import requests

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure the tracer
resource = Resource.create({
    "service.name": "order-service",
    "service.version": "1.2.0",
    "deployment.environment": "production"
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Auto-instrument Flask and outgoing HTTP requests
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route('/api/orders', methods=['POST'])
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", request.json['customer_id'])

        # Child span for validation
        with tracer.start_as_current_span("validate_order"):
            validate(request.json)

        # Child span for database write
        with tracer.start_as_current_span("write_to_database") as db_span:
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.statement", "INSERT INTO orders ...")
            order = save_order(request.json)

        # Child span for notification (calls another service)
        with tracer.start_as_current_span("send_notification"):
            requests.post("http://notification-service/api/notify",
                         json={"order_id": order.id})

        span.set_attribute("order.id", order.id)
        return jsonify(order.to_dict()), 201
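
To tie traces back to logs (the trace_id and span_id fields in the JSON log example earlier), you can read the IDs off the current span context. A small helper, assuming the tracer setup above:

# Python - expose the active trace/span IDs for log correlation
from opentelemetry import trace

def current_trace_fields():
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {"trace_id": format(ctx.trace_id, "032x"),   # 128-bit hex
            "span_id": format(ctx.span_id, "016x")}     # 64-bit hex

# e.g. logger.info("Processing order", extra={"fields": current_trace_fields()})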

Go Instrumentation with OpenTelemetry

package main

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil { return nil, err }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("user-service"),
            semconv.ServiceVersionKey.String("2.0.1"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

func GetUser(ctx context.Context, userID string) (*User, error) {
    tracer := otel.Tracer("user-service")
    ctx, span := tracer.Start(ctx, "GetUser")
    defer span.End()

    span.SetAttributes(attribute.String("user.id", userID))

    // DB call creates a child span
    user, err := db.QueryUser(ctx, userID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }
    return user, nil
}

Tracing Tools: Jaeger vs Zipkin

Feature | Jaeger | Zipkin
Origin | Uber (CNCF graduated) | Twitter
Storage backends | Cassandra, Elasticsearch, Kafka, Badger | Cassandra, Elasticsearch, MySQL
Adaptive sampling | Yes (built-in) | Limited
Architecture | Agent + Collector + Query + UI | Single binary or distributed
OTel compatibility | Native (recommended backend) | Via OTLP exporter

Sampling Strategies

Tracing every request at high scale is expensive. Sampling reduces volume while maintaining visibility:

Strategy | How It Works | Pros/Cons
Head-based | Decision made at the start of the request: "sample 10% of all requests." | Simple, but interesting (slow/error) traces may be dropped.
Tail-based | Collect all spans, decide after the trace completes. Keep slow/error traces, drop normal ones. | Captures interesting traces; requires buffering all spans temporarily.
Adaptive/Dynamic | Adjusts sample rate based on traffic: low traffic → 100%, high traffic → 1%. | Best of both worlds. Built into Jaeger.
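
Head-based sampling is usually configured in the application SDK at startup. A minimal sketch with the OpenTelemetry Python SDK, assuming a 10% ratio:

# Python - head-based sampling: the root span decides, children follow the parent
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))   # keep ~10% of new traces
provider = TracerProvider(sampler=sampler)

Tail-based sampling, by contrast, is decided centrally in the OpenTelemetry Collector after all spans for a trace have arrived:
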
# OpenTelemetry Collector configuration for tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      # Always keep slow traces (>2s)
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 2000 }
      # Sample 10% of remaining traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp:
    endpoint: "jaeger-collector:4317"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]

5. Alerting

Alert Fatigue: The Silent Killer

The number one problem with alerting isn't missing alerts—it's too many. Alert fatigue occurs when engineers receive so many alerts that they start ignoring them. This is how real incidents get missed.

Rules for healthy alerting:

  1. Every alert must be actionable. If the on-call engineer cannot do anything about it, it should not page.
  2. Alert on symptoms (user-facing error rate, latency, SLO burn), not on every underlying cause.
  3. Every alert carries a severity and links to a runbook.
  4. Review firing history regularly and delete or tune alerts that get ignored.

Prometheus Alerting Rules

# alerts.yml
groups:
  - name: api-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

      # High p99 latency
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p99 latency above 2s for 10 minutes"
          description: "Current p99: {{ $value | humanizeDuration }}"

      # Disk space prediction
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "Disk predicted to fill within 4 hours"

      # Pod crash looping
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 15m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

Alertmanager Routing

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
        team: backend
      receiver: 'slack-backend'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          dashboard: 'https://grafana.internal/d/api-overview'
  - name: 'slack-backend'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#backend-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'default-slack'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#ops-alerts'

Runbooks and Escalation

Every alert should have a runbook—a step-by-step guide for the on-call engineer:

  1. What is this alert? Explain the metric, threshold, and what it means.
  2. What is the impact? Which users/features are affected?
  3. How to diagnose? Links to dashboards, log queries, relevant traces.
  4. How to mitigate? Immediate steps (restart, rollback, scale up, failover).
  5. How to resolve? Root cause fix (deploy patch, increase capacity, fix config).
  6. Escalation path. Who to page if you can't resolve it in 30 minutes.

Escalation policies typically follow a chain: on-call engineer → team lead → engineering manager → VP Engineering. Tools like PagerDuty and OpsGenie automate this with configurable timeouts at each level.

6. SLIs, SLOs, and SLAs

Definitions

Concept | Definition | Example
SLI (Service Level Indicator) | A quantitative measure of service quality; the metric you actually measure. | Proportion of requests completing under 300ms. Availability = successful requests / total requests.
SLO (Service Level Objective) | A target value for an SLI; the internal goal. | 99.9% of requests should complete under 300ms. 99.95% availability over 30 days.
SLA (Service Level Agreement) | A contractual commitment to customers; includes consequences for violation (refunds, credits). | "99.9% uptime guaranteed. Below that, customers get 10% credit."

Key insight: SLO should be stricter than SLA. If your SLA promises 99.9% availability, set your SLO at 99.95%. This gives you a buffer—you'll detect SLO violations and fix them before breaching the SLA contract.

Error Budgets

An error budget is the inverse of an SLO. If your SLO is 99.9% availability, your error budget is 0.1%—you're "allowed" to be unavailable for 0.1% of the time.

# Error budget calculation for 99.9% availability SLO
# Over a 30-day window:

Total minutes in 30 days = 30 × 24 × 60 = 43,200 minutes
Error budget = 0.1% × 43,200 = 43.2 minutes of allowed downtime

# If you've used 30 minutes of downtime this month:
Remaining budget = 43.2 - 30 = 13.2 minutes
Budget consumed = 30 / 43.2 = 69.4%
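
The same arithmetic as a small helper (the window, SLO, and downtime are parameters; the values mirror the example above):

# Python - error budget for an availability SLO
def error_budget(slo: float, window_days: int = 30, downtime_minutes: float = 0.0):
    total_minutes = window_days * 24 * 60
    budget = (1 - slo) * total_minutes                 # allowed downtime
    remaining = budget - downtime_minutes
    consumed = downtime_minutes / budget
    return budget, remaining, consumed

budget, remaining, consumed = error_budget(slo=0.999, downtime_minutes=30)
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min, consumed={consumed:.1%}")
# budget=43.2 min, remaining=13.2 min, consumed=69.4%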

How teams use error budgets:

  1. Budget remaining → ship features, run experiments, deploy freely.
  2. Budget exhausted → freeze risky releases and spend engineering time on reliability until the budget recovers.
  3. Budget burn rate drives alerting (covered next) and informs how aggressive deploys, migrations, and chaos experiments can be.

SLO Burn-Rate Alerts

Instead of alerting on "error rate > 5%," alert on how fast you're consuming your error budget. This produces fewer, more meaningful alerts:

# Multi-window, multi-burn-rate SLO alerts
# SLO: 99.9% availability (error budget: 0.1%)

groups:
  - name: slo-alerts
    rules:
      # Fast burn: consuming budget 14.4x faster than allowed
      # Will exhaust 30-day budget in 2 days
      - alert: SLO_HighBurnRate_Critical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "High burn rate: error budget will exhaust in ~2 days"

      # Slow burn: consuming budget 3x faster than allowed
      # Will exhaust 30-day budget in 10 days
      - alert: SLO_SlowBurnRate_Warning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (3 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "Slow burn rate: error budget will exhaust in ~10 days"

How Google and Netflix Use SLOs

Google: Every service has an SLO. Teams with remaining error budget can deploy freely. Teams with exhausted budgets must improve reliability before shipping features. This creates a natural tension between velocity and reliability.

Netflix: Uses a concept called "Perform" for SLO-based alerting. They track SLIs per device type (iOS, Android, Smart TV, browser) because latency expectations differ dramatically across platforms. A 200ms response is fine for a browser but slow for an API call between microservices.

7. Observability in System Design Interviews

When to Mention Observability

In a system design interview, mentioning observability demonstrates operational maturity. Bring it up:

  1. After the core design is settled, as a short "how do we operate this?" pass over the diagram.
  2. When discussing failure handling: how failures are detected, debugged, and paged on.
  3. When the interviewer asks about non-functional requirements, SLAs, or running the system at scale.

How to Incorporate Monitoring in Architecture

When drawing your architecture diagram, add these components:

  1. Metrics sidecars or libraries in each service box, exposing /metrics.
  2. A Prometheus/Grafana stack for metrics collection and dashboards.
  3. A centralized logging pipeline (ELK or Loki) with log agents on each node.
  4. An OpenTelemetry collector for trace aggregation.
  5. Alertmanager + PagerDuty for incident notification.

Example Interview Snippet

"For observability, I'd instrument each service with OpenTelemetry
to emit traces, metrics, and structured JSON logs.

Metrics: Prometheus scrapes /metrics endpoints every 15s.
I'd track the RED method — rate, errors, duration — per service.
Grafana dashboards show real-time health.

Logs: Each pod runs a Fluent Bit sidecar that ships logs to
Kafka, then into Elasticsearch. We query via Kibana.
All logs include a correlation_id for request tracing.

Traces: OpenTelemetry SDK with W3C Trace Context propagation.
Tail-based sampling at the OTel Collector — keep 100% of
error/slow traces, sample 5% of normal traffic.
Jaeger for trace visualization.

SLOs: 99.9% availability, p99 latency under 500ms.
That gives us a 43-minute error budget per month.
Burn-rate alerts notify PagerDuty if we're consuming
the budget faster than sustainable."

Summary