Message Queues & Event Streaming
Why Message Queues?
In any distributed system, services need to communicate. The simplest approach is synchronous request-response: Service A calls Service B, waits for a response, then continues. This works until it doesn't—Service B goes down, gets overloaded, or is simply too slow. Now Service A is stuck, and the failure cascades.
Message queues solve this by introducing an intermediary buffer between producers (senders) and consumers (receivers). Instead of calling Service B directly, Service A drops a message into a queue and moves on. Service B picks it up when it's ready.
The Postal Service Analogy
Think of synchronous communication as a phone call: both parties must be available at the same time. A message queue is like the postal service: you drop a letter in the mailbox and go about your day. The postal service handles delivery, buffering (holding mail at the post office if the recipient isn't home), and retry (redelivery attempts). The sender and receiver are completely decoupled in time and space.
Core Benefits
- Decoupling: Producers and consumers evolve independently. The producer doesn't need to know who consumes its messages, how many consumers there are, or what technology they use.
- Asynchronous Processing: The producer doesn't wait for the consumer. A web server can accept an order, enqueue a message, and return a 202 Accepted in milliseconds, while order processing takes minutes.
- Traffic Buffering (Load Leveling): During a flash sale, traffic might spike 100x. Without a queue, your downstream services must handle 100x load or drop requests. With a queue, the spike is absorbed into the buffer, and consumers process at their own pace.
- Resilience & Fault Tolerance: If a consumer crashes, messages remain in the queue. When the consumer restarts, it picks up where it left off. No data is lost.
- Fan-out: A single message can trigger processing in multiple independent services (notifications, analytics, inventory, etc.).
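The decoupling and async-processing benefits above can be seen with a toy in-process sketch using Python's standard `queue` module. The handler name and message shape are illustrative; a real system would use a broker, not an in-memory queue:

```python
import queue
import threading

# Toy in-process "broker": the producer returns immediately after
# enqueueing; a worker drains the buffer at its own pace.
order_queue = queue.Queue()
processed = []

def handle_order_request(order_id):
    """Hypothetical web handler: enqueue and respond without waiting."""
    order_queue.put({"order_id": order_id})
    return "202 Accepted"

def consume():
    while True:
        msg = order_queue.get()
        if msg is None:                    # sentinel: stop the worker
            break
        processed.append(msg["order_id"])  # slow work would happen here

worker = threading.Thread(target=consume)
worker.start()

responses = [handle_order_request(f"order-{i}") for i in range(3)]
order_queue.put(None)                      # signal shutdown
worker.join()

print(responses)   # ['202 Accepted', '202 Accepted', '202 Accepted']
print(processed)   # ['order-0', 'order-1', 'order-2']
```

The producer never waits on the consumer: it enqueues and returns, and the messages survive in the buffer until the worker gets to them.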
Synchronous vs Asynchronous Communication
| Aspect | Synchronous (HTTP/gRPC) | Asynchronous (Queue) |
|---|---|---|
| Coupling | Tight—caller blocks until response | Loose—fire and forget |
| Latency | Dependent on downstream service | Producer returns immediately |
| Failure Handling | Cascading failures | Messages buffered; consumer retries |
| Scaling | Both sides must scale together | Producer and consumer scale independently |
| Use Case | Read queries, real-time responses | Background jobs, event processing, notifications |
Point-to-Point vs Pub/Sub
There are two fundamental messaging patterns, and understanding when to use each is critical for system design.
Point-to-Point (Queue)
In the point-to-point model, a message is produced by one producer and consumed by exactly one consumer. Once a consumer acknowledges a message, it's removed from the queue. If multiple consumers listen on the same queue, the broker distributes messages across them for load balancing (competing consumers pattern).
Producer --> [Queue] --> Consumer A (gets M1, M3, M5)
--> Consumer B (gets M2, M4, M6)
Use cases: task distribution (send emails, process images, run batch jobs), work queues where each task must be processed exactly once.
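The competing-consumers behavior can be sketched with two workers pulling from one shared queue. This is a minimal in-process model, not a broker client; consumer names and messages are made up:

```python
import queue
import threading

work_queue = queue.Queue()
for i in range(1, 7):
    work_queue.put(f"M{i}")

received = {"A": [], "B": []}

def worker(name):
    # Each get() removes the message for everyone, so every task is
    # processed by exactly one of the competing consumers.
    while True:
        try:
            msg = work_queue.get_nowait()
        except queue.Empty:
            return
        received[name].append(msg)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_msgs = sorted(received["A"] + received["B"])
print(all_msgs)   # ['M1', 'M2', 'M3', 'M4', 'M5', 'M6'] (each exactly once)
```

Which worker gets which message depends on timing, but the union is always all six messages with no duplicates, which is exactly the point-to-point guarantee.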
Publish/Subscribe (Topic)
In the pub/sub model, a message is published to a topic, and every subscriber to that topic receives a copy of the message. The publisher doesn't know or care how many subscribers exist.
Publisher --> [Topic: order-created]
|--> Notification Service (sends email)
|--> Inventory Service (reserves stock)
|--> Analytics Service (records metrics)
Use cases: event-driven architectures, broadcasting state changes, fan-out to multiple downstream services.
Fan-out Pattern
Fan-out combines pub/sub with point-to-point. A message is published to a topic, which fans out to multiple queues, each with its own competing consumers. This gives you both broadcast delivery and independent scaling per consumer group.
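A minimal sketch of the topic side of this pattern: every subscribed queue receives the published message independently, so one service consuming it does not remove it for the others. The `Topic` class and service names are illustrative:

```python
import queue

class Topic:
    """Minimal pub/sub topic: publish() puts the message on every
    subscribed queue, so each consumer group sees every event."""
    def __init__(self):
        self.queues = {}

    def subscribe(self, name):
        q = queue.Queue()
        self.queues[name] = q
        return q

    def publish(self, message):
        for q in self.queues.values():   # fan-out to every bound queue
            q.put(message)

topic = Topic()
notifications = topic.subscribe("notification-service")
inventory = topic.subscribe("inventory-service")

topic.publish({"event": "order-created", "order_id": 42})

# Both subscribers receive the event; neither "steals" it from the other.
n_msg = notifications.get()
i_msg = inventory.get()
print(n_msg, i_msg)
```

Attach multiple competing consumers to each subscriber queue and you have the full fan-out pattern: broadcast at the topic, load balancing within each queue.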
Comparison
| Feature | Point-to-Point | Pub/Sub |
|---|---|---|
| Consumers per message | Exactly one | All subscribers |
| Message lifetime | Deleted after ack | Retained per retention policy |
| Use case | Task distribution, work queues | Event broadcasting, fan-out |
| Coupling | Producer knows queue exists | Publisher knows only the topic |
| Examples | SQS, RabbitMQ queues | SNS, Kafka topics, RabbitMQ fanout exchange |
Apache Kafka Deep Dive
Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day. Unlike traditional message queues, Kafka is designed as a distributed commit log—messages are persisted to disk and can be replayed.
Core Concepts
- Topic: A named category/feed of messages. Analogous to a database table. Examples: `order-events`, `user-clicks`, `payment-transactions`.
- Partition: Each topic is split into one or more partitions, the unit of parallelism. Each partition is an ordered, immutable sequence of records, each assigned a sequential offset.
- Broker: A Kafka server. A cluster typically has 3–100+ brokers. Each partition has one leader broker and N-1 follower replicas.
- Producer: Publishes records to topics. Chooses which partition to write to (by key hash, round-robin, or custom partitioner).
- Consumer Group: A set of consumers that cooperatively consume a topic. Each partition is assigned to exactly one consumer in the group, enabling parallel consumption.
- Offset: A unique sequential ID for each record within a partition. Consumers track their position via offsets. This enables replay, rewind, and independent progress per consumer group.
- ZooKeeper / KRaft: ZooKeeper historically handled cluster metadata, leader election, and configuration management. KRaft mode (Kafka Raft), introduced in Kafka 2.8 and production-ready since 3.3, removes the ZooKeeper dependency.
Write Path
Producer
|
|--> Serialize record (key, value, timestamp, headers)
|--> Determine partition (hash(key) % num_partitions, or round-robin)
|--> Send to partition leader broker
|
Leader Broker
|--> Append to partition log segment on disk (sequential I/O)
|--> Replicate to follower brokers (in-sync replicas / ISR)
|--> Once min.insync.replicas have acknowledged → ack to producer
|
Producer receives ack (based on acks setting):
acks=0 → fire and forget (no ack waited)
acks=1 → leader wrote it (fast, slight risk)
acks=all → all ISR replicas wrote it (safest)
Read Path
Consumer (in a Consumer Group)
|
|--> Assigned partitions by Group Coordinator
|--> Fetches records starting from committed offset
|--> Processes batch of records
|--> Commits offset (auto-commit or manual)
|
Offset Storage: __consumer_offsets internal topic
|--> Tracks (group, topic, partition) → offset
|--> Enables restart from last committed position
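The "assigned partitions by Group Coordinator" step can be sketched as a range-style assignment. This is a rough simplification of what Kafka's `RangeAssignor` does; the real assignors also handle multiple topics, rebalances, and stickiness:

```python
def range_assign(partitions, consumers):
    """Sketch of range-style assignment: split the partition list into
    contiguous chunks, one per consumer; the remainder goes to the
    first consumers in sorted order."""
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(sorted(consumers)):
        count = per + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment

# 6 partitions across 2 consumers: each owns 3; no partition is shared.
print(range_assign(list(range(6)), ["consumer-a", "consumer-b"]))
# {'consumer-a': [0, 1, 2], 'consumer-b': [3, 4, 5]}

# 5 partitions across 3 consumers: the first two absorb the extras.
print(range_assign(list(range(5)), ["c1", "c2", "c3"]))
# {'c1': [0, 1], 'c2': [2, 3], 'c3': [4]}
```

The key invariant is visible in the output: every partition belongs to exactly one consumer in the group, which is why adding consumers beyond the partition count gains nothing.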
Ordering Guarantees
Kafka guarantees ordering within a single partition only. If you need all events for a given entity (e.g., `user_id=42`) to stay in order, use that entity's ID as the message key: the default partitioner hashes the key (murmur2) modulo the partition count, so all records with the same key land on the same partition.
Using `user_id` as the key keeps each user's events ordered. A random key maximizes throughput but destroys per-entity ordering. Hot keys (e.g., a single celebrity's `user_id` receiving 90% of traffic) create partition skew.
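The same-key-same-partition property only needs a stable hash. A sketch using `hashlib` (Kafka's actual partitioner uses murmur2, but any stable hash demonstrates the idea; `partition_for` is a made-up name):

```python
import hashlib

def partition_for(key, num_partitions):
    """Stable key-based partitioning: hash the key, take it modulo the
    partition count. Same key always maps to the same partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed by user-42 lands on one partition, preserving order.
p = partition_for("user-42", 12)
assert all(partition_for("user-42", 12) == p for _ in range(100))

# Different keys spread across the 12 partitions (not guaranteed to
# differ for any particular pair, but well-distributed in aggregate).
print({k: partition_for(k, 12) for k in ["user-1", "user-2", "user-42"]})
```

Note that Python's built-in `hash()` would not work here: it is salted per process, so two producer instances would disagree on partition placement.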
Retention Policies
Unlike traditional queues, Kafka doesn't delete messages after consumption. Messages are retained based on:
- Time-based: `retention.ms=604800000` (7 days default). Messages older than this are eligible for deletion.
- Size-based: `retention.bytes=1073741824` (1 GB). When a partition exceeds this, the oldest segments are deleted.
- Compaction: `cleanup.policy=compact`. Kafka keeps only the latest record for each key. Perfect for changelogs, state snapshots, and event sourcing. Example: a topic tracking user profiles keeps only the latest profile per `user_id`.
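Compaction semantics can be sketched in a few lines: scan the log, remember the latest record per key, and keep survivors in log order. The keys and values here are made up:

```python
def compact(log):
    """Sketch of log compaction: keep only the latest record per key,
    preserving the order in which surviving records appear in the log."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later offsets overwrite earlier
    return [(key, value) for key, (offset, value)
            in sorted(latest.items(), key=lambda kv: kv[1][0])]

log = [
    ("user-1", {"name": "Ada"}),
    ("user-2", {"name": "Bob"}),
    ("user-1", {"name": "Ada Lovelace"}),   # supersedes the first record
]
print(compact(log))
# [('user-2', {'name': 'Bob'}), ('user-1', {'name': 'Ada Lovelace'})]
```

Real compaction runs segment by segment in the background and also handles tombstones (null values that delete a key), but the invariant is the same: one latest value per key.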
Performance Characteristics
Kafka achieves extraordinary throughput through several design choices:
- Sequential I/O: Appending to the end of a log is an O(1), purely sequential disk operation. Sequential writes are far faster than random ones; Kafka's design documentation notes that sequential disk access can even rival random memory access in some benchmarks.
- Zero-copy transfer: Uses the `sendfile()` system call to transfer data directly from the disk page cache to the network socket, bypassing user space.
- Batching: Producers batch multiple records into a single network request. Consumers fetch in batches.
- Compression: LZ4, Snappy, ZSTD, or GZIP compression at the batch level reduces network I/O.
- Partitioning: Horizontal scaling—more partitions means more parallelism.
A single Kafka broker can handle hundreds of MB/s of writes and reads. LinkedIn's Kafka clusters process 7+ trillion messages per day.
RabbitMQ
RabbitMQ is a traditional message broker implementing the Advanced Message Queuing Protocol (AMQP). Unlike Kafka's log-based approach, RabbitMQ is a smart broker / dumb consumer model: the broker manages message routing, delivery, and acknowledgment.
AMQP Architecture
RabbitMQ introduces the concept of exchanges—routers that sit between producers and queues:
Producer → Exchange → Binding → Queue → Consumer
↑ ↑
routing logic binding key/pattern
Exchange Types
| Exchange | Routing Logic | Example Use Case |
|---|---|---|
| Direct | Message routing key must exactly match the binding key | Route payment.success to payment queue |
| Topic | Routing key matches a pattern with * (one word) and # (zero or more words) | order.*.created matches order.us.created |
| Fanout | Ignores routing key; broadcasts to all bound queues | Broadcast order events to all services |
| Headers | Routes based on message header attributes (key-value pairs) instead of routing key | Route by content-type, priority, or custom headers |
Key Mechanisms
- Acknowledgments (ack/nack): A consumer sends an `ack` after processing a message. If the consumer crashes before acking, RabbitMQ redelivers the message to another consumer. Use `basic.nack` with `requeue=true` to reject and requeue, or `requeue=false` to send to a dead letter exchange.
- Prefetch Count: Controls how many unacknowledged messages the broker sends to a consumer. `prefetch=1` means the consumer gets one message at a time (fair dispatch but slower). `prefetch=100` improves throughput but risks memory pressure.
- Dead Letter Exchanges (DLX): When a message is rejected, expires (TTL), or the queue exceeds its max length, it's routed to a configured DLX for inspection or retry.
- Publisher Confirms: The broker acknowledges receipt of a published message, ensuring the producer knows the message reached the queue.
- Durable Queues & Persistent Messages: Queues survive broker restarts when declared `durable`. Messages survive when published with `delivery_mode=2` (persistent).
RabbitMQ Configuration Example
# Python with pika library
import pika
connection = pika.BlockingConnection(
pika.ConnectionParameters('rabbitmq-host', 5672,
credentials=pika.PlainCredentials('user', 'pass'))
)
channel = connection.channel()
# Declare durable exchange and queue
channel.exchange_declare(
exchange='orders',
exchange_type='topic',
durable=True
)
channel.queue_declare(
queue='notification-service',
durable=True,
arguments={
'x-dead-letter-exchange': 'orders-dlx',
'x-message-ttl': 300000, # 5 min TTL
'x-max-length': 100000 # max 100k messages
}
)
channel.queue_bind(
queue='notification-service',
exchange='orders',
routing_key='order.*.created' # topic pattern
)
# Consumer with manual ack
channel.basic_qos(prefetch_count=10)
def callback(ch, method, properties, body):
try:
process_order(body)
ch.basic_ack(delivery_tag=method.delivery_tag)
except Exception:
ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
channel.basic_consume(
    queue='notification-service',
    on_message_callback=callback
)
channel.start_consuming()  # blocks, dispatching deliveries to callback
When to Use RabbitMQ vs Kafka
| Criteria | RabbitMQ | Kafka |
|---|---|---|
| Model | Smart broker, dumb consumer | Dumb broker, smart consumer |
| Throughput | ~50k msg/s per node | ~1M+ msg/s per broker |
| Retention | Messages deleted after ack | Retained for configurable period |
| Replay | Not supported | Full replay by resetting offset |
| Routing | Rich (exchanges, bindings, patterns) | Simple (topics, partitions) |
| Protocol | AMQP, STOMP, MQTT | Custom binary protocol |
| Best for | Complex routing, task queues, RPC | Event streaming, log aggregation, high throughput |
AWS SQS & SNS
AWS provides fully managed messaging services that eliminate the operational overhead of running Kafka or RabbitMQ clusters yourself.
Amazon SQS (Simple Queue Service)
SQS is a fully managed, serverless message queue. You don't provision brokers, manage partitions, or worry about replication—AWS handles all of it.
Standard vs FIFO Queues
| Feature | Standard | FIFO |
|---|---|---|
| Throughput | Nearly unlimited | 300 msg/s (3,000 with batching; more with high-throughput mode) |
| Ordering | Best-effort (may be out of order) | Strict FIFO within message group |
| Delivery | At-least-once (may duplicate) | Exactly-once processing |
| Queue name | Any name | Must end in .fifo |
Key SQS Concepts
- Visibility Timeout: When a consumer receives a message, it becomes invisible to other consumers for the timeout period (default 30s). If the consumer doesn't delete the message in time, it reappears in the queue for another consumer. Set this to slightly longer than your max processing time.
- Long Polling: Instead of repeatedly asking "any messages?" (short polling), long polling waits up to 20 seconds for a message to arrive. Reduces empty responses and API costs by up to 90%.
- Dead Letter Queue (DLQ): After N failed processing attempts (`maxReceiveCount`), messages are moved to a DLQ for inspection. Critical for debugging poisoned messages.
- Message Retention: 1 minute to 14 days (default 4 days).
- Message Size: Up to 256 KB. For larger payloads, store in S3 and send a reference.
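Visibility timeout is easiest to understand by simulating it. A toy in-process model (the `VisibilityQueue` class is invented for illustration; real SQS is an API, not a library):

```python
class VisibilityQueue:
    """Toy model of SQS visibility timeout: a received message is hidden
    from other consumers until the timeout elapses or it is deleted."""
    def __init__(self, visibility_timeout):
        self.timeout = visibility_timeout
        self.messages = {}          # id -> (body, invisible_until)
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = (body, 0.0)
        self.next_id += 1

    def receive(self, now):
        for msg_id, (body, invisible_until) in self.messages.items():
            if now >= invisible_until:
                # Hide the message for the timeout period.
                self.messages[msg_id] = (body, now + self.timeout)
                return msg_id, body
        return None                 # nothing visible right now

    def delete(self, msg_id):       # "ack": processing finished
        self.messages.pop(msg_id, None)

q = VisibilityQueue(visibility_timeout=30.0)
q.send("process order 42")

msg_id, body = q.receive(now=0.0)
assert q.receive(now=10.0) is None     # hidden while being processed
redelivered = q.receive(now=31.0)      # timeout passed without delete:
assert redelivered is not None         # the message reappears
q.delete(redelivered[0])
assert q.receive(now=60.0) is None     # deleted for good
```

This is also why the timeout should exceed your worst-case processing time: finish too slowly and a second consumer receives the same message, which is one source of SQS's at-least-once duplicates.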
# AWS CLI: Create FIFO queue with DLQ
aws sqs create-queue \
--queue-name orders.fifo \
--attributes '{
"FifoQueue": "true",
"ContentBasedDeduplication": "true",
"VisibilityTimeout": "60",
"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789:orders-dlq.fifo\",\"maxReceiveCount\":\"3\"}"
}'
Amazon SNS (Simple Notification Service)
SNS is a fully managed pub/sub service. You publish a message to a topic, and SNS delivers it to all subscribed endpoints: SQS queues, Lambda functions, HTTP endpoints, email, SMS, or mobile push.
SNS + SQS: The Fan-out Pattern
The most common production pattern combines SNS and SQS. An event is published to an SNS topic, which fans out to multiple SQS queues. Each queue has its own consumer(s) that process independently and at their own pace.
Order Service
|
|--> Publish to SNS topic "order-events"
|
|--> SQS: notification-queue --> Notification Service
|--> SQS: inventory-queue --> Inventory Service
|--> SQS: analytics-queue --> Analytics Service
|--> Lambda: fraud-check --> Fraud Detection
This pattern gives you the best of both worlds: broadcast delivery (SNS) with independent, reliable, buffered processing (SQS).
Delivery Guarantees
One of the most important concepts in messaging systems is the delivery guarantee. There are three levels, each with different trade-offs:
At-Most-Once
Fire and forget. The producer sends the message and doesn't wait for an ack. The message may be lost if the broker or network fails. No retries.
// Kafka: acks=0 → at-most-once
props.put("acks", "0"); // don't wait for broker ack
props.put("retries", 0); // no retries
Use case: Metrics collection, logging, clickstream where losing a few events is acceptable.
At-Least-Once
The producer retries until it receives an acknowledgment. The message is guaranteed to be delivered, but may be delivered multiple times (duplicates). This is the most common guarantee in production systems.
// Kafka: acks=all + retries → at-least-once
props.put("acks", "all");
props.put("retries", Integer.MAX_VALUE);
props.put("max.in.flight.requests.per.connection", "5");
Challenge: Consumers must be idempotent—processing the same message twice must produce the same result.
Exactly-Once
The hardest guarantee. Each message is delivered and processed exactly once, with no duplicates and no losses. Kafka achieves this through idempotent producers and transactions:
// Kafka exactly-once semantics (EOS)
props.put("enable.idempotence", "true"); // dedup at broker
props.put("transactional.id", "order-processor"); // enable transactions
// Producer transaction
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(new ProducerRecord<>("output-topic", key, value));
// Commit consumer offsets AND producer records atomically
producer.sendOffsetsToTransaction(offsets, consumerGroupId);
producer.commitTransaction();
} catch (Exception e) {
producer.abortTransaction();
}
Idempotent Consumer Strategies
- Deduplication by ID: Store processed message IDs in a database. Before processing, check if the ID exists. Use a unique `message_id` or `idempotency_key`.
- Database upserts: Use `INSERT ... ON CONFLICT DO UPDATE` instead of plain inserts. Reprocessing the same record just overwrites with the same data.
- Idempotent operations: "Set balance to $100" is idempotent. "Add $10 to balance" is not. Design your events as state assertions, not deltas.
- Conditional writes: Use optimistic locking. "Update if version = 5" will fail on duplicate processing if version is already 6.
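The dedup-by-ID strategy can be sketched in a few lines. The `message_id` field name and the in-memory set are stand-ins; production systems store seen IDs in a database or cache and make the check-and-record step atomic (e.g., via a unique constraint):

```python
processed_ids = set()   # stand-in for a DB table with a unique index
side_effects = []

def handle(message):
    """Skip messages whose idempotency key has already been seen."""
    msg_id = message["message_id"]
    if msg_id in processed_ids:
        return "skipped (duplicate)"
    side_effects.append(message["action"])   # the non-idempotent work
    processed_ids.add(msg_id)
    return "processed"

# At-least-once delivery redelivers the same message...
print(handle({"message_id": "m-1", "action": "charge $10"}))  # processed
print(handle({"message_id": "m-1", "action": "charge $10"}))  # skipped (duplicate)
print(side_effects)   # ['charge $10']: the charge happened exactly once
```

Note the race lurking here: with concurrent consumers, two workers could pass the `in` check before either records the ID, which is why the real check must be a single atomic write.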
Backpressure
Backpressure occurs when consumers can't keep up with the rate of incoming messages. The queue grows unboundedly, consuming memory and disk until the system fails. Handling backpressure is critical for production systems.
Symptoms
- Queue depth (lag) growing continuously
- Consumer processing latency increasing
- Memory/disk usage on broker climbing
- Message age in queue exceeding SLAs
Strategies
- Scale consumers horizontally: Add more consumer instances. In Kafka, add consumers to the group (up to the number of partitions). In SQS, spin up more Lambda functions or ECS tasks.
- Rate-limit producers: If consumers can't keep up, slow down the producers. Use token bucket or leaky bucket rate limiters. This pushes backpressure upstream.
- Increase batch size: Consumers that process messages in batches (e.g., bulk database inserts) are orders of magnitude faster than one-at-a-time processing.
- Buffer management: Set maximum queue lengths. When the queue is full, either reject new messages (fail-fast), drop oldest messages, or apply backpressure to the producer.
- Circuit breaker: If the consumer's downstream dependency (e.g., database) is overloaded, stop consuming temporarily rather than accumulating failed messages.
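The buffer-management strategy above can be sketched with a bounded in-process queue: when the buffer is full, the producer gets an immediate failure instead of the queue growing without limit:

```python
import queue

# Bounded buffer: reject new messages at capacity (fail-fast) rather
# than letting the backlog grow until the system falls over.
buffer = queue.Queue(maxsize=3)

accepted, rejected = 0, 0
for i in range(10):
    try:
        buffer.put_nowait(f"msg-{i}")   # raises queue.Full at capacity
        accepted += 1
    except queue.Full:
        rejected += 1                   # shed load / push back upstream

print(accepted, rejected)   # 3 7
```

The rejection count is the backpressure signal: the producer can slow down, retry later, or route the overflow elsewhere, but either way the problem surfaces immediately instead of as a broker out-of-disk incident hours later.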
# Monitoring Kafka consumer lag
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--group order-processor \
--describe
# Output:
# GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
# order-processor orders 0 15234 15240 6
# order-processor orders 1 18901 21500 2599 ← problem!
# order-processor orders 2 17003 17003 0
Dead Letter Queues
A Dead Letter Queue (DLQ) is a special queue where messages go when they can't be processed successfully. Think of it as the "undeliverable mail" bin at the post office.
When Messages Go to DLQ
- Max retries exceeded: After N failed processing attempts, the message is moved to the DLQ instead of being retried forever.
- Message rejected: Consumer explicitly rejects the message (e.g., invalid payload, schema mismatch).
- TTL expired: Message sat in the queue longer than its time-to-live.
- Queue overflow: Queue reached its maximum length, and oldest messages are evicted to DLQ.
Retry Strategy: Exponential Backoff
Don't retry immediately after a failure—the downstream system is probably still broken. Use exponential backoff with jitter:
// Retry delays: 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped)
function getRetryDelay(attempt) {
const baseDelay = 1000; // 1 second
const maxDelay = 60000; // 60 seconds cap
const exponential = baseDelay * Math.pow(2, attempt);
const capped = Math.min(exponential, maxDelay);
const jitter = capped * (0.5 + Math.random() * 0.5); // 50-100%
return jitter;
}
// Multi-stage retry with separate queues:
// main-queue → retry-queue-1 (1 min delay)
// → retry-queue-2 (5 min delay)
// → retry-queue-3 (30 min delay)
// → dead-letter-queue (manual intervention)
Handling DLQ Messages
- Alert immediately: DLQ growth is a critical signal. Set up alerts (CloudWatch, PagerDuty) when DLQ depth > 0.
- Inspect & diagnose: Log the message body, headers, and error reason. Most DLQ messages are caused by schema changes, null fields, or downstream outages.
- Replay after fix: Once the bug is fixed, replay DLQ messages back to the main queue. AWS provides the `StartMessageMoveTask` API for SQS DLQ redrive.
- Poison pill isolation: Some messages are permanently unprocessable (corrupt data, impossible state). After manual review, archive or delete them.
Event Streaming vs Message Queuing
These terms are often used interchangeably, but they represent fundamentally different paradigms:
| Dimension | Message Queue | Event Stream |
|---|---|---|
| Data model | Message (command or task) | Event (immutable fact) |
| After consumption | Deleted from queue | Retained in log |
| Replay | Not possible | Reset offset to replay |
| Ordering | FIFO within queue | Ordered within partition |
| Consumer groups | Competing consumers (one gets it) | Each group gets all messages |
| Backpressure | Queue grows, may need DLQ | Consumer controls pace via offset |
| Examples | RabbitMQ, SQS, ActiveMQ | Kafka, Kinesis, Pulsar, Redpanda |
When to Use Each
- Message Queue: When you have tasks to distribute. Each task should be processed by exactly one consumer. Examples: send email, resize image, process payment. The message is a command ("do this").
- Event Stream: When you have events (facts) that multiple services care about. The event is a notification ("this happened"). Examples: user signed up, order placed, payment received. Multiple independent consumers process the same event differently.
Stream Processing
Event streams enable real-time stream processing—continuously transforming, filtering, aggregating, and joining streams of events.
- Kafka Streams: A Java library (not a separate cluster) for building stream processing applications. Supports stateful operations (aggregations, joins, windowing) with exactly-once semantics. Runs embedded in your application.
- Apache Flink: A distributed stream processing framework. Handles complex event processing (CEP), windowed aggregations, exactly-once state, and event-time processing. Used by Uber, Alibaba, Netflix.
- Apache Spark Structured Streaming: Micro-batch stream processing. Higher latency than Flink but integrates well with the Spark ecosystem for ML pipelines.
// Kafka Streams: Real-time order aggregation
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");
// Count orders per product in 5-minute windows
KTable<Windowed<String>, Long> productCounts = orders
.groupBy((key, order) -> order.getProductId())
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
.count(Materialized.as("product-counts"));
// Write results to output topic
productCounts.toStream()
.map((windowedKey, count) -> KeyValue.pair(
windowedKey.key(),
windowedKey.window().start() + ": " + count))
.to("product-order-counts");
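The same 5-minute tumbling-window count can be sketched in plain Python (timestamps and product IDs are made up; Kafka Streams adds state stores, fault tolerance, and late-event handling on top of this core logic):

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000   # 5-minute tumbling windows

def window_start(timestamp_ms):
    """Align a timestamp to the start of its tumbling window."""
    return timestamp_ms - (timestamp_ms % WINDOW_MS)

# (timestamp_ms, product_id) events, as the streams topology sees them
events = [
    (0,       "widget"),
    (60_000,  "widget"),
    (120_000, "gadget"),
    (300_000, "widget"),    # falls into the second window
]

counts = defaultdict(int)
for ts, product in events:
    counts[(window_start(ts), product)] += 1   # groupBy + windowBy + count

print(dict(counts))
# {(0, 'widget'): 2, (0, 'gadget'): 1, (300000, 'widget'): 1}
```

The `(window_start, key)` pair plays the role of the `Windowed<String>` key in the Java example: each window keeps its own independent count per product.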
Message Queues in System Design
Message queues appear in almost every system design interview. Here's how to use them effectively.
When to Introduce a Queue
- Async work: Any operation that doesn't need to complete before responding to the user. "Your order has been placed" → queue the order processing.
- Rate mismatch: Producer generates events faster than the consumer can process. Classic example: web servers (fast) → data pipeline (slow).
- Decoupling services: When you want services to evolve independently. Adding a new subscriber shouldn't require changes to the publisher.
- Reliability: When you can't afford to lose work. Persisting to a queue is your safety net against consumer failures.
- Spike absorption: Flash sales, viral content, breaking news—queues absorb spikes that would otherwise crash your backend.
Common Interview Patterns
| System | Queue Use Case | Technology Choice |
|---|---|---|
| URL Shortener | Analytics event streaming (click tracking) | Kafka (high throughput, replay) |
| E-commerce | Order processing pipeline (payment, inventory, shipping) | SNS + SQS (fan-out + reliable delivery) |
| Chat Application | Message delivery to offline users | SQS per user or Kafka per-user partition |
| Notification System | Enqueue and prioritize notifications | RabbitMQ (priority queues, routing) |
| Log Aggregation | Collect logs from all services for centralized processing | Kafka (high volume, retention) |
| Rate Limiter | Smooth out bursty request patterns | SQS + Lambda (serverless scaling) |
Sizing Queues
In an interview, you'll need to estimate queue capacity:
// Example: E-commerce order queue sizing
Orders per day: 1,000,000
Average rate: 1,000,000 / 86,400 ≈ 12 orders/sec
Peak multiplier: 10x (flash sale) → ~120 orders/sec
Avg message size: 2 KB (JSON payload)
Peak throughput: 120 × 2 KB ≈ 240 KB/s
Daily storage: 1M × 2 KB = 2 GB
With 7-day retention (Kafka): 14 GB per partition set
Partitions needed: ~12 (for parallelism across 12 consumers)

// SQS: No sizing needed (serverless, auto-scales)
// Kafka: 12 partitions × 3 replicas = 36 partition replicas
// across a 3-broker cluster (12 per broker)
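The arithmetic is worth doing carefully in an interview: 1M orders/day averages only about 12/sec, and every downstream figure scales from that:

```python
orders_per_day = 1_000_000
avg_per_sec = orders_per_day / 86_400    # seconds per day
peak_per_sec = avg_per_sec * 10          # 10x flash-sale multiplier
msg_kb = 2                               # average message size

print(f"avg rate:        {avg_per_sec:.1f} orders/sec")
print(f"peak rate:       {peak_per_sec:.0f} orders/sec")
print(f"peak throughput: {peak_per_sec * msg_kb:.0f} KB/s")
print(f"daily storage:   {orders_per_day * msg_kb / 1_000_000:.0f} GB")
print(f"7-day retention: {7 * orders_per_day * msg_kb / 1_000_000:.0f} GB")
```

Even the 10x peak is a trivial load for any of the brokers discussed above; the sizing exercise mostly demonstrates that storage retention, not throughput, drives the capacity plan here.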
Monitoring Essentials
- Queue Depth / Consumer Lag: The #1 metric. If it's growing, consumers are falling behind.
- Message Age: How long the oldest message has been in the queue. Indicates processing staleness.
- Consumer Throughput: Messages processed per second per consumer. Used for scaling decisions.
- Error Rate: Percentage of messages that fail processing. Indicates code bugs or dependency issues.
- DLQ Depth: Must be zero in steady state. Any non-zero value triggers alerts.
- Producer Latency: Time to publish a message. High latency means the broker is overloaded.