Message Queues & Event Streaming
Why Message Queues?
In any distributed system, services need to communicate. The simplest approach is synchronous request-response: Service A calls Service B, waits for a response, then continues. This works until it doesn't—Service B goes down, gets overloaded, or is simply too slow. Now Service A is stuck, and the failure cascades.
Message queues solve this by introducing an intermediary buffer between producers (senders) and consumers (receivers). Instead of calling Service B directly, Service A drops a message into a queue and moves on. Service B picks it up when it's ready.
The Postal Service Analogy
Think of synchronous communication as a phone call: both parties must be available at the same time. A message queue is like the postal service: you drop a letter in the mailbox and go about your day. The postal service handles delivery, buffering (holding mail at the post office if the recipient isn't home), and retry (redelivery attempts). The sender and receiver are completely decoupled in time and space.
Core Benefits
- Decoupling: Producers and consumers evolve independently. The producer doesn't need to know who consumes its messages, how many consumers there are, or what technology they use.
- Asynchronous Processing: The producer doesn't wait for the consumer. A web server can accept an order, enqueue a message, and return a 202 Accepted in milliseconds, while order processing takes minutes.
- Traffic Buffering (Load Leveling): During a flash sale, traffic might spike 100x. Without a queue, your downstream services must handle 100x load or drop requests. With a queue, the spike is absorbed into the buffer, and consumers process at their own pace.
- Resilience & Fault Tolerance: If a consumer crashes, messages remain in the queue. When the consumer restarts, it picks up where it left off. No data is lost.
- Fan-out: A single message can trigger processing in multiple independent services (notifications, analytics, inventory, etc.).
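The decoupling and async-processing benefits above can be seen with a toy in-process sketch using Python's standard `queue` module. The handler name and message shape are illustrative; a real system would use a broker, not an in-memory queue:

```python
import queue
import threading

# Toy in-process "broker": the producer returns immediately after
# enqueueing; a worker drains the buffer at its own pace.
order_queue = queue.Queue()
processed = []

def handle_order_request(order_id):
    """Hypothetical web handler: enqueue and respond without waiting."""
    order_queue.put({"order_id": order_id})
    return "202 Accepted"

def consume():
    while True:
        msg = order_queue.get()
        if msg is None:                    # sentinel: stop the worker
            break
        processed.append(msg["order_id"])  # slow work would happen here

worker = threading.Thread(target=consume)
worker.start()

responses = [handle_order_request(f"order-{i}") for i in range(3)]
order_queue.put(None)                      # signal shutdown
worker.join()

print(responses)   # ['202 Accepted', '202 Accepted', '202 Accepted']
print(processed)   # ['order-0', 'order-1', 'order-2']
```

The producer never waits on the consumer: it enqueues and returns, and the messages survive in the buffer until the worker gets to them.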
Synchronous vs Asynchronous Communication
| Aspect | Synchronous (HTTP/gRPC) | Asynchronous (Queue) |
|---|---|---|
| Coupling | Tight—caller blocks until response | Loose—fire and forget |
| Latency | Dependent on downstream service | Producer returns immediately |
| Failure Handling | Cascading failures | Messages buffered; consumer retries |
| Scaling | Both sides must scale together | Producer and consumer scale independently |
| Use Case | Read queries, real-time responses | Background jobs, event processing, notifications |
Point-to-Point vs Pub/Sub
There are two fundamental messaging patterns, and understanding when to use each is critical for system design.
Point-to-Point (Queue)
In the point-to-point model, a message is produced by one producer and consumed by exactly one consumer. Once a consumer acknowledges a message, it's removed from the queue. If multiple consumers listen on the same queue, the broker distributes messages across them for load balancing (competing consumers pattern).
Producer --> [Queue] --> Consumer A (gets M1, M3, M5)
--> Consumer B (gets M2, M4, M6)
Use cases: task distribution (send emails, process images, run batch jobs), work queues where each task must be processed exactly once.
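The competing-consumers behavior can be sketched with two workers pulling from one shared queue. This is a minimal in-process model, not a broker client; consumer names and messages are made up:

```python
import queue
import threading

work_queue = queue.Queue()
for i in range(1, 7):
    work_queue.put(f"M{i}")

received = {"A": [], "B": []}

def worker(name):
    # Each get() removes the message for everyone, so every task is
    # processed by exactly one of the competing consumers.
    while True:
        try:
            msg = work_queue.get_nowait()
        except queue.Empty:
            return
        received[name].append(msg)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_msgs = sorted(received["A"] + received["B"])
print(all_msgs)   # ['M1', 'M2', 'M3', 'M4', 'M5', 'M6'] (each exactly once)
```

Which worker gets which message depends on timing, but the union is always all six messages with no duplicates, which is exactly the point-to-point guarantee.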
Publish/Subscribe (Topic)
In the pub/sub model, a message is published to a topic, and every subscriber to that topic receives a copy of the message. The publisher doesn't know or care how many subscribers exist.
Publisher --> [Topic: order-created]
|--> Notification Service (sends email)
|--> Inventory Service (reserves stock)
|--> Analytics Service (records metrics)
Use cases: event-driven architectures, broadcasting state changes, fan-out to multiple downstream services.
Fan-out Pattern
Fan-out combines pub/sub with point-to-point. A message is published to a topic, which fans out to multiple queues, each with its own competing consumers. This gives you both broadcast delivery and independent scaling per consumer group.
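A minimal sketch of the topic side of this pattern: every subscribed queue receives the published message independently, so one service consuming it does not remove it for the others. The `Topic` class and service names are illustrative:

```python
import queue

class Topic:
    """Minimal pub/sub topic: publish() puts the message on every
    subscribed queue, so each consumer group sees every event."""
    def __init__(self):
        self.queues = {}

    def subscribe(self, name):
        q = queue.Queue()
        self.queues[name] = q
        return q

    def publish(self, message):
        for q in self.queues.values():   # fan-out to every bound queue
            q.put(message)

topic = Topic()
notifications = topic.subscribe("notification-service")
inventory = topic.subscribe("inventory-service")

topic.publish({"event": "order-created", "order_id": 42})

# Both subscribers receive the event; neither "steals" it from the other.
n_msg = notifications.get()
i_msg = inventory.get()
print(n_msg, i_msg)
```

Attach multiple competing consumers to each subscriber queue and you have the full fan-out pattern: broadcast at the topic, load balancing within each queue.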
Comparison
| Feature | Point-to-Point | Pub/Sub |
|---|---|---|
| Consumers per message | Exactly one | All subscribers |
| Message lifetime | Deleted after ack | Retained per retention policy |
| Use case | Task distribution, work queues | Event broadcasting, fan-out |
| Coupling | Producer knows queue exists | Publisher knows only the topic |
| Examples | SQS, RabbitMQ queues | SNS, Kafka topics, RabbitMQ fanout exchange |
Apache Kafka Deep Dive
Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day. Unlike traditional message queues, Kafka is designed as a distributed commit log—messages are persisted to disk and can be replayed.
Core Concepts
- Topic: A named category/feed of messages. Analogous to a database table. Examples: `order-events`, `user-clicks`, `payment-transactions`.
- Partition: Each topic is split into one or more partitions, the unit of parallelism. Each partition is an ordered, immutable sequence of records, each assigned a sequential offset.
- Broker: A Kafka server. A cluster typically has 3–100+ brokers. Each partition has one leader broker and N-1 follower replicas.
- Producer: Publishes records to topics. Chooses which partition to write to (by key hash, round-robin, or custom partitioner).
- Consumer Group: A set of consumers that cooperatively consume a topic. Each partition is assigned to exactly one consumer in the group, enabling parallel consumption.
- Offset: A unique sequential ID for each record within a partition. Consumers track their position via offsets. This enables replay, rewind, and independent progress per consumer group.
- ZooKeeper / KRaft: ZooKeeper historically handled cluster metadata, leader election, and configuration management. KRaft mode (Kafka Raft), introduced in Kafka 2.8 and production-ready since 3.3, removes the ZooKeeper dependency.
Write Path
Producer
|
|--> Serialize record (key, value, timestamp, headers)
|--> Determine partition (hash(key) % num_partitions, or round-robin)
|--> Send to partition leader broker
|
Leader Broker
|--> Append to partition log segment on disk (sequential I/O)
|--> Replicate to follower brokers (in-sync replicas / ISR)
|--> Once min.insync.replicas have acknowledged → ack to producer
|
Producer receives ack (based on acks setting):
acks=0 → fire and forget (no ack waited)
acks=1 → leader wrote it (fast, slight risk)
acks=all → all ISR replicas wrote it (safest)
Read Path
Consumer (in a Consumer Group)
|
|--> Assigned partitions by Group Coordinator
|--> Fetches records starting from committed offset
|--> Processes batch of records
|--> Commits offset (auto-commit or manual)
|
Offset Storage: __consumer_offsets internal topic
|--> Tracks (group, topic, partition) → offset
|--> Enables restart from last committed position
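The "assigned partitions by Group Coordinator" step can be sketched as a range-style assignment. This is a rough simplification of what Kafka's `RangeAssignor` does; the real assignors also handle multiple topics, rebalances, and stickiness:

```python
def range_assign(partitions, consumers):
    """Sketch of range-style assignment: split the partition list into
    contiguous chunks, one per consumer; the remainder goes to the
    first consumers in sorted order."""
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(sorted(consumers)):
        count = per + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment

# 6 partitions across 2 consumers: each owns 3; no partition is shared.
print(range_assign(list(range(6)), ["consumer-a", "consumer-b"]))
# {'consumer-a': [0, 1, 2], 'consumer-b': [3, 4, 5]}

# 5 partitions across 3 consumers: the first two absorb the extras.
print(range_assign(list(range(5)), ["c1", "c2", "c3"]))
# {'c1': [0, 1], 'c2': [2, 3], 'c3': [4]}
```

The key invariant is visible in the output: every partition belongs to exactly one consumer in the group, which is why adding consumers beyond the partition count gains nothing.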
Ordering Guarantees
Kafka guarantees ordering within a single partition only. If you need all events for a given entity (e.g., `user_id=42`) to stay in order, use that entity's ID as the message key: the default partitioner hashes the key (murmur2) modulo the partition count, so all records with the same key land on the same partition.
Using `user_id` as the key keeps each user's events ordered. A random key maximizes throughput but destroys per-entity ordering. Hot keys (e.g., a single celebrity's `user_id` receiving 90% of traffic) create partition skew.
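The same-key-same-partition property only needs a stable hash. A sketch using `hashlib` (Kafka's actual partitioner uses murmur2, but any stable hash demonstrates the idea; `partition_for` is a made-up name):

```python
import hashlib

def partition_for(key, num_partitions):
    """Stable key-based partitioning: hash the key, take it modulo the
    partition count. Same key always maps to the same partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed by user-42 lands on one partition, preserving order.
p = partition_for("user-42", 12)
assert all(partition_for("user-42", 12) == p for _ in range(100))

# Different keys spread across the 12 partitions (not guaranteed to
# differ for any particular pair, but well-distributed in aggregate).
print({k: partition_for(k, 12) for k in ["user-1", "user-2", "user-42"]})
```

Note that Python's built-in `hash()` would not work here: it is salted per process, so two producer instances would disagree on partition placement.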
Retention Policies
Unlike traditional queues, Kafka doesn't delete messages after consumption. Messages are retained based on:
- Time-based: `retention.ms=604800000` (7 days default). Messages older than this are eligible for deletion.
- Size-based: `retention.bytes=1073741824` (1 GB). When a partition exceeds this, the oldest segments are deleted.
- Compaction: `cleanup.policy=compact`. Kafka keeps only the latest record for each key. Perfect for changelogs, state snapshots, and event sourcing. Example: a topic tracking user profiles keeps only the latest profile per `user_id`.
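Compaction semantics can be sketched in a few lines: scan the log, remember the latest record per key, and keep survivors in log order. The keys and values here are made up:

```python
def compact(log):
    """Sketch of log compaction: keep only the latest record per key,
    preserving the order in which surviving records appear in the log."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later offsets overwrite earlier
    return [(key, value) for key, (offset, value)
            in sorted(latest.items(), key=lambda kv: kv[1][0])]

log = [
    ("user-1", {"name": "Ada"}),
    ("user-2", {"name": "Bob"}),
    ("user-1", {"name": "Ada Lovelace"}),   # supersedes the first record
]
print(compact(log))
# [('user-2', {'name': 'Bob'}), ('user-1', {'name': 'Ada Lovelace'})]
```

Real compaction runs segment by segment in the background and also handles tombstones (null values that delete a key), but the invariant is the same: one latest value per key.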
Performance Characteristics
Kafka achieves extraordinary throughput through several design choices:
- Sequential I/O: Appending to the end of a log is an O(1), purely sequential disk operation. Sequential writes are far faster than random ones; Kafka's design documentation notes that sequential disk access can even rival random memory access in some benchmarks.
- Zero-copy transfer: Uses the `sendfile()` system call to transfer data directly from the disk page cache to the network socket, bypassing user space.
- Batching: Producers batch multiple records into a single network request. Consumers fetch in batches.
- Compression: LZ4, Snappy, ZSTD, or GZIP compression at the batch level reduces network I/O.
- Partitioning: Horizontal scaling—more partitions means more parallelism.
A single Kafka broker can handle hundreds of MB/s of writes and reads. LinkedIn's Kafka clusters process 7+ trillion messages per day.
RabbitMQ
RabbitMQ is a traditional message broker implementing the Advanced Message Queuing Protocol (AMQP). Unlike Kafka's log-based approach, RabbitMQ is a smart broker / dumb consumer model: the broker manages message routing, delivery, and acknowledgment.
AMQP Architecture
RabbitMQ introduces the concept of exchanges—routers that sit between producers and queues:
Producer → Exchange → Binding → Queue → Consumer
↑ ↑
routing logic binding key/pattern
Exchange Types
| Exchange | Routing Logic | Example Use Case |
|---|---|---|
| Direct | Message routing key must exactly match the binding key | Route payment.success to payment queue |
| Topic | Routing key matches a pattern with * (one word) and # (zero or more words) | order.*.created matches order.us.created |
| Fanout | Ignores routing key; broadcasts to all bound queues | Broadcast order events to all services |
| Headers | Routes based on message header attributes (key-value pairs) instead of routing key | Route by content-type, priority, or custom headers |
Key Mechanisms
- Acknowledgments (ack/nack): A consumer sends an `ack` after processing a message. If the consumer crashes before acking, RabbitMQ redelivers the message to another consumer. Use `basic.nack` with `requeue=true` to reject and requeue, or `requeue=false` to send to a dead letter exchange.
- Prefetch Count: Controls how many unacknowledged messages the broker sends to a consumer. `prefetch=1` means the consumer gets one message at a time (fair dispatch but slower). `prefetch=100` improves throughput but risks memory pressure.
- Dead Letter Exchanges (DLX): When a message is rejected, expires (TTL), or the queue exceeds its max length, it's routed to a configured DLX for inspection or retry.
- Publisher Confirms: The broker acknowledges receipt of a published message, ensuring the producer knows the message reached the queue.
- Durable Queues & Persistent Messages: Queues survive broker restarts when declared `durable`. Messages survive when published with `delivery_mode=2` (persistent).
RabbitMQ Configuration Example
# Python with pika library
import pika
connection = pika.BlockingConnection(
pika.ConnectionParameters('rabbitmq-host', 5672,
credentials=pika.PlainCredentials('user', 'pass'))
)
channel = connection.channel()
# Declare durable exchange and queue
channel.exchange_declare(
exchange='orders',
exchange_type='topic',
durable=True
)
channel.queue_declare(
queue='notification-service',
durable=True,
arguments={
'x-dead-letter-exchange': 'orders-dlx',
'x-message-ttl': 300000, # 5 min TTL
'x-max-length': 100000 # max 100k messages
}
)
channel.queue_bind(
queue='notification-service',
exchange='orders',
routing_key='order.*.created' # topic pattern
)
# Consumer with manual ack
channel.basic_qos(prefetch_count=10)
def callback(ch, method, properties, body):
try:
process_order(body)
ch.basic_ack(delivery_tag=method.delivery_tag)
except Exception:
ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
channel.basic_consume(
    queue='notification-service',
    on_message_callback=callback
)
channel.start_consuming()  # blocks, dispatching deliveries to callback
When to Use RabbitMQ vs Kafka
| Criteria | RabbitMQ | Kafka |
|---|---|---|
| Model | Smart broker, dumb consumer | Dumb broker, smart consumer |
| Throughput | ~50k msg/s per node | ~1M+ msg/s per broker |
| Retention | Messages deleted after ack | Retained for configurable period |
| Replay | Not supported | Full replay by resetting offset |
| Routing | Rich (exchanges, bindings, patterns) | Simple (topics, partitions) |
| Protocol | AMQP, STOMP, MQTT | Custom binary protocol |
| Best for | Complex routing, task queues, RPC | Event streaming, log aggregation, high throughput |
AWS SQS & SNS
AWS provides fully managed messaging services that eliminate the operational overhead of running Kafka or RabbitMQ clusters yourself.
Amazon SQS (Simple Queue Service)
SQS is a fully managed, serverless message queue. You don't provision brokers, manage partitions, or worry about replication—AWS handles all of it.
Standard vs FIFO Queues
| Feature | Standard | FIFO |
|---|---|---|
| Throughput | Nearly unlimited | 300 msg/s (3,000 with batching; more with high-throughput mode) |
| Ordering | Best-effort (may be out of order) | Strict FIFO within message group |
| Delivery | At-least-once (may duplicate) | Exactly-once processing |
| Queue name | Any name | Must end in .fifo |
Key SQS Concepts
- Visibility Timeout: When a consumer receives a message, it becomes invisible to other consumers for the timeout period (default 30s). If the consumer doesn't delete the message in time, it reappears in the queue for another consumer. Set this to slightly longer than your max processing time.
- Long Polling: Instead of repeatedly asking "any messages?" (short polling), long polling waits up to 20 seconds for a message to arrive. Reduces empty responses and API costs by up to 90%.
- Dead Letter Queue (DLQ): After N failed processing attempts (`maxReceiveCount`), messages are moved to a DLQ for inspection. Critical for debugging poisoned messages.
- Message Retention: 1 minute to 14 days (default 4 days).
- Message Size: Up to 256 KB. For larger payloads, store in S3 and send a reference.
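Visibility timeout is easiest to understand by simulating it. A toy in-process model (the `VisibilityQueue` class is invented for illustration; real SQS is an API, not a library):

```python
class VisibilityQueue:
    """Toy model of SQS visibility timeout: a received message is hidden
    from other consumers until the timeout elapses or it is deleted."""
    def __init__(self, visibility_timeout):
        self.timeout = visibility_timeout
        self.messages = {}          # id -> (body, invisible_until)
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = (body, 0.0)
        self.next_id += 1

    def receive(self, now):
        for msg_id, (body, invisible_until) in self.messages.items():
            if now >= invisible_until:
                # Hide the message for the timeout period.
                self.messages[msg_id] = (body, now + self.timeout)
                return msg_id, body
        return None                 # nothing visible right now

    def delete(self, msg_id):       # "ack": processing finished
        self.messages.pop(msg_id, None)

q = VisibilityQueue(visibility_timeout=30.0)
q.send("process order 42")

msg_id, body = q.receive(now=0.0)
assert q.receive(now=10.0) is None     # hidden while being processed
redelivered = q.receive(now=31.0)      # timeout passed without delete:
assert redelivered is not None         # the message reappears
q.delete(redelivered[0])
assert q.receive(now=60.0) is None     # deleted for good
```

This is also why the timeout should exceed your worst-case processing time: finish too slowly and a second consumer receives the same message, which is one source of SQS's at-least-once duplicates.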
# AWS CLI: Create FIFO queue with DLQ
aws sqs create-queue \
--queue-name orders.fifo \
--attributes '{
"FifoQueue": "true",
"ContentBasedDeduplication": "true",
"VisibilityTimeout": "60",
"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789:orders-dlq.fifo\",\"maxReceiveCount\":\"3\"}"
}'
Amazon SNS (Simple Notification Service)
SNS is a fully managed pub/sub service. You publish a message to a topic, and SNS delivers it to all subscribed endpoints: SQS queues, Lambda functions, HTTP endpoints, email, SMS, or mobile push.
SNS + SQS: The Fan-out Pattern
The most common production pattern combines SNS and SQS. An event is published to an SNS topic, which fans out to multiple SQS queues. Each queue has its own consumer(s) that process independently and at their own pace.
Order Service
|
|--> Publish to SNS topic "order-events"
|
|--> SQS: notification-queue --> Notification Service
|--> SQS: inventory-queue --> Inventory Service
|--> SQS: analytics-queue --> Analytics Service
|--> Lambda: fraud-check --> Fraud Detection
This pattern gives you the best of both worlds: broadcast delivery (SNS) with independent, reliable, buffered processing (SQS).
Delivery Guarantees
One of the most important concepts in messaging systems is the delivery guarantee. There are three levels, each with different trade-offs:
At-Most-Once
Fire and forget. The producer sends the message and doesn't wait for an ack. The message may be lost if the broker or network fails. No retries.
// Kafka: acks=0 → at-most-once
props.put("acks", "0"); // don't wait for broker ack
props.put("retries", 0); // no retries
Use case: Metrics collection, logging, clickstream where losing a few events is acceptable.
At-Least-Once
The producer retries until it receives an acknowledgment. The message is guaranteed to be delivered, but may be delivered multiple times (duplicates). This is the most common guarantee in production systems.
// Kafka: acks=all + retries → at-least-once
props.put("acks", "all");
props.put("retries", Integer.MAX_VALUE);
props.put("max.in.flight.requests.per.connection", "5");
Challenge: Consumers must be idempotent—processing the same message twice must produce the same result.
Exactly-Once
The hardest guarantee. Each message is delivered and processed exactly once, with no duplicates and no losses. Kafka achieves this through idempotent producers and transactions:
// Kafka exactly-once semantics (EOS)
props.put("enable.idempotence", "true"); // dedup at broker
props.put("transactional.id", "order-processor"); // enable transactions
// Producer transaction
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(new ProducerRecord<>("output-topic", key, value));
// Commit consumer offsets AND producer records atomically
producer.sendOffsetsToTransaction(offsets, consumerGroupId);
producer.commitTransaction();
} catch (Exception e) {
producer.abortTransaction();
}
Idempotent Consumer Strategies
- Deduplication by ID: Store processed message IDs in a database. Before processing, check if the ID exists. Use a unique `message_id` or `idempotency_key`.
- Database upserts: Use `INSERT ... ON CONFLICT DO UPDATE` instead of plain inserts. Reprocessing the same record just overwrites with the same data.
- Idempotent operations: "Set balance to $100" is idempotent. "Add $10 to balance" is not. Design your events as state assertions, not deltas.
- Conditional writes: Use optimistic locking. "Update if version = 5" will fail on duplicate processing if version is already 6.
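The dedup-by-ID strategy can be sketched in a few lines. The `message_id` field name and the in-memory set are stand-ins; production systems store seen IDs in a database or cache and make the check-and-record step atomic (e.g., via a unique constraint):

```python
processed_ids = set()   # stand-in for a DB table with a unique index
side_effects = []

def handle(message):
    """Skip messages whose idempotency key has already been seen."""
    msg_id = message["message_id"]
    if msg_id in processed_ids:
        return "skipped (duplicate)"
    side_effects.append(message["action"])   # the non-idempotent work
    processed_ids.add(msg_id)
    return "processed"

# At-least-once delivery redelivers the same message...
print(handle({"message_id": "m-1", "action": "charge $10"}))  # processed
print(handle({"message_id": "m-1", "action": "charge $10"}))  # skipped (duplicate)
print(side_effects)   # ['charge $10']: the charge happened exactly once
```

Note the race lurking here: with concurrent consumers, two workers could pass the `in` check before either records the ID, which is why the real check must be a single atomic write.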
Backpressure
Backpressure occurs when consumers can't keep up with the rate of incoming messages. The queue grows unboundedly, consuming memory and disk until the system fails. Handling backpressure is critical for production systems.
Symptoms
- Queue depth (lag) growing continuously
- Consumer processing latency increasing
- Memory/disk usage on broker climbing
- Message age in queue exceeding SLAs
Strategies
- Scale consumers horizontally: Add more consumer instances. In Kafka, add consumers to the group (up to the number of partitions). In SQS, spin up more Lambda functions or ECS tasks.
- Rate-limit producers: If consumers can't keep up, slow down the producers. Use token bucket or leaky bucket rate limiters. This pushes backpressure upstream.
- Increase batch size: Consumers that process messages in batches (e.g., bulk database inserts) are orders of magnitude faster than one-at-a-time processing.
- Buffer management: Set maximum queue lengths. When the queue is full, either reject new messages (fail-fast), drop oldest messages, or apply backpressure to the producer.
- Circuit breaker: If the consumer's downstream dependency (e.g., database) is overloaded, stop consuming temporarily rather than accumulating failed messages.
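The buffer-management strategy above can be sketched with a bounded in-process queue: when the buffer is full, the producer gets an immediate failure instead of the queue growing without limit:

```python
import queue

# Bounded buffer: reject new messages at capacity (fail-fast) rather
# than letting the backlog grow until the system falls over.
buffer = queue.Queue(maxsize=3)

accepted, rejected = 0, 0
for i in range(10):
    try:
        buffer.put_nowait(f"msg-{i}")   # raises queue.Full at capacity
        accepted += 1
    except queue.Full:
        rejected += 1                   # shed load / push back upstream

print(accepted, rejected)   # 3 7
```

The rejection count is the backpressure signal: the producer can slow down, retry later, or route the overflow elsewhere, but either way the problem surfaces immediately instead of as a broker out-of-disk incident hours later.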
# Monitoring Kafka consumer lag
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--group order-processor \
--describe
# Output:
# GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
# order-processor orders 0 15234 15240 6
# order-processor orders 1 18901 21500 2599 ← problem!
# order-processor orders 2 17003 17003 0
Dead Letter Queues
A Dead Letter Queue (DLQ) is a special queue where messages go when they can't be processed successfully. Think of it as the "undeliverable mail" bin at the post office.
When Messages Go to DLQ
- Max retries exceeded: After N failed processing attempts, the message is moved to the DLQ instead of being retried forever.
- Message rejected: Consumer explicitly rejects the message (e.g., invalid payload, schema mismatch).
- TTL expired: Message sat in the queue longer than its time-to-live.
- Queue overflow: Queue reached its maximum length, and oldest messages are evicted to DLQ.
Retry Strategy: Exponential Backoff
Don't retry immediately after a failure—the downstream system is probably still broken. Use exponential backoff with jitter:
// Retry delays: 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped)
function getRetryDelay(attempt) {
const baseDelay = 1000; // 1 second
const maxDelay = 60000; // 60 seconds cap
const exponential = baseDelay * Math.pow(2, attempt);
const capped = Math.min(exponential, maxDelay);
const jitter = capped * (0.5 + Math.random() * 0.5); // 50-100%
return jitter;
}
// Multi-stage retry with separate queues:
// main-queue → retry-queue-1 (1 min delay)
// → retry-queue-2 (5 min delay)
// → retry-queue-3 (30 min delay)
// → dead-letter-queue (manual intervention)
Handling DLQ Messages
- Alert immediately: DLQ growth is a critical signal. Set up alerts (CloudWatch, PagerDuty) when DLQ depth > 0.
- Inspect & diagnose: Log the message body, headers, and error reason. Most DLQ messages are caused by schema changes, null fields, or downstream outages.
- Replay after fix: Once the bug is fixed, replay DLQ messages back to the main queue. AWS provides the `StartMessageMoveTask` API for SQS DLQ redrive.
- Poison pill isolation: Some messages are permanently unprocessable (corrupt data, impossible state). After manual review, archive or delete them.
Event Streaming vs Message Queuing
These terms are often used interchangeably, but they represent fundamentally different paradigms:
| Dimension | Message Queue | Event Stream |
|---|---|---|
| Data model | Message (command or task) | Event (immutable fact) |
| After consumption | Deleted from queue | Retained in log |
| Replay | Not possible | Reset offset to replay |
| Ordering | FIFO within queue | Ordered within partition |
| Consumer groups | Competing consumers (one gets it) | Each group gets all messages |
| Backpressure | Queue grows, may need DLQ | Consumer controls pace via offset |
| Examples | RabbitMQ, SQS, ActiveMQ | Kafka, Kinesis, Pulsar, Redpanda |
When to Use Each
- Message Queue: When you have tasks to distribute. Each task should be processed by exactly one consumer. Examples: send email, resize image, process payment. The message is a command ("do this").
- Event Stream: When you have events (facts) that multiple services care about. The event is a notification ("this happened"). Examples: user signed up, order placed, payment received. Multiple independent consumers process the same event differently.
Stream Processing
Event streams enable real-time stream processing—continuously transforming, filtering, aggregating, and joining streams of events.
- Kafka Streams: A Java library (not a separate cluster) for building stream processing applications. Supports stateful operations (aggregations, joins, windowing) with exactly-once semantics. Runs embedded in your application.
- Apache Flink: A distributed stream processing framework. Handles complex event processing (CEP), windowed aggregations, exactly-once state, and event-time processing. Used by Uber, Alibaba, Netflix.
- Apache Spark Structured Streaming: Micro-batch stream processing. Higher latency than Flink but integrates well with the Spark ecosystem for ML pipelines.
// Kafka Streams: Real-time order aggregation
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");
// Count orders per product in 5-minute windows
KTable<Windowed<String>, Long> productCounts = orders
.groupBy((key, order) -> order.getProductId())
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
.count(Materialized.as("product-counts"));
// Write results to output topic
productCounts.toStream()
.map((windowedKey, count) -> KeyValue.pair(
windowedKey.key(),
windowedKey.window().start() + ": " + count))
.to("product-order-counts");
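The same 5-minute tumbling-window count can be sketched in plain Python (timestamps and product IDs are made up; Kafka Streams adds state stores, fault tolerance, and late-event handling on top of this core logic):

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000   # 5-minute tumbling windows

def window_start(timestamp_ms):
    """Align a timestamp to the start of its tumbling window."""
    return timestamp_ms - (timestamp_ms % WINDOW_MS)

# (timestamp_ms, product_id) events, as the streams topology sees them
events = [
    (0,       "widget"),
    (60_000,  "widget"),
    (120_000, "gadget"),
    (300_000, "widget"),    # falls into the second window
]

counts = defaultdict(int)
for ts, product in events:
    counts[(window_start(ts), product)] += 1   # groupBy + windowBy + count

print(dict(counts))
# {(0, 'widget'): 2, (0, 'gadget'): 1, (300000, 'widget'): 1}
```

The `(window_start, key)` pair plays the role of the `Windowed<String>` key in the Java example: each window keeps its own independent count per product.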
Message Queues in System Design
Message queues appear in almost every system design interview. Here's how to use them effectively.
When to Introduce a Queue
- Async work: Any operation that doesn't need to complete before responding to the user. "Your order has been placed" → queue the order processing.
- Rate mismatch: Producer generates events faster than the consumer can process. Classic example: web servers (fast) → data pipeline (slow).
- Decoupling services: When you want services to evolve independently. Adding a new subscriber shouldn't require changes to the publisher.
- Reliability: When you can't afford to lose work. Persisting to a queue is your safety net against consumer failures.
- Spike absorption: Flash sales, viral content, breaking news—queues absorb spikes that would otherwise crash your backend.
Common Interview Patterns
| System | Queue Use Case | Technology Choice |
|---|---|---|
| URL Shortener | Analytics event streaming (click tracking) | Kafka (high throughput, replay) |
| E-commerce | Order processing pipeline (payment, inventory, shipping) | SNS + SQS (fan-out + reliable delivery) |
| Chat Application | Message delivery to offline users | SQS per user or Kafka per-user partition |
| Notification System | Enqueue and prioritize notifications | RabbitMQ (priority queues, routing) |
| Log Aggregation | Collect logs from all services for centralized processing | Kafka (high volume, retention) |
| Rate Limiter | Smooth out bursty request patterns | SQS + Lambda (serverless scaling) |
Sizing Queues
In an interview, you'll need to estimate queue capacity:
// Example: E-commerce order queue sizing
Orders per day: 1,000,000
Average rate: 1,000,000 / 86,400 ≈ 12 orders/sec
Peak multiplier: 10x (flash sale) → ~120 orders/sec
Avg message size: 2 KB (JSON payload)
Peak throughput: 120 × 2 KB ≈ 240 KB/s
Daily storage: 1M × 2 KB = 2 GB
With 7-day retention (Kafka): 14 GB per partition set
Partitions needed: ~12 (for parallelism across 12 consumers)

// SQS: No sizing needed (serverless, auto-scales)
// Kafka: 12 partitions × 3 replicas = 36 partition replicas
// across a 3-broker cluster (12 per broker)
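The arithmetic is worth doing carefully in an interview: 1M orders/day averages only about 12/sec, and every downstream figure scales from that:

```python
orders_per_day = 1_000_000
avg_per_sec = orders_per_day / 86_400    # seconds per day
peak_per_sec = avg_per_sec * 10          # 10x flash-sale multiplier
msg_kb = 2                               # average message size

print(f"avg rate:        {avg_per_sec:.1f} orders/sec")
print(f"peak rate:       {peak_per_sec:.0f} orders/sec")
print(f"peak throughput: {peak_per_sec * msg_kb:.0f} KB/s")
print(f"daily storage:   {orders_per_day * msg_kb / 1_000_000:.0f} GB")
print(f"7-day retention: {7 * orders_per_day * msg_kb / 1_000_000:.0f} GB")
```

Even the 10x peak is a trivial load for any of the brokers discussed above; the sizing exercise mostly demonstrates that storage retention, not throughput, drives the capacity plan here.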
Monitoring Essentials
- Queue Depth / Consumer Lag: The #1 metric. If it's growing, consumers are falling behind.
- Message Age: How long the oldest message has been in the queue. Indicates processing staleness.
- Consumer Throughput: Messages processed per second per consumer. Used for scaling decisions.
- Error Rate: Percentage of messages that fail processing. Indicates code bugs or dependency issues.
- DLQ Depth: Must be zero in steady state. Any non-zero value triggers alerts.
- Producer Latency: Time to publish a message. High latency means the broker is overloaded.