High Level Design Series · Building Blocks · Part 1

Message Queues & Event Streaming

Why Message Queues?

In any distributed system, services need to communicate. The simplest approach is synchronous request-response: Service A calls Service B, waits for a response, then continues. This works until it doesn't—Service B goes down, gets overloaded, or is simply too slow. Now Service A is stuck, and the failure cascades.

Message queues solve this by introducing an intermediary buffer between producers (senders) and consumers (receivers). Instead of calling Service B directly, Service A drops a message into a queue and moves on. Service B picks it up when it's ready.
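The decoupling can be sketched in a few lines of Python, with the standard library's thread-safe queue.Queue standing in for the broker (a toy illustration, not a production broker):

```python
import queue
import threading

broker = queue.Queue()  # stand-in for the message broker

def service_a():
    """Producer: drops messages into the queue and returns immediately."""
    for i in range(3):
        broker.put(f"task-{i}")
    # Service A never waits on Service B

processed = []

def service_b():
    """Consumer: picks messages up when it is ready."""
    while True:
        msg = broker.get()
        if msg is None:            # sentinel: no more work
            break
        processed.append(msg)
        broker.task_done()

service_a()                         # produce first...
worker = threading.Thread(target=service_b)
worker.start()                      # ...consume later: decoupled in time
broker.put(None)
worker.join()
print(processed)                    # ['task-0', 'task-1', 'task-2']
```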

The Postal Service Analogy

Think of synchronous communication as a phone call: both parties must be available at the same time. A message queue is like the postal service: you drop a letter in the mailbox and go about your day. The postal service handles delivery, buffering (holding mail at the post office if the recipient isn't home), and retry (redelivery attempts). The sender and receiver are completely decoupled in time and space.

Core Benefits

- Decoupling: producers and consumers can be deployed, scaled, and fail independently
- Load leveling: the queue absorbs traffic spikes so consumers process at a steady rate
- Resilience: messages are buffered while a consumer is down and processed on recovery
- Scalability: add consumers to raise throughput without touching producers
- Asynchronous processing: slow work moves off the request path

Synchronous vs Asynchronous Communication

| Aspect | Synchronous (HTTP/gRPC) | Asynchronous (Queue) |
|---|---|---|
| Coupling | Tight—caller blocks until response | Loose—fire and forget |
| Latency | Dependent on downstream service | Producer returns immediately |
| Failure handling | Cascading failures | Messages buffered; consumer retries |
| Scaling | Both sides must scale together | Producer and consumer scale independently |
| Use case | Read queries, real-time responses | Background jobs, event processing, notifications |

[Interactive demo: Producer-Consumer Flow. Watch messages flow from producers through a queue to consumers with load balancing.]

Point-to-Point vs Pub/Sub

There are two fundamental messaging patterns, and understanding when to use each is critical for system design.

Point-to-Point (Queue)

In the point-to-point model, a message is produced by one producer and consumed by exactly one consumer. Once a consumer acknowledges a message, it's removed from the queue. If multiple consumers listen on the same queue, the broker distributes messages across them for load balancing (competing consumers pattern).

Producer --> [Queue] --> Consumer A  (gets M1, M3, M5)
                     --> Consumer B  (gets M2, M4, M6)

Use cases: task distribution (send emails, process images, run batch jobs), work queues where each task must be processed exactly once.
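A minimal sketch of competing consumers, using Python's thread-safe queue.Queue as a stand-in for the broker; each message is delivered to exactly one of the two workers:

```python
import queue
import threading

work_queue = queue.Queue()
results = {"A": [], "B": []}

def consumer(name):
    while True:
        msg = work_queue.get()
        if msg is None:             # sentinel: stop this consumer
            break
        results[name].append(msg)   # each message handled by exactly one consumer
        work_queue.task_done()

for i in range(1, 7):
    work_queue.put(f"M{i}")
for _ in range(2):
    work_queue.put(None)            # one stop signal per consumer

threads = [threading.Thread(target=consumer, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results["A"] + results["B"]))   # ['M1', 'M2', 'M3', 'M4', 'M5', 'M6']
```

How the six messages split between A and B depends on thread scheduling, but the union is always all six, each consumed once.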

Publish/Subscribe (Topic)

In the pub/sub model, a message is published to a topic, and every subscriber to that topic receives a copy of the message. The publisher doesn't know or care how many subscribers exist.

Publisher --> [Topic: order-created]
                |--> Notification Service (sends email)
                |--> Inventory Service   (reserves stock)
                |--> Analytics Service   (records metrics)

Use cases: event-driven architectures, broadcasting state changes, fan-out to multiple downstream services.

Fan-out Pattern

Fan-out combines pub/sub with point-to-point. A message is published to a topic, which fans out to multiple queues, each with its own competing consumers. This gives you both broadcast delivery and independent scaling per consumer group.
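A minimal in-process sketch of the fan-out idea, with a dict of topic subscriptions standing in for the broker. Each bound queue gets its own copy of the message; attaching several workers to one queue would add competing consumers within that group:

```python
from collections import defaultdict
from queue import Queue

subscriptions = defaultdict(list)      # topic -> queues bound to it

def subscribe(topic, q):
    subscriptions[topic].append(q)

def publish(topic, message):
    for q in subscriptions[topic]:     # every bound queue gets its own copy
        q.put(message)

notification_q, inventory_q, analytics_q = Queue(), Queue(), Queue()
for q in (notification_q, inventory_q, analytics_q):
    subscribe("order-created", q)

publish("order-created", {"order_id": 42})

# Each downstream service drains its own queue independently, at its own pace
copies = [q.get() for q in (notification_q, inventory_q, analytics_q)]
print(copies)
```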

Comparison

| Feature | Point-to-Point | Pub/Sub |
|---|---|---|
| Consumers per message | Exactly one | All subscribers |
| Message lifetime | Deleted after ack | Retained per retention policy |
| Use case | Task distribution, work queues | Event broadcasting, fan-out |
| Coupling | Producer knows queue exists | Publisher knows only the topic |
| Examples | SQS, RabbitMQ queues | SNS, Kafka topics, RabbitMQ fanout exchange |

Apache Kafka Deep Dive

Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day. Unlike traditional message queues, Kafka is designed as a distributed commit log—messages are persisted to disk and can be replayed.

Core Concepts

- Topic: a named stream of records, split into partitions
- Partition: an ordered, append-only log; the unit of parallelism
- Offset: a monotonically increasing sequence number identifying each record within a partition
- Broker: a server that stores partitions; a cluster is made up of many brokers
- Producer / Consumer: clients that write records to and read records from topics
- Consumer group: a set of consumers that split a topic's partitions among themselves; each partition is read by exactly one member
- Replication: each partition has one leader and several followers for fault tolerance

Write Path

Producer
  |
  |--> Serialize record (key, value, timestamp, headers)
  |--> Determine partition (hash(key) % num_partitions, or round-robin)
  |--> Send to partition leader broker
  |
Leader Broker
  |--> Append to partition log segment on disk (sequential I/O)
  |--> Replicate to follower brokers (in-sync replicas / ISR)
  |--> Once min.insync.replicas have acknowledged → ack to producer
  |
Producer receives ack (based on acks setting):
  acks=0  → fire and forget (no ack waited)
  acks=1  → leader wrote it (fast, slight risk)
  acks=all → all ISR replicas wrote it (safest)

Read Path

Consumer (in a Consumer Group)
  |
  |--> Assigned partitions by Group Coordinator
  |--> Fetches records starting from committed offset
  |--> Processes batch of records
  |--> Commits offset (auto-commit or manual)
  |
Offset Storage: __consumer_offsets internal topic
  |--> Tracks (group, topic, partition) → offset
  |--> Enables restart from last committed position

Ordering Guarantees

Kafka guarantees ordering within a single partition only. If you need all events for a given entity (e.g., user_id=42) to be in order, use that entity's ID as the message key. All records with the same key hash to the same partition (the default partitioner applies a murmur2 hash to the key, modulo the partition count).

Design Tip: Choose partition keys carefully. Using user_id as a key ensures all events for a user are ordered. Using a random key maximizes throughput but destroys ordering. Hot keys (e.g., a single celebrity's user_id getting 90% of traffic) create partition skew.
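Key-based partitioning can be illustrated in a few lines of Python. Kafka's default partitioner actually uses murmur2; CRC32 stands in here to keep the sketch stdlib-only and deterministic (Python's built-in hash() is salted per process, so it is not a suitable stand-in):

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    # Deterministic hash of the key, modulo the partition count
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Same key -> same partition -> per-entity ordering is preserved
assert partition_for("user-42") == partition_for("user-42")

# Many distinct keys spread across partitions for parallelism
used = {partition_for(f"user-{i}") for i in range(1000)}
print(f"user-42 -> partition {partition_for('user-42')}, "
      f"{len(used)} of {NUM_PARTITIONS} partitions used")
```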

Retention Policies

Unlike traditional queues, Kafka doesn't delete messages after consumption. Messages are retained based on:

- Time: log.retention.hours (168 hours = 7 days by default)
- Size: log.retention.bytes caps the size of each partition's log
- Compaction: with cleanup.policy=compact, Kafka keeps only the latest record per key

Production Configuration

# Broker configuration (server.properties)
num.partitions=12
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
log.retention.hours=168
log.segment.bytes=1073741824
num.io.threads=8
num.network.threads=3
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400

# Producer configuration
acks=all
retries=2147483647
max.in.flight.requests.per.connection=5
enable.idempotence=true
compression.type=lz4
batch.size=65536
linger.ms=5
buffer.memory=33554432

# Consumer configuration
enable.auto.commit=false
auto.offset.reset=earliest
max.poll.records=500
max.poll.interval.ms=300000
session.timeout.ms=30000
heartbeat.interval.ms=10000

Performance Characteristics

Kafka achieves extraordinary throughput through several design choices:

- Sequential disk I/O: records are appended to the log, avoiding random seeks
- Zero-copy transfer: data moves from the page cache to the network socket via sendfile(), skipping user space
- Batching: producers accumulate records into per-partition batches (batch.size, linger.ms)
- Compression: whole batches are compressed with lz4, zstd, snappy, or gzip
- Page cache: consumers reading near the tail of the log are served from OS memory, not disk

A single Kafka broker can handle hundreds of MB/s of writes and reads. LinkedIn's Kafka clusters process 7+ trillion messages per day.

[Interactive demo: Kafka Partitioning & Consumer Groups. See how messages are routed to partitions by key and consumed in parallel by a consumer group.]

RabbitMQ

RabbitMQ is a traditional message broker implementing the Advanced Message Queuing Protocol (AMQP). Unlike Kafka's log-based approach, RabbitMQ is a smart broker / dumb consumer model: the broker manages message routing, delivery, and acknowledgment.

AMQP Architecture

RabbitMQ introduces the concept of exchanges—routers that sit between producers and queues:

Producer → Exchange → Binding → Queue → Consumer
              ↑                    ↑
         routing logic        binding key/pattern

Exchange Types

| Exchange | Routing Logic | Example Use Case |
|---|---|---|
| Direct | Message routing key must exactly match the binding key | Route payment.success to payment queue |
| Topic | Routing key matches a pattern with * (one word) and # (zero or more words) | order.*.created matches order.us.created |
| Fanout | Ignores routing key; broadcasts to all bound queues | Broadcast order events to all services |
| Headers | Routes based on message header attributes (key-value pairs) instead of routing key | Route by content-type, priority, or custom headers |
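The topic exchange's pattern matching can be sketched in Python (* matches exactly one dot-separated word, # matches zero or more):

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """AMQP-style topic match: '*' = exactly one word, '#' = zero or more words."""
    def match(pat, key):
        if not pat:
            return not key                  # both exhausted -> match
        head, rest = pat[0], pat[1:]
        if head == "#":
            # '#' may swallow zero or more words
            return any(match(rest, key[i:]) for i in range(len(key) + 1))
        if not key:
            return False
        if head == "*" or head == key[0]:   # '*' matches any single word
            return match(rest, key[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))

print(topic_matches("order.*.created", "order.us.created"))   # True
print(topic_matches("order.#", "order.us.created"))           # True
print(topic_matches("order.*", "order.us.created"))           # False
```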

Key Mechanisms

- Acknowledgments: consumers ack or nack each message; unacknowledged messages are redelivered
- Prefetch (QoS): caps the number of unacknowledged messages per consumer to prevent overload
- Durability: durable exchanges and queues plus persistent messages survive broker restarts
- Dead letter exchange (DLX): rejected or expired messages are rerouted for later inspection
- TTL and length limits: per-message or per-queue expiry and a maximum queue length

RabbitMQ Configuration Example

# Python with pika library
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters('rabbitmq-host', 5672,
        credentials=pika.PlainCredentials('user', 'pass'))
)
channel = connection.channel()

# Declare durable exchange and queue
channel.exchange_declare(
    exchange='orders',
    exchange_type='topic',
    durable=True
)
channel.queue_declare(
    queue='notification-service',
    durable=True,
    arguments={
        'x-dead-letter-exchange': 'orders-dlx',
        'x-message-ttl': 300000,       # 5 min TTL
        'x-max-length': 100000         # max 100k messages
    }
)
channel.queue_bind(
    queue='notification-service',
    exchange='orders',
    routing_key='order.*.created'       # topic pattern
)

# Consumer with manual ack
channel.basic_qos(prefetch_count=10)

def process_order(body):
    ...  # application-specific handling goes here

def callback(ch, method, properties, body):
    try:
        process_order(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=False routes the message to the DLX instead of retrying forever
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

channel.basic_consume(
    queue='notification-service',
    on_message_callback=callback
)
channel.start_consuming()   # blocks, dispatching messages to callback

When to Use RabbitMQ vs Kafka

| Criteria | RabbitMQ | Kafka |
|---|---|---|
| Model | Smart broker, dumb consumer | Dumb broker, smart consumer |
| Throughput | ~50k msg/s per node | ~1M+ msg/s per broker |
| Retention | Messages deleted after ack | Retained for configurable period |
| Replay | Not supported | Full replay by resetting offset |
| Routing | Rich (exchanges, bindings, patterns) | Simple (topics, partitions) |
| Protocol | AMQP, STOMP, MQTT | Custom binary protocol |
| Best for | Complex routing, task queues, RPC | Event streaming, log aggregation, high throughput |

AWS SQS & SNS

AWS provides fully managed messaging services that eliminate the operational overhead of running Kafka or RabbitMQ clusters yourself.

Amazon SQS (Simple Queue Service)

SQS is a fully managed, serverless message queue. You don't provision brokers, manage partitions, or worry about replication—AWS handles all of it.

Standard vs FIFO Queues

| Feature | Standard | FIFO |
|---|---|---|
| Throughput | Nearly unlimited | 300 msg/s (3,000 with batching) |
| Ordering | Best-effort (may be out of order) | Strict FIFO within message group |
| Delivery | At-least-once (may duplicate) | Exactly-once processing |
| Queue name | Any name | Must end in .fifo |

Key SQS Concepts

- Visibility timeout: a received message is hidden from other consumers until it is deleted or the timeout expires, after which it is redelivered
- Long polling: ReceiveMessage can wait up to 20 seconds for messages, reducing empty responses and cost
- Dead-letter queue: a redrive policy moves messages to a DLQ after maxReceiveCount failed receives
- Message groups (FIFO): ordering is guaranteed only within a MessageGroupId
- Deduplication (FIFO): duplicates are rejected within a 5-minute window, via content hash or an explicit deduplication ID

# AWS CLI: Create FIFO queue with DLQ
aws sqs create-queue \
  --queue-name orders.fifo \
  --attributes '{
    "FifoQueue": "true",
    "ContentBasedDeduplication": "true",
    "VisibilityTimeout": "60",
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789:orders-dlq.fifo\",\"maxReceiveCount\":\"3\"}"
  }'

Amazon SNS (Simple Notification Service)

SNS is a fully managed pub/sub service. You publish a message to a topic, and SNS delivers it to all subscribed endpoints: SQS queues, Lambda functions, HTTP endpoints, email, SMS, or mobile push.

SNS + SQS: The Fan-out Pattern

The most common production pattern combines SNS and SQS. An event is published to an SNS topic, which fans out to multiple SQS queues. Each queue has its own consumer(s) that process independently and at their own pace.

Order Service
  |
  |--> Publish to SNS topic "order-events"
         |
         |--> SQS: notification-queue  --> Notification Service
         |--> SQS: inventory-queue     --> Inventory Service
         |--> SQS: analytics-queue     --> Analytics Service
         |--> Lambda: fraud-check      --> Fraud Detection

This pattern gives you the best of both worlds: broadcast delivery (SNS) with independent, reliable, buffered processing (SQS).

Delivery Guarantees

One of the most important concepts in messaging systems is the delivery guarantee. There are three levels, each with different trade-offs:

At-Most-Once

Fire and forget. The producer sends the message and doesn't wait for an ack. The message may be lost if the broker or network fails. No retries.

// Kafka: acks=0 → at-most-once
props.put("acks", "0");      // don't wait for broker ack
props.put("retries", 0);     // no retries

Use case: Metrics collection, logging, clickstream where losing a few events is acceptable.

At-Least-Once

The producer retries until it receives an acknowledgment. The message is guaranteed to be delivered, but may be delivered multiple times (duplicates). This is the most common guarantee in production systems.

// Kafka: acks=all + retries → at-least-once
props.put("acks", "all");
props.put("retries", Integer.MAX_VALUE);
props.put("max.in.flight.requests.per.connection", "5");

Challenge: Consumers must be idempotent—processing the same message twice must produce the same result.

Exactly-Once

The hardest guarantee. Each message is delivered and processed exactly once, with no duplicates and no losses. Kafka achieves this through idempotent producers and transactions:

// Kafka exactly-once semantics (EOS)
props.put("enable.idempotence", "true");         // dedup at broker
props.put("transactional.id", "order-processor"); // enable transactions

// Producer transaction
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("output-topic", key, value));
    // Commit consumer offsets AND producer records atomically
    producer.sendOffsetsToTransaction(offsets, consumerGroupId);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

Reality Check: True exactly-once is only achievable within Kafka (Kafka-to-Kafka). Once you interact with external systems (databases, APIs), you're back to at-least-once with idempotent consumers. Design for at-least-once and make your consumers idempotent.

Idempotent Consumer Strategies

- Deduplication store: record processed message IDs (database table, Redis set) and skip messages already seen
- Natural idempotency: prefer absolute writes ("set status to SHIPPED") over relative ones ("increment counter")
- Conditional writes: compare-and-set on a version number or offset before applying an update
- Idempotency keys: pass a unique key to downstream APIs so their retries become no-ops
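A minimal sketch of the deduplication-store strategy, using an in-memory set; a production system would use a database table or Redis, ideally updated in the same transaction as the side effect:

```python
processed_ids = set()   # in production: a DB table or Redis set
balance = 0

def handle(message):
    """Idempotent consumer: a redelivered message is a no-op."""
    global balance
    if message["id"] in processed_ids:
        return                           # duplicate: effect already applied
    balance += message["amount"]
    processed_ids.add(message["id"])     # record AFTER applying the effect

msg = {"id": "payment-123", "amount": 50}
handle(msg)
handle(msg)        # at-least-once delivery: the same message arrives twice
print(balance)     # 50, not 100
```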

Backpressure

Backpressure occurs when consumers can't keep up with the rate of incoming messages. The queue grows unboundedly, consuming memory and disk until the system fails. Handling backpressure is critical for production systems.

Symptoms

- Consumer lag grows steadily instead of hovering near zero
- Queue depth, broker memory, or disk usage keeps climbing
- End-to-end latency rises even though consumers report no errors
- Messages expire (TTL) before they are ever processed

Strategies

- Scale consumers horizontally (more instances; in Kafka, more partitions)
- Bound the queue and shed or reject new messages when it fills
- Rate-limit or throttle producers
- Process in batches to amortize per-message overhead
- Drop or sample low-value messages under load
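One common strategy, bounding the queue and shedding excess load instead of letting it grow without limit, in a minimal Python sketch:

```python
import queue

# Bounded queue: when full, reject new work instead of growing unboundedly
inbox = queue.Queue(maxsize=3)
accepted, shed = [], []

for i in range(5):
    try:
        inbox.put_nowait(f"req-{i}")   # raises queue.Full when at capacity
        accepted.append(f"req-{i}")
    except queue.Full:
        shed.append(f"req-{i}")        # caller gets an error and can retry later

print(accepted)   # ['req-0', 'req-1', 'req-2']
print(shed)       # ['req-3', 'req-4']
```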

# Monitoring Kafka consumer lag
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group order-processor \
  --describe

# Output:
# GROUP          TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processor orders        0          15234           15240           6
# order-processor orders        1          18901           21500           2599  ← problem!
# order-processor orders        2          17003           17003           0

Alert on lag, not just errors. A consumer might be "working fine" with zero errors but falling behind. Monitor consumer lag (difference between latest offset and consumer's committed offset) and alert when it exceeds a threshold.

Dead Letter Queues

A Dead Letter Queue (DLQ) is a special queue where messages go when they can't be processed successfully. Think of it as the "undeliverable mail" bin at the post office.

When Messages Go to DLQ

- The maximum receive/retry count is exceeded (e.g., SQS maxReceiveCount)
- A consumer rejects the message without requeueing (nack with requeue=false)
- The message's TTL expires or the queue's length limit is exceeded
- The message is a "poison" message that fails deserialization or validation on every attempt

Retry Strategy: Exponential Backoff

Don't retry immediately after a failure—the downstream system is probably still broken. Use exponential backoff with jitter:

// Retry delays: 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped)
function getRetryDelay(attempt) {
    const baseDelay = 1000;          // 1 second
    const maxDelay  = 60000;         // 60 seconds cap
    const exponential = baseDelay * Math.pow(2, attempt);
    const capped = Math.min(exponential, maxDelay);
    const jitter = capped * (0.5 + Math.random() * 0.5);  // 50-100%
    return jitter;
}

// Multi-stage retry with separate queues:
// main-queue → retry-queue-1 (1 min delay)
//            → retry-queue-2 (5 min delay)
//            → retry-queue-3 (30 min delay)
//            → dead-letter-queue (manual intervention)

Handling DLQ Messages

- Alert when the DLQ is non-empty; a growing DLQ usually signals a systemic failure
- Inspect messages to classify the cause: poison message vs. downstream outage
- After fixing the bug or dependency, redrive messages back to the main queue
- Archive or discard messages that are no longer actionable

Event Streaming vs Message Queuing

These terms are often used interchangeably, but they represent fundamentally different paradigms:

| Dimension | Message Queue | Event Stream |
|---|---|---|
| Data model | Message (command or task) | Event (immutable fact) |
| After consumption | Deleted from queue | Retained in log |
| Replay | Not possible | Reset offset to replay |
| Ordering | FIFO within queue | Ordered within partition |
| Consumer groups | Competing consumers (one gets it) | Each group gets all messages |
| Backpressure | Queue grows, may need DLQ | Consumer controls pace via offset |
| Examples | RabbitMQ, SQS, ActiveMQ | Kafka, Kinesis, Pulsar, Redpanda |

When to Use Each

- Message queue: task distribution and background jobs, where each message is a command to be processed once and then discarded
- Event stream: event sourcing, audit logs, analytics, and any case with multiple independent consumers or a need to replay history after a bug fix

Stream Processing

Event streams enable real-time stream processing—continuously transforming, filtering, aggregating, and joining streams of events.

// Kafka Streams: Real-time order aggregation
StreamsBuilder builder = new StreamsBuilder();

KStream<String, Order> orders = builder.stream("orders");

// Count orders per product in 5-minute windows
KTable<Windowed<String>, Long> productCounts = orders
    .groupBy((key, order) -> order.getProductId())
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count(Materialized.as("product-counts"));

// Write results to output topic
productCounts.toStream()
    .map((windowedKey, count) -> KeyValue.pair(
        windowedKey.key(),
        windowedKey.window().start() + ": " + count))
    .to("product-order-counts");

[Interactive demo: Pub/Sub Fan-out: Order Processing. Watch a single order event fan out to multiple services through a topic.]

Message Queues in System Design

Message queues appear in almost every system design interview. Here's how to use them effectively.

When to Introduce a Queue

- Work can finish after the response is sent (emails, thumbnails, receipts)
- Traffic is bursty and downstream services can't scale as fast as the front end
- One event must trigger several independent actions (fan-out)
- A downstream dependency is slow or unreliable and needs buffering and retries

Common Interview Patterns

| System | Queue Use Case | Technology Choice |
|---|---|---|
| URL shortener | Analytics event streaming (click tracking) | Kafka (high throughput, replay) |
| E-commerce | Order processing pipeline (payment, inventory, shipping) | SNS + SQS (fan-out + reliable delivery) |
| Chat application | Message delivery to offline users | SQS per user or Kafka per-user partition |
| Notification system | Enqueue and prioritize notifications | RabbitMQ (priority queues, routing) |
| Log aggregation | Collect logs from all services for centralized processing | Kafka (high volume, retention) |
| Rate limiter | Smooth out bursty request patterns | SQS + Lambda (serverless scaling) |

Sizing Queues

In an interview, you'll need to estimate queue capacity:

// Example: E-commerce order queue sizing
Orders per day:     1,000,000
Average orders/sec: 1,000,000 / 86,400 ≈ 11.6/sec
Peak multiplier:    10x (flash sale)
Peak orders/sec:    11.6 × 10 ≈ 116/sec
Avg message size:   2 KB (JSON payload)

Throughput needed:  116 × 2 KB ≈ 232 KB/s
Daily storage:      1M × 2 KB = 2 GB
With 7-day retention (Kafka): 14 GB (42 GB with 3x replication)
Partitions needed:  ~12 (for parallelism across 12 consumers)

// SQS: No sizing needed (serverless, auto-scales)
// Kafka: 12 partitions × 3 replicas = 36 partition replicas
//        across a 3-broker cluster (12 per broker)
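These estimates are easy to sanity-check in code. Note that 1M orders/day averages about 11.6/sec, so a 10x flash-sale peak is roughly 116/sec:

```python
# Back-of-envelope queue sizing
orders_per_day  = 1_000_000
seconds_per_day = 86_400
peak_multiplier = 10
msg_size_kb     = 2

avg_per_sec     = orders_per_day / seconds_per_day        # ~11.6/sec
peak_per_sec    = avg_per_sec * peak_multiplier           # ~116/sec
throughput_kb_s = peak_per_sec * msg_size_kb              # ~232 KB/s

daily_storage_gb = orders_per_day * msg_size_kb / 1_000_000   # 2 GB/day
retention_gb     = daily_storage_gb * 7                       # 14 GB (7-day retention)
replicated_gb    = retention_gb * 3                           # 42 GB with 3x replication

print(f"peak: {peak_per_sec:.0f}/s, {throughput_kb_s:.0f} KB/s, "
      f"storage: {retention_gb:.0f} GB ({replicated_gb:.0f} GB replicated)")
```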

Monitoring Essentials

- Queue depth / consumer lag, and whether it is trending up
- Age of the oldest unprocessed message
- DLQ depth (alert on non-empty)
- Publish and consume throughput
- Consumer error and retry rates

System Design Checklist: When you introduce a queue in an interview, always discuss: (1) which technology and why, (2) delivery guarantee needed, (3) ordering requirements, (4) what happens when the consumer fails (DLQ strategy), (5) how you'll monitor queue health, and (6) scaling plan for consumers.