High Level Design Series · Architecture Patterns · Part 1· Post 37 of 70

Microservices Architecture

April 2026 · 22 min read

The Monolith — Where Every System Begins

A monolith is a single deployable unit where all business logic lives in one codebase, one process, and one database. It’s not inherently bad — every successful company started with one:

Company	Original Monolith	When They Split	Trigger
Amazon	Perl/C++ single app (Obidos)	~2002	Deployment took 8+ hours; teams blocked each other
Netflix	Java monolith on Oracle DB	2008–2012	3-day database corruption outage
Twitter	Ruby on Rails “fail whale”	2010–2013	Couldn’t handle World Cup traffic spikes
Uber	PHP monolith	2014–2016	Dispatch logic coupled to payment, GPS, and notifications
Shopify	Rails monolith (still monolith!)	Ongoing modularisation	Chose modular monolith over full microservices

Key insight: The monolith isn’t the problem — a poorly structured monolith is the problem. A well-modularised monolith (like Shopify’s) can scale to billions of dollars in GMV. The question is: at what point does the organisational pain of a monolith exceed the operational pain of distributed systems?

Monolith Pain Points That Drive Migration

# The classic monolith deployment problem:
# 1. Developer changes 3 lines in the payment module
# 2. Entire application must be rebuilt (45 min)
# 3. Full test suite runs (2+ hours)
# 4. Deployment window: Saturday 2 AM
# 5. QA must re-verify search, inventory, and recommendations
#    even though they didn't change
# 6. If payment breaks, rollback rolls back EVERYTHING

# Merge conflicts: 200+ developers, 1 repo, 1 branch strategy
$ git pull origin main
# CONFLICT: 47 files changed by 12 different teams
# "Who changed the User model AGAIN?"

# Scaling nightmare:
# Search needs 16 CPU cores but only 2 GB RAM
# Image processing needs 2 cores but 64 GB RAM
# But they're the same binary — you can't scale them independently
# So you run 20 instances of the ENTIRE app on 64 GB machines
# just because ONE module needs memory

What Are Microservices?

Microservices is an architectural style where a system is composed of small, independently deployable services, each owning a specific business capability, communicating over well-defined APIs.

The key properties are:

Single responsibility: Each service does one thing well (orders, payments, inventory, search).
Independent deployment: Change and deploy the payment service without touching search.
Own data store: Each service owns its database. No shared databases.
Technology heterogeneity: Search in Elasticsearch, payments in Java, recommendations in Python — each team picks what’s best.
Decentralised governance: No central architecture committee dictating frameworks.
Designed for failure: Every network call can fail. Build resilience in from day one.

# Microservices architecture for an e-commerce platform:
#
#  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
#  │  API     │  │  Web     │  │  Mobile  │  │  Partner │
#  │  Gateway │  │  App     │  │  App     │  │  API     │
#  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
#       │              │              │              │
#  ─────┴──────────────┴──────────────┴──────────────┴─────
#                    API Gateway / Load Balancer
#  ────────────────────────────────────────────────────────
#       │         │         │         │         │
#  ┌────▼───┐ ┌───▼────┐ ┌──▼───┐ ┌──▼───┐ ┌───▼────┐
#  │ User   │ │ Product│ │Order │ │Pay-  │ │ Search │
#  │Service │ │Service │ │Svc   │ │ment  │ │Service │
#  │(Go)    │ │(Java)  │ │(Java)│ │(Java)│ │(Python)│
#  └───┬────┘ └───┬────┘ └──┬───┘ └──┬───┘ └───┬────┘
#      │          │          │        │         │
#  ┌───▼───┐ ┌───▼───┐ ┌───▼──┐ ┌───▼──┐ ┌───▼────────┐
#  │Postgres│ │Postgres│ │MySQL │ │MySQL │ │Elasticsearch│
#  └───────┘ └───────┘ └──────┘ └──────┘ └────────────┘

Service Boundaries & Bounded Contexts

The hardest part of microservices isn’t the technology — it’s drawing the right boundaries. Get them wrong and you’ll build a distributed monolith: all the complexity of distributed systems with none of the benefits.

Domain-Driven Design (DDD) provides the conceptual framework. A bounded context is a boundary within which a particular domain model is defined and applicable.

Example: E-Commerce Bounded Contexts

# The word "Product" means different things in different contexts:
#
# Catalog Context:
#   Product = name, description, images, categories, SEO metadata
#   Operations: browse, search, filter, compare
#
# Inventory Context:
#   Product = SKU, warehouse location, quantity, reorder threshold
#   Operations: reserve, restock, count, transfer between warehouses
#
# Pricing Context:
#   Product = base price, discounts, tax rules, currency conversions
#   Operations: calculate price, apply coupon, dynamic pricing
#
# Shipping Context:
#   Product = weight, dimensions, fragility flag, hazmat classification
#   Operations: calculate shipping cost, estimate delivery date

# WRONG: One "Product" service that handles everything
# RIGHT: Four services, each with its own "Product" model

# Each context has its OWN representation:
class CatalogProduct:
    id: str
    name: str
    description: str
    images: list[str]
    categories: list[str]

class InventoryItem:
    sku: str
    product_id: str           # reference to catalog
    warehouse_id: str
    quantity_available: int
    quantity_reserved: int
    reorder_point: int

class PricingEntry:
    product_id: str           # reference to catalog
    base_price: Decimal
    currency: str
    discount_rules: list[DiscountRule]
    tax_category: str

class ShippableItem:
    product_id: str           # reference to catalog
    weight_kg: float
    dimensions_cm: tuple[float, float, float]
    is_fragile: bool
    hazmat_class: str | None

How to Identify Boundaries

Event Storming: Gather domain experts and developers in a room. Write every domain event on sticky notes (“Order Placed”, “Payment Received”, “Item Shipped”). Group related events — each group suggests a bounded context.
Linguistic boundaries: When the same word means different things to different teams, you’ve found a boundary. “Account” means “user profile” to the identity team and “billing entity” to finance.
Change cadence: Features that change together should live together. If search changes weekly but shipping rules change quarterly, they’re separate contexts.
Data ownership: Who is the authoritative source for this data? Whoever writes it, owns the service. Everybody else reads a copy or calls an API.

The Distributed Monolith Test: If you can’t deploy service A without also deploying services B and C, you have a distributed monolith. Signs: lock-step releases, shared databases, synchronous chains of 5+ calls, shared libraries with business logic.

Migration Visualised: Monolith → Microservices

Step through the progressive extraction of a monolith into independent microservices.

▶ Monolith to Microservices

Communication: Sync vs Async

Services must communicate. The choice between synchronous and asynchronous communication is one of the most consequential architectural decisions.

Synchronous Communication (Request/Response)

Service A sends a request to Service B and waits for the response. The caller is blocked until the response arrives (or a timeout fires).

Protocol	Format	Use Case	Latency
REST/HTTP	JSON	Public APIs, CRUD, browser-facing	1–50 ms (intra-DC)
gRPC	Protocol Buffers (binary)	Internal service-to-service, streaming	0.5–10 ms
GraphQL	JSON	API gateway aggregation, mobile clients	Variable (depends on resolvers)

# REST example — Order Service calls Payment Service
import requests

class OrderService:
    def place_order(self, order):
        # SYNCHRONOUS: we block here waiting for payment
        response = requests.post(
            "http://payment-service:8080/api/v1/charges",
            json={
                "order_id": order.id,
                "amount": order.total,
                "currency": "USD",
                "customer_id": order.customer_id
            },
            timeout=5  # 5 second timeout — CRITICAL
        )
        if response.status_code == 201:
            order.status = "CONFIRMED"
            # What if the DB write fails here?
            # Payment was charged but order not confirmed!
        else:
            order.status = "PAYMENT_FAILED"

# gRPC example — much faster, strongly typed
# payment.proto
# syntax = "proto3";
# service PaymentService {
#   rpc ChargeCustomer(ChargeRequest) returns (ChargeResponse);
#   rpc StreamPayments(PaymentFilter) returns (stream Payment);
# }
# message ChargeRequest {
#   string order_id = 1;
#   int64 amount_cents = 2;
#   string currency = 3;
#   string customer_id = 4;
# }

Asynchronous Communication (Event-Driven)

Service A publishes an event/message and continues immediately. Service B processes it later. The producer and consumer are decoupled in time.

# Event-driven: Order Service publishes, Payment Service subscribes
import json
from kafka import KafkaProducer, KafkaConsumer

# --- ORDER SERVICE (Producer) ---
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

class OrderService:
    def place_order(self, order):
        order.status = "PENDING_PAYMENT"
        self.save(order)

        # ASYNCHRONOUS: publish and continue — don't wait
        producer.send('order-events', {
            "event_type": "OrderPlaced",
            "order_id": order.id,
            "amount": str(order.total),
            "currency": "USD",
            "customer_id": order.customer_id,
            "timestamp": datetime.utcnow().isoformat()
        })
        # Order service is FREE to handle the next request immediately

# --- PAYMENT SERVICE (Consumer) ---
consumer = KafkaConsumer(
    'order-events',
    bootstrap_servers='kafka:9092',
    group_id='payment-service',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    event = message.value
    if event["event_type"] == "OrderPlaced":
        result = charge_customer(event["customer_id"], event["amount"])
        # Publish result as another event
        producer.send('payment-events', {
            "event_type": "PaymentProcessed",
            "order_id": event["order_id"],
            "status": "SUCCESS" if result.ok else "FAILED",
            "timestamp": datetime.utcnow().isoformat()
        })

Aspect	Synchronous	Asynchronous
Coupling	Temporal & spatial coupling	Decoupled (fire-and-forget)
Latency	Caller blocked until response	Caller returns immediately
Failure	Cascading failures (A→B→C all down)	Isolated (B can catch up later)
Consistency	Easier to achieve strong consistency	Eventual consistency
Debugging	Stack traces, straightforward	Distributed tracing required
Best for	Queries, reads, user-facing requests	Commands, writes, background processing

Communication Visualised

Compare how synchronous and asynchronous communication differ in timing, blocking, and throughput.

▶ Sync vs Async Communication

API Composition Pattern

In a monolith, displaying an order detail page is a single SQL JOIN. In microservices, data lives in separate services. The API Composition pattern solves this by having a composer service aggregate data from multiple sources.

# API Gateway / BFF (Backend For Frontend) — composes responses
class OrderDetailComposer:
    """
    Aggregates data from 4 services to build the order detail view.
    Uses asyncio to call all services concurrently — total latency
    is max(individual latencies), not sum.
    """
    async def get_order_detail(self, order_id: str) -> dict:
        # Fan-out: call all services in parallel
        order, customer, products, shipment = await asyncio.gather(
            self.order_service.get_order(order_id),
            self.customer_service.get_customer(order.customer_id),
            self.catalog_service.get_products(order.product_ids),
            self.shipping_service.get_shipment(order_id),
        )

        # Fan-in: compose the response
        return {
            "order_id": order.id,
            "status": order.status,
            "placed_at": order.created_at,
            "customer": {
                "name": customer.name,
                "email": customer.email
            },
            "items": [
                {
                    "product_name": p.name,
                    "quantity": order.items[p.id].quantity,
                    "price": p.price
                }
                for p in products
            ],
            "shipping": {
                "carrier": shipment.carrier,
                "tracking": shipment.tracking_number,
                "estimated_delivery": shipment.eta
            },
            "total": order.total
        }

# Latency comparison:
# Monolith: 1 SQL query = ~5 ms
# Microservices (sequential): 4 API calls × ~10ms = ~40 ms
# Microservices (parallel):   max(10, 8, 12, 9) = ~12 ms + overhead

API Gateway vs BFF: An API Gateway (Kong, AWS API Gateway) handles cross-cutting concerns: authentication, rate limiting, SSL termination, routing. A Backend-for-Frontend (BFF) is a custom composition layer per client type. Mobile BFF returns compact JSON; web BFF returns richer data. Netflix pioneered the BFF pattern with separate Java APIs for TV, mobile, and web.

Data Ownership: Database Per Service

The database-per-service pattern is the most critical (and most painful) principle of microservices. Each service owns its data. No other service can access that database directly — only through the owning service’s API.

# ✅ CORRECT: Each service owns its database
#
# Order Service → orders_db (MySQL)
#   Tables: orders, order_items, order_status_history
#   Only Order Service reads/writes these tables
#
# Payment Service → payments_db (PostgreSQL)
#   Tables: payments, refunds, payment_methods
#   Only Payment Service reads/writes these tables
#
# Inventory Service → inventory_db (PostgreSQL)
#   Tables: inventory, reservations, warehouses
#   Only Inventory Service reads/writes these tables

# ❌ ANTI-PATTERN: Shared database
#
# Order Service  ─┐
# Payment Service ─┼──→ shared_db (one big database)
# Inventory Service┘
#   All services read/write the same tables
#   Schema changes require coordinating ALL teams
#   No independent deployment — you have a distributed monolith

Why Shared Databases Are Toxic

Schema coupling: Changing a column in the users table requires coordinating every service that reads it. One team’s migration breaks another team’s queries.
Performance coupling: The analytics service runs a full table scan, locking rows that the order service needs. One slow query affects everyone.
Technology lock-in: Stuck on one database vendor because all services depend on PostgreSQL-specific features.
Deployment coupling: Can’t deploy independently if you need coordinated schema migrations.

Data Duplication Is Okay

In microservices, some data duplication is expected and healthy. The Order Service stores a snapshot of the product name and price at the time of purchase. If the Catalog Service later changes the product name, old orders still show the original name.

# Order Service stores a SNAPSHOT, not a reference
order_item = {
    "product_id": "prod-123",        # reference for linking
    "product_name": "Wireless Mouse", # snapshot — won't change
    "price_at_purchase": 29.99,       # snapshot — frozen in time
    "quantity": 2
}
# Even if the product is renamed to "Ergonomic Wireless Mouse"
# in the catalog, this order still shows "Wireless Mouse"

The Strangler Fig Pattern

Named after the strangler fig tree that grows around a host tree and eventually replaces it. This is the safest way to migrate from a monolith — incrementally, without a risky “big bang” rewrite.

# Phase 1: Intercept — Route traffic through a proxy
# All traffic goes through the proxy. Initially, 100% goes to monolith.
#
#   Client → [Proxy/API Gateway] → Monolith
#                                   (handles everything)

# Phase 2: Extract — Build new service for ONE feature
# Search is extracted first (high-value, loosely coupled)
#
#   Client → [Proxy/API Gateway] ─── /api/search → Search Service (new)
#                                └── /api/*      → Monolith (everything else)

# Phase 3: Migrate data — Dual-write or CDC
# Old search code in monolith still exists but receives no traffic.
# Use Change Data Capture (CDC) to sync data from monolith DB
# to the new Search service's Elasticsearch index.
#
#   Monolith DB ──CDC──→ Search Service (Elasticsearch)

# Phase 4: Repeat — Extract next service
#   Client → [Proxy] ─── /api/search   → Search Service
#                     ├── /api/payments → Payment Service (new)
#                     └── /api/*        → Monolith (shrinking)

# Phase 5: Decommission — Remove dead code from monolith
# After 6 months with zero traffic to monolith's search module:
# 1. Remove search code from monolith
# 2. Remove search tables from monolith DB
# 3. Shrink monolith's deployment resources
# The monolith gradually "dies" — strangled by the new services

Real-world timeline: Amazon’s migration took roughly 2 years (2002–2004). Netflix’s migration to AWS microservices took 4 years (2008–2012). Uber’s migration is still ongoing in some areas. Don’t expect to be done in a quarter. Plan for 1–3 years for a large monolith.

Service Mesh

When you have 50+ microservices, managing networking concerns (retries, timeouts, circuit breakers, mTLS, observability) in application code becomes unsustainable. A service mesh extracts this into infrastructure.

# Without service mesh — every service implements its own:
class PaymentClient:
    def charge(self, amount):
        retries = 3
        for attempt in range(retries):
            try:
                response = requests.post(
                    "http://payment-service:8080/charge",
                    json={"amount": amount},
                    timeout=5
                )
                if response.status_code == 503:
                    time.sleep(2 ** attempt)  # exponential backoff
                    continue
                return response.json()
            except requests.Timeout:
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)
        # Every client, in every service, implements this same pattern
        # 200 services × 10 clients each = 2000 retry implementations
        # each slightly different, each with its own bugs

# With service mesh (Istio/Linkerd) — networking is in the sidecar:
#
#  ┌─────────────────────────────────┐
#  │           Pod                    │
#  │  ┌──────────┐  ┌──────────────┐ │
#  │  │  App     │  │  Envoy Proxy │ │
#  │  │Container │──│  (Sidecar)   │ │
#  │  │          │  │  - mTLS      │ │
#  │  │ Simple   │  │  - Retries   │ │
#  │  │ HTTP     │  │  - Timeouts  │ │
#  │  │ calls    │  │  - Circuit   │ │
#  │  │          │  │    breaker   │ │
#  │  │          │  │  - Tracing   │ │
#  │  │          │  │  - Metrics   │ │
#  │  └──────────┘  └──────────────┘ │
#  └─────────────────────────────────┘
#
# App code becomes:
class PaymentClient:
    def charge(self, amount):
        response = requests.post(
            "http://payment-service:8080/charge",
            json={"amount": amount}
        )
        return response.json()
        # That's it. Envoy handles retries, timeouts, mTLS, etc.

# Istio VirtualService — configure retries in YAML, not code
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - route:
        - destination:
            host: payment-service
            port:
              number: 8080
      retries:
        attempts: 3
        perTryTimeout: 5s
        retryOn: 5xx,reset,connect-failure
      timeout: 15s

---
# Istio DestinationRule — circuit breaker configuration
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Distributed Tracing

In a monolith, a single stack trace shows the full request flow. In microservices, a single user request might touch 8 services. Distributed tracing (Jaeger, Zipkin, OpenTelemetry) stitches traces together across service boundaries.

# OpenTelemetry — instrument your services
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup (once at startup)
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(agent_host_name="jaeger", agent_port=6831)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer("order-service")

# Instrument your code
class OrderService:
    def place_order(self, request):
        with tracer.start_as_current_span("place_order") as span:
            span.set_attribute("order.customer_id", request.customer_id)
            span.set_attribute("order.item_count", len(request.items))

            # Each downstream call inherits the trace context
            with tracer.start_as_current_span("validate_inventory"):
                inventory = self.inventory_client.check(request.items)
                # Inventory service creates child spans automatically
                # via context propagation headers (traceparent)

            with tracer.start_as_current_span("process_payment"):
                payment = self.payment_client.charge(request.total)

            with tracer.start_as_current_span("create_order"):
                order = self.repository.save(request)

            return order

# What you see in Jaeger UI:
# ─── place_order (Order Service) ──────────── 145ms ────
#   ├── validate_inventory (Order Svc) ─ 12ms ─┐
#   │     └── check_stock (Inventory Svc) 10ms  │
#   ├── process_payment (Order Svc) ──── 98ms ─┐
#   │     ├── fraud_check (Payment Svc) -- 45ms │
#   │     └── charge_card (Payment Svc) -- 50ms │
#   └── create_order (Order Svc) ─────── 8ms   │
#         └── db_insert (Order Svc) ──── 5ms   │

The three pillars of observability in microservices: Logs (what happened), Metrics (how much/how often), Traces (the journey of a request). You need all three. OpenTelemetry unifies them under one SDK. Ship logs to ELK, metrics to Prometheus/Grafana, and traces to Jaeger.

Testing Microservices

Testing distributed systems is fundamentally harder than testing a monolith. The Testing Pyramid still applies, but new layers appear.

Contract Testing (Pact)

Consumer-driven contracts verify that a service provider meets the expectations of its consumers, without deploying both services together. The consumer writes a contract describing what it expects; the provider verifies it can fulfil that contract.

# Consumer side (Order Service expects this from Payment Service)
# order_service/tests/test_payment_contract.py
from pact import Consumer, Provider

pact = Consumer('OrderService').has_pact_with(Provider('PaymentService'))

def test_charge_customer():
    expected_body = {
        "payment_id": "pay-abc123",
        "status": "SUCCESS",
        "amount": 99.99
    }

    (pact
     .given("a valid customer with payment method")
     .upon_receiving("a charge request")
     .with_request("POST", "/api/v1/charges",
                    body={
                        "order_id": "ord-789",
                        "amount": 99.99,
                        "currency": "USD",
                        "customer_id": "cust-456"
                    })
     .will_respond_with(201, body=expected_body))

    with pact:
        # This test runs against a mock server
        result = PaymentClient("http://localhost:1234").charge(
            order_id="ord-789",
            amount=99.99,
            currency="USD",
            customer_id="cust-456"
        )
        assert result["status"] == "SUCCESS"

    # The pact file is published to a Pact Broker
    # Payment Service's CI pulls it and verifies:
    #   "Can I actually return what OrderService expects?"

# Provider side (Payment Service verifies the contract)
# payment_service/tests/test_verify_contracts.py
from pact import Verifier

def test_verify_order_service_contract():
    verifier = Verifier(
        provider='PaymentService',
        provider_base_url='http://localhost:8080'
    )
    output, _ = verifier.verify_pacts(
        './pacts/OrderService-PaymentService.json',
        provider_states_setup_url='http://localhost:8080/_pact/setup'
    )

The Testing Honeycomb

# Microservices testing layers (most to least):
#
#         ┌─────────────────┐
#         │   E2E Tests     │  Few — slow, flaky, expensive
#         │  (Cypress/       │  Test critical user journeys only
#         │   Playwright)   │
#         ├─────────────────┤
#         │  Integration    │  Test real DB, real message broker
#         │  Tests          │  Use Testcontainers for isolation
#         ├─────────────────┤
#         │  Contract Tests │  ← THE KEY LAYER for microservices
#         │  (Pact)         │  Verify inter-service compatibility
#         ├─────────────────┤
#         │  Component      │  Test one service in isolation
#         │  Tests          │  Mock external dependencies
#         ├─────────────────┤
#         │  Unit Tests     │  Fast, isolated, lots of them
#         │                 │  Domain logic, pure functions
#         └─────────────────┘
#
# Anti-pattern: relying on E2E tests to catch integration issues.
# A 200-service E2E test suite takes hours and breaks constantly.

Deployment: Containers & Kubernetes

Microservices and containers are natural partners. Each service is packaged as a Docker image and orchestrated by Kubernetes.

Dockerising a Service

# Dockerfile — multi-stage build for a Go service
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /order-service ./cmd/server

FROM alpine:3.19
RUN apk --no-cache add ca-certificates
COPY --from=builder /order-service /order-service
EXPOSE 8080
USER nobody
ENTRYPOINT ["/order-service"]

# Image size: ~15 MB (vs 1.2 GB for a monolith with all dependencies)
# Build time: ~30 seconds (vs 45 minutes for the full monolith)
# Startup time: ~2 seconds (vs 5 minutes for Spring Boot monolith)

Kubernetes Deployment

# k8s/order-service/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: ecommerce
  labels:
    app: order-service
    version: v2.3.1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: order-service
        version: v2.3.1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: order-service
      containers:
        - name: order-service
          image: registry.internal/ecommerce/order-service:v2.3.1
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: order-db-credentials
                  key: host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: order-db-credentials
                  key: password
            - name: KAFKA_BROKERS
              value: "kafka-0.kafka:9092,kafka-1.kafka:9092"
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: ecommerce
spec:
  selector:
    app: order-service
  ports:
    - port: 8080
      targetPort: 8080
      name: http
    - port: 9090
      targetPort: 9090
      name: metrics
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
  namespace: ecommerce
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

CI/CD Per Service

# .github/workflows/order-service.yml
name: Order Service CI/CD
on:
  push:
    paths:
      - 'services/order-service/**'   # Only triggers on changes to this service
      - '.github/workflows/order-service.yml'

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: orders_test
          POSTGRES_PASSWORD: test
        ports: ['5432:5432']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22' }
      - run: cd services/order-service && go test ./... -race -cover
      - name: Contract Tests
        run: |
          cd services/order-service
          go test ./contracts/... -tags=contract

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v5
        with:
          context: services/order-service
          push: true
          tags: |
            registry.internal/ecommerce/order-service:${{ github.sha }}
            registry.internal/ecommerce/order-service:latest

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          kubectl set image deployment/order-service \
            order-service=registry.internal/ecommerce/order-service:${{ github.sha }} \
            -n ecommerce
          kubectl rollout status deployment/order-service -n ecommerce --timeout=300s

The Hard Parts: Challenges & Trade-offs

Distributed Transactions

In a monolith, a single database transaction guarantees ACID. With database-per-service, there’s no global transaction. The Saga pattern coordinates multi-service operations with compensating transactions.

# Saga: Place Order (Choreography-based)
#
# Happy path:
# 1. Order Service:   CREATE order (status=PENDING)
#    → publishes "OrderCreated"
# 2. Inventory Service: RESERVE items
#    → publishes "InventoryReserved"
# 3. Payment Service:  CHARGE customer
#    → publishes "PaymentProcessed"
# 4. Order Service:   UPDATE order (status=CONFIRMED)
#    → publishes "OrderConfirmed"
# 5. Shipping Service: CREATE shipment
#
# Failure at step 3 (payment fails):
# 3. Payment Service:  CHARGE fails
#    → publishes "PaymentFailed"
# 4. Inventory Service: RELEASE reserved items (compensating transaction)
#    → publishes "InventoryReleased"
# 5. Order Service:   UPDATE order (status=CANCELLED)
#    → publishes "OrderCancelled"
# 6. Notification Service: EMAIL customer "order failed"

# Orchestration-based Saga (Order Saga Orchestrator)
class OrderSagaOrchestrator:
    """Central coordinator that drives the saga steps."""

    async def execute(self, order_request):
        saga_id = generate_id()
        try:
            # Step 1: Create order
            order = await self.order_service.create(order_request)
            self.log_step(saga_id, "ORDER_CREATED", order.id)

            # Step 2: Reserve inventory
            reservation = await self.inventory_service.reserve(
                order.items
            )
            self.log_step(saga_id, "INVENTORY_RESERVED", reservation.id)

            # Step 3: Process payment
            payment = await self.payment_service.charge(
                order.customer_id, order.total
            )
            self.log_step(saga_id, "PAYMENT_PROCESSED", payment.id)

            # Step 4: Confirm order
            await self.order_service.confirm(order.id)
            self.log_step(saga_id, "ORDER_CONFIRMED", order.id)

            return order

        except PaymentFailedError:
            # Compensate: release inventory
            await self.inventory_service.release(reservation.id)
            await self.order_service.cancel(order.id)
            self.log_step(saga_id, "SAGA_COMPENSATED", "payment_failed")
            raise

        except InventoryInsufficientError:
            # Compensate: cancel order
            await self.order_service.cancel(order.id)
            self.log_step(saga_id, "SAGA_COMPENSATED", "no_inventory")
            raise

Data Consistency

Microservices embrace eventual consistency. Data will converge to a consistent state, but there’s a window where different services have different views. Techniques:

Event sourcing: Store events instead of current state. Replay events to rebuild state. Guarantees audit trail and temporal queries.
CQRS: Separate read and write models. Writes go to the source-of-truth service; reads come from materialised views optimised for queries.
Outbox pattern: Write the event to a local “outbox” table in the same DB transaction as the business data. A separate process polls the outbox and publishes to the message broker. Guarantees at-least-once delivery.

# Outbox Pattern — guarantees event delivery
class OrderService:
    def place_order(self, order_request):
        with self.db.transaction():
            # Business write
            order = Order.create(order_request)
            self.db.insert("orders", order)

            # Event write — SAME transaction
            event = {
                "id": uuid4(),
                "aggregate_type": "Order",
                "aggregate_id": order.id,
                "event_type": "OrderPlaced",
                "payload": order.to_dict(),
                "created_at": datetime.utcnow()
            }
            self.db.insert("outbox", event)
            # Both succeed or both fail — no dual-write problem

# Outbox Relay (separate process or Debezium CDC)
class OutboxRelay:
    def poll_and_publish(self):
        events = self.db.query(
            "SELECT * FROM outbox WHERE published = FALSE "
            "ORDER BY created_at LIMIT 100"
        )
        for event in events:
            self.kafka.send(event["event_type"], event["payload"])
            self.db.update("outbox",
                           {"published": True},
                           {"id": event["id"]})

Operational Complexity

Microservices trade code complexity for operational complexity:

Monolith	Microservices
1 repo	50–500 repos (or a monorepo)
1 CI pipeline	50–500 CI pipelines
1 deployment	50–500 deployments per day
1 log file	Centralised logging (ELK/Loki) required
Stack traces	Distributed tracing (Jaeger) required
Local dev: `./run.sh`	Local dev: Docker Compose with 15 services + Kafka + DBs
Hire backend devs	Hire platform/DevOps/SRE team

Network Failures & Resilience

In a monolith, function calls don’t fail due to network issues. In microservices, every call is a network call. The network is unreliable:

Timeouts: Always set timeouts. No timeout = thread/connection leak = cascading failure.
Retries with exponential backoff: Retry transient failures, but with jitter to avoid thundering herd.
Circuit breaker: When a service is consistently failing, stop calling it for a cool-down period. Fail fast instead of waiting for timeouts.
Bulkhead: Isolate thread pools per downstream service. If Payment Service is slow, it shouldn’t consume all threads and starve Order reads.

# Circuit breaker implementation
from enum import Enum
import time

class CircuitState(Enum):
    CLOSED = "closed"     # normal — requests flow through
    OPEN = "open"         # tripped — requests fail immediately
    HALF_OPEN = "half_open"  # testing — allow one request through

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is OPEN — failing fast")

        try:
            result = func(*args, **kwargs)
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30)
try:
    result = payment_breaker.call(payment_service.charge, amount=99.99)
except CircuitOpenError:
    # Degrade gracefully: queue for later, show "payment pending"
    queue_for_retry(order_id, amount)

When NOT to Use Microservices

Microservices are not a silver bullet. They are a trade-off, and for many teams, the trade-off isn’t worth it.

Don’t use microservices if…	Why	Better alternative
Team < 10 engineers	Operational overhead dwarfs development speed gains	Modular monolith
Domain not well understood	Wrong boundaries are 10× harder to fix in distributed systems	Monolith first, extract later
Startup (0 → 1 phase)	Requirements change weekly; need speed, not architecture	Monolith (or serverless)
Strong data consistency required	Distributed transactions (sagas) are complex and eventually consistent	Monolith with strong ACID
No DevOps/platform team	Who manages Kubernetes, service mesh, CI/CD pipelines, observability?	PaaS (Heroku, Railway) + monolith
Latency-critical (HFT, gaming)	Every network hop adds 0.5–5ms; inter-process calls are nanoseconds	Monolith or in-process modules

Martin Fowler’s advice: “Don’t even consider microservices unless you have a system that’s too complex to manage as a monolith.” The default should be monolith. Microservices are a response to a specific set of scaling and organisational problems.

Conway’s Law & Two-Pizza Teams

Conway’s Law (1967): “Any organisation that designs a system will produce a design whose structure is a copy of the organisation’s communication structure.”

This is not just an observation — it’s a force of nature in software. If you have three frontend teams and one backend team, you’ll build three frontends and one backend. The architecture mirrors the org chart.

# Conway's Law in practice:
#
# ┌──────────────────────────────────────────────────────────┐
# │                    ORGANISATION                          │
# │                                                          │
# │  ┌─── Team A ────┐  ┌─── Team B ────┐  ┌─── Team C ──┐ │
# │  │ Order Domain   │  │ Payment Domain│  │ Search/Recs  │ │
# │  │ 6 engineers    │  │ 5 engineers   │  │ 7 engineers  │ │
# │  │ Backend + DB   │  │ Backend + DB  │  │ ML + Backend │ │
# │  └───────┬────────┘  └───────┬───────┘  └──────┬──────┘ │
# │          │                   │                  │        │
# └──────────┼───────────────────┼──────────────────┼────────┘
#            │                   │                  │
#            ▼                   ▼                  ▼
# ┌──────────────────────────────────────────────────────────┐
# │                    ARCHITECTURE                          │
# │                                                          │
# │  ┌── Order Svc ──┐  ┌── Payment Svc ┐  ┌── Search Svc ┐ │
# │  │ REST API      │  │ REST API      │  │ gRPC API     │ │
# │  │ MySQL         │  │ PostgreSQL    │  │ Elasticsearch│ │
# │  │ Order events  │  │ Payment events│  │ ML pipeline  │ │
# │  └───────────────┘  └───────────────┘  └──────────────┘ │
# └──────────────────────────────────────────────────────────┘
#
# The architecture IS the org chart

Amazon’s Two-Pizza Teams

Jeff Bezos famously mandated that every team should be small enough to be fed by two pizzas (6–10 people). Combined with his 2002 API mandate (“all teams will expose data and functionality through service interfaces”), this created the organisational structure that produced microservices.

Team size: 6–10 people (2 pizzas). Small enough to have minimal communication overhead, large enough to own a service end-to-end.
Full ownership: Each team owns its service: code, deployment, on-call, database, monitoring. “You build it, you run it.”
Service interface mandate: All communication between teams happens via APIs. No shared databases, no back-door file access, no shared-memory models.
Inverse Conway Manoeuvre: Intentionally structure your org to produce the architecture you want. Want microservices? Create small, autonomous, domain-aligned teams first.

Lesson from Amazon: Bezos didn’t mandate a technology architecture — he mandated an organisational architecture. The microservices followed naturally. If you try to adopt microservices without changing your org structure, you’ll fail. Cross-functional, autonomous teams are a prerequisite.

Key Takeaways

Start with a monolith unless you have a proven need for independent deployment and scaling. Microservices are a tool for organisational scaling, not a default architecture.
Bounded contexts from DDD are the right way to draw service boundaries. Get boundaries wrong and you’ll build a distributed monolith worse than what you started with.
Database-per-service is non-negotiable. Shared databases create coupling that defeats the purpose of microservices. Embrace data duplication and eventual consistency.
Prefer asynchronous communication (events, message queues) over synchronous (REST, gRPC) for inter-service commands. Use sync for queries and user-facing responses.
The Strangler Fig pattern is the safest migration path. Extract one service at a time, starting with the highest-value, most loosely-coupled module.
Invest in platform infrastructure: service mesh, distributed tracing, centralised logging, CI/CD per service. Without this, operational complexity will overwhelm you.
Conway’s Law is real. Restructure your organisation into small, autonomous, domain-aligned teams before decomposing the monolith. Technology follows people.
Contract testing (Pact) is the key testing strategy for microservices. Don’t rely on E2E tests to catch inter-service compatibility issues.