Microservices Architecture
The Monolith — Where Every System Begins
A monolith is a single deployable unit where all business logic lives in one codebase, one process, and one database. It’s not inherently bad — every successful company started with one:
| Company | Original Monolith | When They Split | Trigger |
|---|---|---|---|
| Amazon | Perl/C++ single app (Obidos) | ~2002 | Deployment took 8+ hours; teams blocked each other |
| Netflix | Java monolith on Oracle DB | 2008–2012 | 3-day database corruption outage |
| Twitter | Ruby on Rails “fail whale” | 2010–2013 | Couldn’t handle World Cup traffic spikes |
| Uber | PHP monolith | 2014–2016 | Dispatch logic coupled to payment, GPS, and notifications |
| Shopify | Rails monolith (still monolith!) | Ongoing modularisation | Chose modular monolith over full microservices |
Monolith Pain Points That Drive Migration
# The classic monolith deployment problem:
# 1. Developer changes 3 lines in the payment module
# 2. Entire application must be rebuilt (45 min)
# 3. Full test suite runs (2+ hours)
# 4. Deployment window: Saturday 2 AM
# 5. QA must re-verify search, inventory, and recommendations
# even though they didn't change
# 6. If payment breaks, rollback rolls back EVERYTHING
# Merge conflicts: 200+ developers, 1 repo, 1 branch strategy
$ git pull origin main
# CONFLICT: 47 files changed by 12 different teams
# "Who changed the User model AGAIN?"
# Scaling nightmare:
# Search needs 16 CPU cores but only 2 GB RAM
# Image processing needs 2 cores but 64 GB RAM
# But they're the same binary — you can't scale them independently
# So you run 20 instances of the ENTIRE app on 64 GB machines
# just because ONE module needs memory
What Are Microservices?
Microservices is an architectural style in which a system is composed of small, independently deployable services, each owning a specific business capability and communicating over well-defined APIs.
The key properties are:
- Single responsibility: Each service does one thing well (orders, payments, inventory, search).
- Independent deployment: Change and deploy the payment service without touching search.
- Own data store: Each service owns its database. No shared databases.
- Technology heterogeneity: Search in Elasticsearch, payments in Java, recommendations in Python — each team picks what’s best.
- Decentralised governance: No central architecture committee dictating frameworks.
- Designed for failure: Every network call can fail. Build resilience in from day one.
# Microservices architecture for an e-commerce platform:
#
# ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
# │ API │ │ Web │ │ Mobile │ │ Partner │
# │ Gateway │ │ App │ │ App │ │ API │
# └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
# │ │ │ │
# ─────┴──────────────┴──────────────┴──────────────┴─────
# API Gateway / Load Balancer
# ────────────────────────────────────────────────────────
# │ │ │ │ │
# ┌────▼───┐ ┌───▼────┐ ┌──▼───┐ ┌──▼───┐ ┌───▼────┐
# │ User │ │ Product│ │Order │ │Pay- │ │ Search │
# │Service │ │Service │ │Svc │ │ment │ │Service │
# │(Go) │ │(Java) │ │(Java)│ │(Java)│ │(Python)│
# └───┬────┘ └───┬────┘ └──┬───┘ └──┬───┘ └───┬────┘
# │ │ │ │ │
# ┌───▼────┐ ┌───▼────┐ ┌──▼───┐ ┌──▼───┐ ┌───▼─────────┐
# │Postgres│ │Postgres│ │MySQL │ │MySQL │ │Elasticsearch│
# └────────┘ └────────┘ └──────┘ └──────┘ └─────────────┘
Service Boundaries & Bounded Contexts
The hardest part of microservices isn’t the technology — it’s drawing the right boundaries. Get them wrong and you’ll build a distributed monolith: all the complexity of distributed systems with none of the benefits.
Domain-Driven Design (DDD) provides the conceptual framework. A bounded context is a boundary within which a particular domain model is defined and applicable.
Example: E-Commerce Bounded Contexts
# The word "Product" means different things in different contexts:
#
# Catalog Context:
# Product = name, description, images, categories, SEO metadata
# Operations: browse, search, filter, compare
#
# Inventory Context:
# Product = SKU, warehouse location, quantity, reorder threshold
# Operations: reserve, restock, count, transfer between warehouses
#
# Pricing Context:
# Product = base price, discounts, tax rules, currency conversions
# Operations: calculate price, apply coupon, dynamic pricing
#
# Shipping Context:
# Product = weight, dimensions, fragility flag, hazmat classification
# Operations: calculate shipping cost, estimate delivery date
# WRONG: One "Product" service that handles everything
# RIGHT: Four services, each with its own "Product" model
# Each context has its OWN representation:
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class DiscountRule:  # simplified stand-in so the example is self-contained
    kind: str
    value: Decimal

@dataclass
class CatalogProduct:
    id: str
    name: str
    description: str
    images: list[str]
    categories: list[str]

@dataclass
class InventoryItem:
    sku: str
    product_id: str  # reference to catalog
    warehouse_id: str
    quantity_available: int
    quantity_reserved: int
    reorder_point: int

@dataclass
class PricingEntry:
    product_id: str  # reference to catalog
    base_price: Decimal
    currency: str
    discount_rules: list[DiscountRule]
    tax_category: str

@dataclass
class ShippableItem:
    product_id: str  # reference to catalog
    weight_kg: float
    dimensions_cm: tuple[float, float, float]
    is_fragile: bool
    hazmat_class: str | None
How to Identify Boundaries
- Event Storming: Gather domain experts and developers in a room. Write every domain event on sticky notes (“Order Placed”, “Payment Received”, “Item Shipped”). Group related events — each group suggests a bounded context.
- Linguistic boundaries: When the same word means different things to different teams, you’ve found a boundary. “Account” means “user profile” to the identity team and “billing entity” to finance.
- Change cadence: Features that change together should live together. If search changes weekly but shipping rules change quarterly, they’re separate contexts.
- Data ownership: Who is the authoritative source for this data? Whoever writes it, owns the service. Everybody else reads a copy or calls an API.
Communication: Sync vs Async
Services must communicate. The choice between synchronous and asynchronous communication is one of the most consequential architectural decisions.
Synchronous Communication (Request/Response)
Service A sends a request to Service B and waits for the response. The caller is blocked until the response arrives (or a timeout fires).
| Protocol | Format | Use Case | Latency |
|---|---|---|---|
| REST/HTTP | JSON | Public APIs, CRUD, browser-facing | 1–50 ms (intra-DC) |
| gRPC | Protocol Buffers (binary) | Internal service-to-service, streaming | 0.5–10 ms |
| GraphQL | JSON | API gateway aggregation, mobile clients | Variable (depends on resolvers) |
# REST example — Order Service calls Payment Service
import requests
class OrderService:
def place_order(self, order):
# SYNCHRONOUS: we block here waiting for payment
response = requests.post(
"http://payment-service:8080/api/v1/charges",
json={
"order_id": order.id,
"amount": order.total,
"currency": "USD",
"customer_id": order.customer_id
},
timeout=5 # 5 second timeout — CRITICAL
)
if response.status_code == 201:
order.status = "CONFIRMED"
# What if the DB write fails here?
# Payment was charged but order not confirmed!
else:
order.status = "PAYMENT_FAILED"
# gRPC example — much faster, strongly typed
# payment.proto
# syntax = "proto3";
# service PaymentService {
# rpc ChargeCustomer(ChargeRequest) returns (ChargeResponse);
# rpc StreamPayments(PaymentFilter) returns (stream Payment);
# }
# message ChargeRequest {
# string order_id = 1;
# int64 amount_cents = 2;
# string currency = 3;
# string customer_id = 4;
# }
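For completeness, a sketch of the client side, assuming stubs were generated from payment.proto with grpcio-tools (payment_pb2 and payment_pb2_grpc are the generator's default module names; the port and helper function are illustrative):

import grpc
import payment_pb2
import payment_pb2_grpc

def charge(order_id: str, amount_cents: int, customer_id: str):
    # Plaintext channel for brevity (use TLS/mTLS between real services)
    with grpc.insecure_channel("payment-service:50051") as channel:
        stub = payment_pb2_grpc.PaymentServiceStub(channel)
        request = payment_pb2.ChargeRequest(
            order_id=order_id,
            amount_cents=amount_cents,
            currency="USD",
            customer_id=customer_id,
        )
        # timeout acts as a deadline, the gRPC analogue of the REST timeout above
        return stub.ChargeCustomer(request, timeout=5.0)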
Asynchronous Communication (Event-Driven)
Service A publishes an event/message and continues immediately. Service B processes it later. The producer and consumer are decoupled in time.
# Event-driven: Order Service publishes, Payment Service subscribes
import json
from datetime import datetime
from kafka import KafkaProducer, KafkaConsumer
# --- ORDER SERVICE (Producer) ---
producer = KafkaProducer(
bootstrap_servers='kafka:9092',
value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
class OrderService:
def place_order(self, order):
order.status = "PENDING_PAYMENT"
self.save(order)
# ASYNCHRONOUS: publish and continue — don't wait
producer.send('order-events', {
"event_type": "OrderPlaced",
"order_id": order.id,
"amount": str(order.total),
"currency": "USD",
"customer_id": order.customer_id,
"timestamp": datetime.utcnow().isoformat()
})
# Order service is FREE to handle the next request immediately
# --- PAYMENT SERVICE (Consumer) ---
consumer = KafkaConsumer(
'order-events',
bootstrap_servers='kafka:9092',
group_id='payment-service',
value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
for message in consumer:
event = message.value
if event["event_type"] == "OrderPlaced":
result = charge_customer(event["customer_id"], event["amount"])
# Publish result as another event
producer.send('payment-events', {
"event_type": "PaymentProcessed",
"order_id": event["order_id"],
"status": "SUCCESS" if result.ok else "FAILED",
"timestamp": datetime.utcnow().isoformat()
})
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Coupling | Temporal & spatial coupling | Decoupled (fire-and-forget) |
| Latency | Caller blocked until response | Caller returns immediately |
| Failure | Cascading failures (A→B→C all down) | Isolated (B can catch up later) |
| Consistency | Easier to achieve strong consistency | Eventual consistency |
| Debugging | Stack traces, straightforward | Distributed tracing required |
| Best for | Queries, reads, user-facing requests | Commands, writes, background processing |
API Composition Pattern
In a monolith, displaying an order detail page is a single SQL JOIN. In microservices, data lives in separate services. The API Composition pattern solves this by having a composer service aggregate data from multiple sources.
# API Gateway / BFF (Backend For Frontend) — composes responses
import asyncio

class OrderDetailComposer:
"""
Aggregates data from 4 services to build the order detail view.
Uses asyncio to call all services concurrently — total latency
is max(individual latencies), not sum.
"""
async def get_order_detail(self, order_id: str) -> dict:
# Fan-out: call all services in parallel
order, customer, products, shipment = await asyncio.gather(
self.order_service.get_order(order_id),
self.customer_service.get_customer(order.customer_id),
self.catalog_service.get_products(order.product_ids),
self.shipping_service.get_shipment(order_id),
)
# Fan-in: compose the response
return {
"order_id": order.id,
"status": order.status,
"placed_at": order.created_at,
"customer": {
"name": customer.name,
"email": customer.email
},
"items": [
{
"product_name": p.name,
"quantity": order.items[p.id].quantity,
"price": p.price
}
for p in products
],
"shipping": {
"carrier": shipment.carrier,
"tracking": shipment.tracking_number,
"estimated_delivery": shipment.eta
},
"total": order.total
}
# Latency comparison:
# Monolith: 1 SQL query = ~5 ms
# Microservices (sequential): 4 API calls × ~10 ms = ~40 ms
# Microservices (parallel): ~10 ms (order fetch) + max(8, 12, 9) ≈ ~22 ms + overhead
Data Ownership: Database Per Service
The database-per-service pattern is the most critical (and most painful) principle of microservices. Each service owns its data. No other service can access that database directly — only through the owning service’s API.
# ✅ CORRECT: Each service owns its database
#
# Order Service → orders_db (MySQL)
# Tables: orders, order_items, order_status_history
# Only Order Service reads/writes these tables
#
# Payment Service → payments_db (PostgreSQL)
# Tables: payments, refunds, payment_methods
# Only Payment Service reads/writes these tables
#
# Inventory Service → inventory_db (PostgreSQL)
# Tables: inventory, reservations, warehouses
# Only Inventory Service reads/writes these tables
# ❌ ANTI-PATTERN: Shared database
#
# Order Service ─┐
# Payment Service ─┼──→ shared_db (one big database)
# Inventory Service┘
# All services read/write the same tables
# Schema changes require coordinating ALL teams
# No independent deployment — you have a distributed monolith
Why Shared Databases Are Toxic
- Schema coupling: Changing a column in the users table requires coordinating every service that reads it. One team’s migration breaks another team’s queries.
- Performance coupling: The analytics service runs a full table scan, locking rows that the order service needs. One slow query affects everyone.
- Technology lock-in: Stuck on one database vendor because all services depend on PostgreSQL-specific features.
- Deployment coupling: Can’t deploy independently if you need coordinated schema migrations.
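The rule is easy to express in code: other services reach inventory data only through the Inventory Service's API. A minimal sketch, assuming a hypothetical stock endpoint (the path and port are illustrative):

import requests

def get_stock_level(product_id: str) -> int:
    # ❌ Forbidden: SELECT quantity_available FROM inventory WHERE ...
    # ✅ Allowed: ask the owning service; its API is the only contract
    response = requests.get(
        f"http://inventory-service:8080/api/v1/stock/{product_id}",
        timeout=2,
    )
    response.raise_for_status()
    return response.json()["quantity_available"]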
Data Duplication Is Okay
In microservices, some data duplication is expected and healthy. The Order Service stores a snapshot of the product name and price at the time of purchase. If the Catalog Service later changes the product name, old orders still show the original name.
# Order Service stores a SNAPSHOT, not a reference
order_item = {
"product_id": "prod-123", # reference for linking
"product_name": "Wireless Mouse", # snapshot — won't change
"price_at_purchase": 29.99, # snapshot — frozen in time
"quantity": 2
}
# Even if the product is renamed to "Ergonomic Wireless Mouse"
# in the catalog, this order still shows "Wireless Mouse"
The Strangler Fig Pattern
Named after the strangler fig tree that grows around a host tree and eventually replaces it. This is the safest way to migrate from a monolith — incrementally, without a risky “big bang” rewrite.
# Phase 1: Intercept — Route traffic through a proxy
# All traffic goes through the proxy. Initially, 100% goes to monolith.
#
# Client → [Proxy/API Gateway] → Monolith
# (handles everything)
# Phase 2: Extract — Build new service for ONE feature
# Search is extracted first (high-value, loosely coupled)
#
# Client → [Proxy/API Gateway] ─── /api/search → Search Service (new)
# └── /api/* → Monolith (everything else)
# Phase 3: Migrate data — Dual-write or CDC
# Old search code in monolith still exists but receives no traffic.
# Use Change Data Capture (CDC) to sync data from monolith DB
# to the new Search service's Elasticsearch index.
#
# Monolith DB ──CDC──→ Search Service (Elasticsearch)
# Phase 4: Repeat — Extract next service
# Client → [Proxy] ─── /api/search → Search Service
# ├── /api/payments → Payment Service (new)
# └── /api/* → Monolith (shrinking)
# Phase 5: Decommission — Remove dead code from monolith
# After 6 months with zero traffic to monolith's search module:
# 1. Remove search code from monolith
# 2. Remove search tables from monolith DB
# 3. Shrink monolith's deployment resources
# The monolith gradually "dies" — strangled by the new services
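In Kubernetes terms, the routing proxy in Phases 2–4 can be plain path-based Ingress rules. A hypothetical sketch (service names and ports are illustrative; most controllers match the most specific path first):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: strangler-gateway
spec:
  rules:
  - http:
      paths:
      - path: /api/search          # peeled off to the new service
        pathType: Prefix
        backend:
          service:
            name: search-service
            port:
              number: 8080
      - path: /api                 # everything else still hits the monolith
        pathType: Prefix
        backend:
          service:
            name: monolith
            port:
              number: 8080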
Service Mesh
When you have 50+ microservices, managing networking concerns (retries, timeouts, circuit breakers, mTLS, observability) in application code becomes unsustainable. A service mesh extracts this into infrastructure.
# Without service mesh — every service implements its own:
import time
import requests

class PaymentClient:
    def charge(self, amount):
retries = 3
for attempt in range(retries):
try:
response = requests.post(
"http://payment-service:8080/charge",
json={"amount": amount},
timeout=5
)
if response.status_code == 503:
time.sleep(2 ** attempt) # exponential backoff
continue
return response.json()
except requests.Timeout:
if attempt == retries - 1:
raise
time.sleep(2 ** attempt)
# Every client, in every service, implements this same pattern
# 200 services × 10 clients each = 2000 retry implementations
# each slightly different, each with its own bugs
# With service mesh (Istio/Linkerd) — networking is in the sidecar:
#
# ┌─────────────────────────────────┐
# │ Pod │
# │ ┌──────────┐ ┌──────────────┐ │
# │ │ App │ │ Envoy Proxy │ │
# │ │Container │──│ (Sidecar) │ │
# │ │ │ │ - mTLS │ │
# │ │ Simple │ │ - Retries │ │
# │ │ HTTP │ │ - Timeouts │ │
# │ │ calls │ │ - Circuit │ │
# │ │ │ │ breaker │ │
# │ │ │ │ - Tracing │ │
# │ │ │ │ - Metrics │ │
# │ └──────────┘ └──────────────┘ │
# └─────────────────────────────────┘
#
# App code becomes:
class PaymentClient:
def charge(self, amount):
response = requests.post(
"http://payment-service:8080/charge",
json={"amount": amount}
)
return response.json()
# That's it. Envoy handles retries, timeouts, mTLS, etc.
# Istio VirtualService — configure retries in YAML, not code
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: payment-service
spec:
hosts:
- payment-service
http:
- route:
- destination:
host: payment-service
port:
number: 8080
retries:
attempts: 3
perTryTimeout: 5s
retryOn: 5xx,reset,connect-failure
timeout: 15s
---
# Istio DestinationRule — circuit breaker configuration
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: payment-service
spec:
host: payment-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: DEFAULT
http1MaxPendingRequests: 50
http2MaxRequests: 100
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
Distributed Tracing
In a monolith, a single stack trace shows the full request flow. In microservices, a single user request might touch 8 services. Distributed tracing (Jaeger, Zipkin, OpenTelemetry) stitches traces together across service boundaries.
# OpenTelemetry — instrument your services
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Setup (once at startup)
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(agent_host_name="jaeger", agent_port=6831)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer("order-service")
# Instrument your code
class OrderService:
def place_order(self, request):
with tracer.start_as_current_span("place_order") as span:
span.set_attribute("order.customer_id", request.customer_id)
span.set_attribute("order.item_count", len(request.items))
# Each downstream call inherits the trace context
with tracer.start_as_current_span("validate_inventory"):
inventory = self.inventory_client.check(request.items)
# Inventory service creates child spans automatically
# via context propagation headers (traceparent)
with tracer.start_as_current_span("process_payment"):
payment = self.payment_client.charge(request.total)
with tracer.start_as_current_span("create_order"):
order = self.repository.save(request)
return order
# What you see in Jaeger UI:
# ─── place_order (Order Service) ──────────────────── 145 ms
#     ├── validate_inventory (Order Svc) ───────────── 12 ms
#     │     └── check_stock (Inventory Svc) ────────── 10 ms
#     ├── process_payment (Order Svc) ──────────────── 98 ms
#     │     ├── fraud_check (Payment Svc) ──────────── 45 ms
#     │     └── charge_card (Payment Svc) ──────────── 50 ms
#     └── create_order (Order Svc) ─────────────────── 8 ms
#           └── db_insert (Order Svc) ──────────────── 5 ms
Testing Microservices
Testing distributed systems is fundamentally harder than testing a monolith. The Testing Pyramid still applies, but new layers appear.
Contract Testing (Pact)
Consumer-driven contracts verify that a service provider meets the expectations of its consumers, without deploying both services together. The consumer writes a contract describing what it expects; the provider verifies it can fulfil that contract.
# Consumer side (Order Service expects this from Payment Service)
# order_service/tests/test_payment_contract.py
from pact import Consumer, Provider
pact = Consumer('OrderService').has_pact_with(Provider('PaymentService'))
def test_charge_customer():
expected_body = {
"payment_id": "pay-abc123",
"status": "SUCCESS",
"amount": 99.99
}
(pact
.given("a valid customer with payment method")
.upon_receiving("a charge request")
.with_request("POST", "/api/v1/charges",
body={
"order_id": "ord-789",
"amount": 99.99,
"currency": "USD",
"customer_id": "cust-456"
})
.will_respond_with(201, body=expected_body))
with pact:
# This test runs against a mock server
result = PaymentClient("http://localhost:1234").charge(
order_id="ord-789",
amount=99.99,
currency="USD",
customer_id="cust-456"
)
assert result["status"] == "SUCCESS"
# The pact file is published to a Pact Broker
# Payment Service's CI pulls it and verifies:
# "Can I actually return what OrderService expects?"
# Provider side (Payment Service verifies the contract)
# payment_service/tests/test_verify_contracts.py
from pact import Verifier
def test_verify_order_service_contract():
verifier = Verifier(
provider='PaymentService',
provider_base_url='http://localhost:8080'
)
    output, _ = verifier.verify_pacts(
        './pacts/OrderService-PaymentService.json',
        provider_states_setup_url='http://localhost:8080/_pact/setup'
    )
    assert output == 0  # non-zero exit means the contract was not honoured
The Testing Honeycomb
# Microservices testing layers (fewest tests at top, most at bottom):
#
# ┌─────────────────┐
# │ E2E Tests │ Few — slow, flaky, expensive
# │ (Cypress/ │ Test critical user journeys only
# │ Playwright) │
# ├─────────────────┤
# │ Integration │ Test real DB, real message broker
# │ Tests │ Use Testcontainers for isolation
# ├─────────────────┤
# │ Contract Tests │ ← THE KEY LAYER for microservices
# │ (Pact) │ Verify inter-service compatibility
# ├─────────────────┤
# │ Component │ Test one service in isolation
# │ Tests │ Mock external dependencies
# ├─────────────────┤
# │ Unit Tests │ Fast, isolated, lots of them
# │ │ Domain logic, pure functions
# └─────────────────┘
#
# Anti-pattern: relying on E2E tests to catch integration issues.
# A 200-service E2E test suite takes hours and breaks constantly.
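The Testcontainers layer is worth a concrete sketch. A minimal integration test, assuming testcontainers-python, SQLAlchemy, and a Postgres driver are installed (the table and queries are illustrative):

from sqlalchemy import create_engine, text
from testcontainers.postgres import PostgresContainer

def test_orders_table_roundtrip():
    # A real Postgres 16 runs in Docker for the duration of the with-block
    with PostgresContainer("postgres:16") as pg:
        engine = create_engine(pg.get_connection_url())
        with engine.begin() as conn:
            conn.execute(text(
                "CREATE TABLE orders (id text PRIMARY KEY, status text)"))
            conn.execute(text("INSERT INTO orders VALUES ('ord-1', 'PENDING')"))
            status = conn.execute(text(
                "SELECT status FROM orders WHERE id = 'ord-1'")).scalar_one()
        assert status == "PENDING"  # same engine semantics as production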
Deployment: Containers & Kubernetes
Microservices and containers are natural partners. Each service is packaged as a Docker image and orchestrated by Kubernetes.
Dockerising a Service
# Dockerfile — multi-stage build for a Go service
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /order-service ./cmd/server
FROM alpine:3.19
RUN apk --no-cache add ca-certificates
COPY --from=builder /order-service /order-service
EXPOSE 8080
USER nobody
ENTRYPOINT ["/order-service"]
# Image size: ~15 MB (vs 1.2 GB for a monolith with all dependencies)
# Build time: ~30 seconds (vs 45 minutes for the full monolith)
# Startup time: ~2 seconds (vs 5 minutes for Spring Boot monolith)
Kubernetes Deployment
# k8s/order-service/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
namespace: ecommerce
labels:
app: order-service
version: v2.3.1
spec:
replicas: 3
selector:
matchLabels:
app: order-service
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
template:
metadata:
labels:
app: order-service
version: v2.3.1
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
spec:
serviceAccountName: order-service
containers:
- name: order-service
image: registry.internal/ecommerce/order-service:v2.3.1
ports:
- containerPort: 8080
name: http
- containerPort: 9090
name: metrics
env:
- name: DB_HOST
valueFrom:
secretKeyRef:
name: order-db-credentials
key: host
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: order-db-credentials
key: password
- name: KAFKA_BROKERS
value: "kafka-0.kafka:9092,kafka-1.kafka:9092"
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
name: order-service
namespace: ecommerce
spec:
selector:
app: order-service
ports:
- port: 8080
targetPort: 8080
name: http
- port: 9090
targetPort: 9090
name: metrics
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: order-service
namespace: ecommerce
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: order-service
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
CI/CD Per Service
# .github/workflows/order-service.yml
name: Order Service CI/CD
on:
push:
paths:
- 'services/order-service/**' # Only triggers on changes to this service
- '.github/workflows/order-service.yml'
jobs:
test:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16
env:
POSTGRES_DB: orders_test
POSTGRES_PASSWORD: test
ports: ['5432:5432']
steps:
- uses: actions/checkout@v4
- uses: actions/setup-go@v5
with: { go-version: '1.22' }
- run: cd services/order-service && go test ./... -race -cover
- name: Contract Tests
run: |
cd services/order-service
go test ./contracts/... -tags=contract
build-and-push:
needs: test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/build-push-action@v5
with:
context: services/order-service
push: true
tags: |
registry.internal/ecommerce/order-service:${{ github.sha }}
registry.internal/ecommerce/order-service:latest
deploy:
needs: build-and-push
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: |
kubectl set image deployment/order-service \
order-service=registry.internal/ecommerce/order-service:${{ github.sha }} \
-n ecommerce
kubectl rollout status deployment/order-service -n ecommerce --timeout=300s
The Hard Parts: Challenges & Trade-offs
Distributed Transactions
In a monolith, a single database transaction guarantees ACID. With database-per-service, there’s no global transaction. The Saga pattern coordinates multi-service operations with compensating transactions.
# Saga: Place Order (Choreography-based)
#
# Happy path:
# 1. Order Service: CREATE order (status=PENDING)
# → publishes "OrderCreated"
# 2. Inventory Service: RESERVE items
# → publishes "InventoryReserved"
# 3. Payment Service: CHARGE customer
# → publishes "PaymentProcessed"
# 4. Order Service: UPDATE order (status=CONFIRMED)
# → publishes "OrderConfirmed"
# 5. Shipping Service: CREATE shipment
#
# Failure at step 3 (payment fails):
# 3. Payment Service: CHARGE fails
# → publishes "PaymentFailed"
# 4. Inventory Service: RELEASE reserved items (compensating transaction)
# → publishes "InventoryReleased"
# 5. Order Service: UPDATE order (status=CANCELLED)
# → publishes "OrderCancelled"
# 6. Notification Service: EMAIL customer "order failed"
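Before the orchestrated version below, here is what one choreography participant might look like, reusing the Kafka setup from the earlier examples (reserve_items and release_items are hypothetical helpers that run local DB transactions):

import json
from kafka import KafkaConsumer, KafkaProducer

class InsufficientStockError(Exception):
    pass

producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Inventory Service listens to both order and payment events
consumer = KafkaConsumer(
    'order-events', 'payment-events',
    bootstrap_servers='kafka:9092',
    group_id='inventory-service',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')))

for message in consumer:
    event = message.value
    if event["event_type"] == "OrderPlaced":
        try:
            reserve_items(event["order_id"])   # hypothetical local transaction
            emitted = "InventoryReserved"
        except InsufficientStockError:
            emitted = "InventoryReservationFailed"
        producer.send('inventory-events',
                      {"event_type": emitted, "order_id": event["order_id"]})
    elif event["event_type"] == "PaymentFailed":
        release_items(event["order_id"])       # compensating transaction
        producer.send('inventory-events',
                      {"event_type": "InventoryReleased",
                       "order_id": event["order_id"]})

Note there is no coordinator: each service only knows which events it reacts to and which it emits. That keeps services decoupled, but it makes the overall flow harder to see, which is exactly what the orchestrator below trades back.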
# Orchestration-based Saga (Order Saga Orchestrator)
class OrderSagaOrchestrator:
"""Central coordinator that drives the saga steps."""
async def execute(self, order_request):
saga_id = generate_id()
try:
# Step 1: Create order
order = await self.order_service.create(order_request)
self.log_step(saga_id, "ORDER_CREATED", order.id)
# Step 2: Reserve inventory
reservation = await self.inventory_service.reserve(
order.items
)
self.log_step(saga_id, "INVENTORY_RESERVED", reservation.id)
# Step 3: Process payment
payment = await self.payment_service.charge(
order.customer_id, order.total
)
self.log_step(saga_id, "PAYMENT_PROCESSED", payment.id)
# Step 4: Confirm order
await self.order_service.confirm(order.id)
self.log_step(saga_id, "ORDER_CONFIRMED", order.id)
return order
except PaymentFailedError:
# Compensate: release inventory
await self.inventory_service.release(reservation.id)
await self.order_service.cancel(order.id)
self.log_step(saga_id, "SAGA_COMPENSATED", "payment_failed")
raise
except InventoryInsufficientError:
# Compensate: cancel order
await self.order_service.cancel(order.id)
self.log_step(saga_id, "SAGA_COMPENSATED", "no_inventory")
raise
Data Consistency
Microservices embrace eventual consistency. Data will converge to a consistent state, but there’s a window where different services have different views. Techniques:
- Event sourcing: Store events instead of current state. Replay events to rebuild state. Guarantees audit trail and temporal queries.
- CQRS: Separate read and write models. Writes go to the source-of-truth service; reads come from materialised views optimised for queries.
- Outbox pattern: Write the event to a local “outbox” table in the same DB transaction as the business data. A separate process polls the outbox and publishes to the message broker. Guarantees at-least-once delivery.
# Outbox Pattern — guarantees event delivery
from datetime import datetime
from uuid import uuid4

class OrderService:
def place_order(self, order_request):
with self.db.transaction():
# Business write
order = Order.create(order_request)
self.db.insert("orders", order)
# Event write — SAME transaction
event = {
"id": uuid4(),
"aggregate_type": "Order",
"aggregate_id": order.id,
"event_type": "OrderPlaced",
"payload": order.to_dict(),
"created_at": datetime.utcnow()
}
self.db.insert("outbox", event)
# Both succeed or both fail — no dual-write problem
# Outbox Relay (separate process or Debezium CDC)
class OutboxRelay:
def poll_and_publish(self):
events = self.db.query(
"SELECT * FROM outbox WHERE published = FALSE "
"ORDER BY created_at LIMIT 100"
)
for event in events:
self.kafka.send(event["event_type"], event["payload"])
self.db.update("outbox",
{"published": True},
{"id": event["id"]})
Operational Complexity
Microservices trade code complexity for operational complexity:
| Monolith | Microservices |
|---|---|
| 1 repo | 50–500 repos (or a monorepo) |
| 1 CI pipeline | 50–500 CI pipelines |
| 1 deployment | 50–500 deployments per day |
| 1 log file | Centralised logging (ELK/Loki) required |
| Stack traces | Distributed tracing (Jaeger) required |
| Local dev: ./run.sh | Local dev: Docker Compose with 15 services + Kafka + DBs |
| Hire backend devs | Hire platform/DevOps/SRE team |
Network Failures & Resilience
In a monolith, function calls don’t fail due to network issues. In microservices, every call is a network call. The network is unreliable:
- Timeouts: Always set timeouts. No timeout = thread/connection leak = cascading failure.
- Retries with exponential backoff: Retry transient failures, but with jitter to avoid thundering herd (see the sketch after this list).
- Circuit breaker: When a service is consistently failing, stop calling it for a cool-down period. Fail fast instead of waiting for timeouts.
- Bulkhead: Isolate thread pools per downstream service. If Payment Service is slow, it shouldn’t consume all threads and starve Order reads.
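The circuit breaker gets a full implementation below; the retry bullet deserves one too. A minimal sketch of retries with exponential backoff and full jitter (the URL, limits, and helper name are illustrative):

import random
import time

import requests

def call_with_backoff(url: str, payload: dict, retries: int = 3,
                      base_delay: float = 0.5, timeout: float = 5.0) -> dict:
    """POST with retries: exponential backoff + full jitter on transient failures."""
    for attempt in range(retries):
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            if response.status_code < 500:
                response.raise_for_status()  # 4xx: don't retry client errors
                return response.json()
        except (requests.Timeout, requests.ConnectionError):
            pass  # transient; fall through to the backoff sleep
        if attempt == retries - 1:
            raise RuntimeError(f"{url} still failing after {retries} attempts")
        # Full jitter: sleep a random fraction of the exponential ceiling,
        # so simultaneous retries don't stampede the recovering service.
        time.sleep(random.uniform(0, base_delay * (2 ** attempt)))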
# Circuit breaker implementation
from enum import Enum
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected immediately."""

class CircuitState(Enum):
CLOSED = "closed" # normal — requests flow through
OPEN = "open" # tripped — requests fail immediately
HALF_OPEN = "half_open" # testing — allow one request through
class CircuitBreaker:
def __init__(self, failure_threshold=5, reset_timeout=30):
self.state = CircuitState.CLOSED
self.failure_count = 0
self.failure_threshold = failure_threshold
self.reset_timeout = reset_timeout
self.last_failure_time = None
def call(self, func, *args, **kwargs):
if self.state == CircuitState.OPEN:
if time.time() - self.last_failure_time > self.reset_timeout:
self.state = CircuitState.HALF_OPEN
else:
raise CircuitOpenError("Circuit is OPEN — failing fast")
try:
result = func(*args, **kwargs)
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.CLOSED
self.failure_count = 0
return result
except Exception as e:
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
raise
# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30)
try:
result = payment_breaker.call(payment_service.charge, amount=99.99)
except CircuitOpenError:
# Degrade gracefully: queue for later, show "payment pending"
queue_for_retry(order_id, amount)
When NOT to Use Microservices
Microservices are not a silver bullet. They are a trade-off, and for many teams, the trade-off isn’t worth it.
| Don’t use microservices if… | Why | Better alternative |
|---|---|---|
| Team < 10 engineers | Operational overhead dwarfs development speed gains | Modular monolith |
| Domain not well understood | Wrong boundaries are 10× harder to fix in distributed systems | Monolith first, extract later |
| Startup (0 → 1 phase) | Requirements change weekly; need speed, not architecture | Monolith (or serverless) |
| Strong data consistency required | Distributed transactions (sagas) are complex and eventually consistent | Monolith with strong ACID |
| No DevOps/platform team | Who manages Kubernetes, service mesh, CI/CD pipelines, observability? | PaaS (Heroku, Railway) + monolith |
| Latency-critical (HFT, gaming) | Every network hop adds 0.5–5ms; inter-process calls are nanoseconds | Monolith or in-process modules |
Conway’s Law & Two-Pizza Teams
Conway’s Law (1967): “Any organisation that designs a system will produce a design whose structure is a copy of the organisation’s communication structure.”
This is not just an observation — it’s a force of nature in software. If you have three frontend teams and one backend team, you’ll build three frontends and one backend. The architecture mirrors the org chart.
# Conway's Law in practice:
#
# ┌──────────────────────────────────────────────────────────┐
# │ ORGANISATION │
# │ │
# │ ┌─── Team A ────┐ ┌─── Team B ────┐ ┌─── Team C ──┐ │
# │ │ Order Domain │ │ Payment Domain│ │ Search/Recs │ │
# │ │ 6 engineers │ │ 5 engineers │ │ 7 engineers │ │
# │ │ Backend + DB │ │ Backend + DB │ │ ML + Backend │ │
# │ └───────┬────────┘ └───────┬───────┘ └──────┬──────┘ │
# │ │ │ │ │
# └──────────┼───────────────────┼──────────────────┼────────┘
# │ │ │
# ▼ ▼ ▼
# ┌──────────────────────────────────────────────────────────┐
# │ ARCHITECTURE │
# │ │
# │ ┌── Order Svc ──┐ ┌── Payment Svc ┐ ┌── Search Svc ┐ │
# │ │ REST API │ │ REST API │ │ gRPC API │ │
# │ │ MySQL │ │ PostgreSQL │ │ Elasticsearch│ │
# │ │ Order events │ │ Payment events│ │ ML pipeline │ │
# │ └───────────────┘ └───────────────┘ └──────────────┘ │
# └──────────────────────────────────────────────────────────┘
#
# The architecture IS the org chart
Amazon’s Two-Pizza Teams
Jeff Bezos famously mandated that every team should be small enough to be fed by two pizzas (6–10 people). Combined with his 2002 API mandate (“all teams will expose data and functionality through service interfaces”), this created the organisational structure that produced microservices.
- Team size: 6–10 people (2 pizzas). Small enough to have minimal communication overhead, large enough to own a service end-to-end.
- Full ownership: Each team owns its service: code, deployment, on-call, database, monitoring. “You build it, you run it.”
- Service interface mandate: All communication between teams happens via APIs. No shared databases, no back-door file access, no shared-memory models.
- Inverse Conway Manoeuvre: Intentionally structure your org to produce the architecture you want. Want microservices? Create small, autonomous, domain-aligned teams first.
Key Takeaways
- Start with a monolith unless you have a proven need for independent deployment and scaling. Microservices are a tool for organisational scaling, not a default architecture.
- Bounded contexts from DDD are the right way to draw service boundaries. Get boundaries wrong and you’ll build a distributed monolith worse than what you started with.
- Database-per-service is non-negotiable. Shared databases create coupling that defeats the purpose of microservices. Embrace data duplication and eventual consistency.
- Prefer asynchronous communication (events, message queues) over synchronous (REST, gRPC) for inter-service commands. Use sync for queries and user-facing responses.
- The Strangler Fig pattern is the safest migration path. Extract one service at a time, starting with the highest-value, most loosely-coupled module.
- Invest in platform infrastructure: service mesh, distributed tracing, centralised logging, CI/CD per service. Without this, operational complexity will overwhelm you.
- Conway’s Law is real. Restructure your organisation into small, autonomous, domain-aligned teams before decomposing the monolith. Technology follows people.
- Contract testing (Pact) is the key testing strategy for microservices. Don’t rely on E2E tests to catch inter-service compatibility issues.