High Level Design Series · Building Blocks · Part 5

Service Discovery & Registration

The Problem

In a monolithic application, one service calls another through a simple in-process function call or a well-known, hard-coded address. But in a microservices architecture, everything is dynamic. Containers spin up and down, auto-scaling groups resize, blue-green deployments swap addresses, and Kubernetes pods get new IPs every time they restart. Hard-coding service addresses simply doesn’t work.

Consider a shopping application with these microservices:

# Static configuration — breaks constantly
ORDER_SERVICE=http://10.0.2.15:8080
PAYMENT_SERVICE=http://10.0.2.16:8081
INVENTORY_SERVICE=http://10.0.2.17:8082
NOTIFICATION_SERVICE=http://10.0.2.18:8083

Any time a service crashes and restarts on a different node, any time auto-scaling adds a new instance, or any time a deployment replaces pods — those addresses are stale. Requests fail. Revenue is lost. On-call engineers get paged at 3 AM.

Service discovery solves this by providing a dynamic, real-time mechanism for services to find and communicate with each other. It answers the fundamental question: “What are the current network locations of healthy instances of Service X?”

A service discovery system has three core responsibilities:

  1. Registration: Services announce their presence (address, port, metadata) when they start.
  2. Discovery: Clients query for available instances of a target service.
  3. Health monitoring: Unhealthy instances are detected and removed from the registry.

There are two fundamental patterns for how discovery happens: client-side and server-side.
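
Before diving into the two patterns, the sketch below shows those three responsibilities working together. It is a minimal, hypothetical in-memory registry (the class and method names are illustrative, not taken from any real product): instances register with a TTL, renew it via heartbeats, and drop out of discovery results once their lease expires.

# Minimal in-memory registry sketch (illustrative names, not a real product)
import time

class ServiceRegistry:
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.instances = {}   # service name -> {instance_id: (address, last_heartbeat)}

    def register(self, service, instance_id, address):
        # Registration: an instance announces its address when it starts
        self.instances.setdefault(service, {})[instance_id] = (address, time.time())

    def heartbeat(self, service, instance_id):
        # Health monitoring: renew the lease; missed renewals lead to eviction
        entry = self.instances.get(service, {}).get(instance_id)
        if entry:
            self.instances[service][instance_id] = (entry[0], time.time())

    def discover(self, service):
        # Discovery: return only instances whose lease has not expired
        now = time.time()
        live = {i: (addr, ts)
                for i, (addr, ts) in self.instances.get(service, {}).items()
                if now - ts < self.ttl}
        self.instances[service] = live
        return [addr for addr, _ in live.values()]

registry = ServiceRegistry(ttl_seconds=30)
registry.register("payment-service", "i-001", "10.0.1.15:8080")
print(registry.discover("payment-service"))   # ['10.0.1.15:8080']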

Client-Side Discovery

In the client-side discovery pattern, the client is responsible for querying the service registry and selecting an appropriate instance. The client has built-in load-balancing logic and connects directly to the chosen instance.

This is the pattern made famous by Netflix OSS: Eureka (service registry) + Ribbon (client-side load balancer).

How It Works

  1. When Service B starts, it registers itself with the registry (IP, port, health URL).
  2. Service B sends periodic heartbeats to the registry (every 30 seconds by default in Eureka).
  3. When Service A needs to call Service B, it queries the registry for all healthy instances of Service B.
  4. Service A’s built-in load balancer (Ribbon) picks one instance using round-robin, weighted, or random strategy.
  5. Service A calls Service B directly — no intermediary.
// Spring Cloud Netflix — Client-Side Discovery
@SpringBootApplication
@EnableDiscoveryClient
public class OrderServiceApplication {
    public static void main(String[] args) {
        SpringApplication.run(OrderServiceApplication.class, args);
    }
}

# application.yml — registration config
eureka:
  client:
    serviceUrl:
      defaultZone: http://eureka-server:8761/eureka/
    registryFetchIntervalSeconds: 5
  instance:
    leaseRenewalIntervalInSeconds: 10
    leaseExpirationDurationInSeconds: 30
    instanceId: ${spring.application.name}:${random.value}
    metadataMap:
      version: v2.1.0
      region: us-east-1

// Using Ribbon for client-side load balancing
@Bean
@LoadBalanced
public RestTemplate restTemplate() {
    return new RestTemplate();
}

// Call by service name — Ribbon resolves to actual IP:port
String result = restTemplate
    .getForObject("http://payment-service/api/pay", String.class);
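
Outside the Spring ecosystem the same pattern is easy to sketch by hand. The Python example below is a hedged illustration that assumes a hypothetical registry HTTP endpoint (GET /registry/&lt;service&gt; returning a JSON list of host:port strings); the client caches the instance list, round-robins across it, and calls the chosen instance directly.

# Client-side discovery by hand (hypothetical registry endpoint, illustrative only)
import itertools
import requests

REGISTRY_URL = "http://registry:8761/registry"    # placeholder registry address

class DiscoveringClient:
    def __init__(self, service_name):
        self.service_name = service_name
        self._cycle = None

    def _refresh(self):
        # Fetch the current healthy instances, e.g. ["10.0.1.15:8080", "10.0.1.16:8080"]
        instances = requests.get(f"{REGISTRY_URL}/{self.service_name}", timeout=2).json()
        self._cycle = itertools.cycle(instances)

    def get(self, path):
        if self._cycle is None:
            self._refresh()
        instance = next(self._cycle)          # round-robin selection in the client
        return requests.get(f"http://{instance}{path}", timeout=2)   # direct call, no proxy hop

client = DiscoveringClient("payment-service")
response = client.get("/api/pay")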

Pros and Cons

Pros | Cons
No proxy hop — lower latency | Every client needs a discovery library
Client can make smart routing decisions (zone-aware, version-aware) | Couples clients to the registry technology
No single point of failure from a load balancer | Discovery logic duplicated across every language/framework
Client can cache registry data, surviving brief registry outages | Harder to manage when you have polyglot services

Server-Side Discovery

In server-side discovery, the client sends its request to a load balancer or router. The router is responsible for querying the service registry and forwarding the request to an appropriate instance. The client has no knowledge of the registry.

This is the pattern used by AWS Elastic Load Balancer (ELB), Kubernetes Services, and NGINX with service discovery plugins.

How It Works

  1. Service B instances register with the registry at startup.
  2. Service A sends a request to the load balancer (a well-known, stable address).
  3. The load balancer queries the registry for healthy Service B instances.
  4. The load balancer forwards the request to one of the instances.
  5. The response flows back through the load balancer to Service A.
# AWS Application Load Balancer — target group with auto-discovery
resource "aws_lb_target_group" "payment" {
  name        = "payment-service-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"

  health_check {
    path                = "/actuator/health"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}

# ECS service auto-registers tasks with the target group
resource "aws_ecs_service" "payment" {
  name            = "payment-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.payment.arn
  desired_count   = 3

  load_balancer {
    target_group_arn = aws_lb_target_group.payment.arn
    container_name   = "payment"
    container_port   = 8080
  }
}

Pros and Cons

Pros | Cons
Clients are simple — no discovery logic needed | Extra network hop through load balancer (adds latency)
Language-agnostic — any HTTP client works | Load balancer can become a bottleneck or SPOF
Centralized traffic management, TLS termination | More infrastructure to manage and scale
Easy to add cross-cutting concerns (rate limiting, auth) | Load balancer must be highly available itself

Which pattern should you choose? Server-side discovery (e.g., Kubernetes Services) is the default for most teams because it keeps clients simple. Client-side discovery is preferred when you need fine-grained routing control, ultra-low latency, or when running on the Netflix OSS / Spring Cloud stack.

Service Registry

The service registry is the central database of service instance locations. It is the heart of any discovery system. Every instance that starts must register, and every client must query it. The registry must be highly available, consistent (or at least eventually consistent), and fast.

etcd

etcd is a distributed key-value store that uses the Raft consensus algorithm for strong consistency. It is the backbone of Kubernetes — all cluster state (including service endpoints) is stored in etcd.

# Register a service instance in etcd
etcdctl put /services/payment-service/instances/i-001 \
  '{"host":"10.0.1.15","port":8080,"health":"/health","version":"v2.1"}'

# Set a TTL (lease) — instance must renew or be evicted
etcdctl lease grant 30    # 30-second lease
# lease 694d7f0d43c3a01e granted with TTL(30s)

etcdctl put --lease=694d7f0d43c3a01e \
  /services/payment-service/instances/i-001 \
  '{"host":"10.0.1.15","port":8080}'

# Keep alive — service sends this periodically
etcdctl lease keep-alive 694d7f0d43c3a01e

# Discover all instances of payment-service
etcdctl get /services/payment-service/instances/ --prefix
# /services/payment-service/instances/i-001
# {"host":"10.0.1.15","port":8080}
# /services/payment-service/instances/i-002
# {"host":"10.0.1.16","port":8080}

# Watch for changes (real-time push)
etcdctl watch /services/payment-service/instances/ --prefix

Consul

HashiCorp Consul combines a service registry, health checking, and a KV store. It uses a gossip protocol (Serf) for membership and Raft for leader election and state replication. Unlike etcd, which is a general-purpose KV store, Consul is designed specifically for service discovery.

# consul-service.json — service registration config
{
  "service": {
    "name": "payment-service",
    "id": "payment-001",
    "port": 8080,
    "tags": ["v2.1", "production", "us-east-1"],
    "meta": {
      "version": "2.1.0",
      "protocol": "grpc"
    },
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "3s",
      "deregister_critical_service_after": "90s"
    }
  }
}

# Register via API
curl -X PUT http://consul-server:8500/v1/agent/service/register \
  -d @consul-service.json

# Discover healthy instances via DNS
dig @consul-server -p 8600 payment-service.service.consul SRV
# ;; ANSWER SECTION:
# payment-service.service.consul. 0 IN SRV 1 1 8080 i-001.node.dc1.consul.
# payment-service.service.consul. 0 IN SRV 1 1 8080 i-002.node.dc1.consul.

# Discover via HTTP API
curl http://consul-server:8500/v1/health/service/payment-service?passing=true
# Returns JSON with all healthy instances, ports, metadata

ZooKeeper

Apache ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol for consensus. It was one of the earliest coordination services (originally built for Hadoop). Services register as ephemeral znodes that automatically disappear when the session ends (heartbeats stop).

# ZooKeeper service registration using ephemeral nodes
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

# Ensure the parent path exists first (create raises NoNodeError otherwise)
zk.ensure_path("/services/payment-service")

# Create ephemeral sequential node — auto-deleted on disconnect
zk.create(
    "/services/payment-service/instance-",
    b'{"host":"10.0.1.15","port":8080,"version":"v2.1"}',
    ephemeral=True,     # disappears when session dies
    sequence=True       # appends unique suffix: instance-0000000001
)

# Discover all instances
instances = zk.get_children("/services/payment-service")
for inst in instances:
    data, stat = zk.get(f"/services/payment-service/{inst}")
    print(f"{inst}: {data.decode()}")

# Watch for changes (real-time notification)
@zk.ChildrenWatch("/services/payment-service")
def watch_instances(children):
    print(f"Current instances: {children}")
    # Re-fetch data for each instance, update local cache

Netflix Eureka

Eureka is an AP system (in CAP terms) — it favors availability and partition tolerance over consistency. If a Eureka server loses connectivity, it enters self-preservation mode and stops evicting instances, preferring stale data to no data. This makes Eureka extremely resilient but means clients may occasionally get stale endpoints.

# Eureka server — application.yml
server:
  port: 8761

eureka:
  instance:
    hostname: eureka-server
  client:
    registerWithEureka: false
    fetchRegistry: false
  server:
    enableSelfPreservation: true
    renewalPercentThreshold: 0.85
    evictionIntervalTimerInMs: 60000

# Eureka REST API — query instances
# GET /eureka/apps/PAYMENT-SERVICE
# Returns XML/JSON with all registered instances:
# {
#   "application": {
#     "name": "PAYMENT-SERVICE",
#     "instance": [
#       {
#         "hostName": "10.0.1.15",
#         "port": {"$": 8080, "@enabled": "true"},
#         "status": "UP",
#         "metadata": {"version": "v2.1.0"}
#       }
#     ]
#   }
# }

Comparison

Feature | etcd | Consul | ZooKeeper | Eureka
Consensus | Raft | Raft + Gossip | ZAB | Peer replication (AP)
CAP | CP | CP (default) | CP | AP
Data model | Flat KV | KV + Service catalog | Hierarchical znodes | Service-instance hierarchy
Health checking | TTL leases | HTTP, TCP, gRPC, script | Ephemeral nodes | Heartbeat (renew lease)
DNS support | No (use CoreDNS plugin) | Built-in DNS interface | No | No
Language | Go | Go | Java | Java
Primary use | Kubernetes, general KV | Service mesh, discovery | Hadoop, Kafka coordination | Spring Cloud microservices
Watch support | Yes (streaming) | Yes (blocking queries) | Yes (watchers) | Yes (polling)

Health Checks

A registry full of dead instances is worse than no registry at all. Health checks are the mechanism that keeps the registry accurate by continuously verifying that registered instances are actually alive and capable of serving traffic.

Health Check Patterns

1. Heartbeat (Push-Based)

The service instance periodically sends a heartbeat (lease renewal) to the registry. If the registry misses N consecutive heartbeats, it deregisters the instance. Used by Eureka and etcd (TTL leases).

# Eureka heartbeat: instance sends PUT every 30 seconds
PUT /eureka/apps/PAYMENT-SERVICE/i-001
# 200 OK — lease renewed
# If missed for 90 seconds (3 intervals), instance is evicted
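
In application code, the renewal is usually a small background loop. The sketch below is a hedged example against Eureka's renewal endpoint (the server URL, application name, and instance ID are placeholders); on a 404 the instance would need to re-register.

# Heartbeat loop sketch: renew the lease periodically, re-register if it was lost
import threading
import time
import requests

EUREKA = "http://eureka-server:8761/eureka"       # placeholder server address
APP, INSTANCE_ID = "PAYMENT-SERVICE", "i-001"     # placeholder identifiers

def send_heartbeats(interval_seconds=30):
    while True:
        resp = requests.put(f"{EUREKA}/apps/{APP}/{INSTANCE_ID}", timeout=5)
        if resp.status_code == 404:
            # The registry evicted this instance; registration must be repeated
            print("lease lost, re-registration required")
        time.sleep(interval_seconds)

# Daemon thread: heartbeats stop automatically when the service process exits
threading.Thread(target=send_heartbeats, daemon=True).start()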

2. HTTP Health Endpoint (Pull-Based)

The registry (or a health checker) periodically calls an HTTP endpoint on the service. This is more reliable than heartbeats because it verifies the service can actually handle requests, not just that the process is running.

# Spring Boot Actuator health endpoint
# GET http://payment-service:8080/actuator/health
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "PostgreSQL",
        "validationQuery": "isValid()"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 107374182400,
        "free": 85899345920,
        "threshold": 10485760
      }
    },
    "redis": {
      "status": "UP"
    }
  }
}

3. TCP Check

Simply attempts a TCP connection to the service port. If the connection succeeds, the service is considered healthy. Less thorough than HTTP checks but works for non-HTTP services (databases, message brokers).
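
A TCP check fits in a few lines. The sketch below (host and port are illustrative) simply reports whether a connection can be opened within a timeout.

# TCP health check: healthy if a connection opens within the timeout
import socket

def tcp_check(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True       # port is accepting connections
    except OSError:
        return False          # connection refused, unreachable, or timed out

print(tcp_check("10.0.1.15", 5432))   # e.g. a PostgreSQL instance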

4. gRPC Health Check

// gRPC Health Checking Protocol (standard)
syntax = "proto3";
package grpc.health.v1;

service Health {
  rpc Check(HealthCheckRequest) returns (HealthCheckResponse);
  rpc Watch(HealthCheckRequest) returns (stream HealthCheckResponse);
}

message HealthCheckRequest {
  string service = 1;  // empty string = overall health
}

message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
    SERVICE_UNKNOWN = 3;
  }
  ServingStatus status = 1;
}
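
Serving this protocol from Python can be done with the grpcio and grpcio-health-checking packages; the sketch below registers the standard health servicer and marks both overall health and a hypothetical payment.PaymentService as SERVING.

# Exposing the standard gRPC health service (grpcio + grpcio-health-checking)
from concurrent import futures
import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))

# Register the reference Health servicer and set per-service statuses
health_servicer = health.HealthServicer()
health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)   # overall health
health_servicer.set("payment.PaymentService",                     # hypothetical service name
                    health_pb2.HealthCheckResponse.SERVING)

server.add_insecure_port("[::]:9090")    # illustrative port
server.start()
server.wait_for_termination()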

Consul Health Checks in Detail

# consul-checks.json — multiple health checks per service
{
  "service": {
    "name": "payment-service",
    "port": 8080,
    "checks": [
      {
        "name": "HTTP API Health",
        "http": "http://localhost:8080/health",
        "method": "GET",
        "interval": "10s",
        "timeout": "3s"
      },
      {
        "name": "Database Connectivity",
        "args": ["/usr/local/bin/check-db.sh"],
        "interval": "30s",
        "timeout": "10s"
      },
      {
        "name": "Memory Usage",
        "args": ["/usr/local/bin/check-memory.sh", "80"],
        "interval": "15s"
      }
    ]
  }
}

# Consul check statuses:
# "passing"  — healthy, included in DNS/API results
# "warning"  — degraded, still included by default
# "critical" — unhealthy, excluded from results
# After deregister_critical_service_after, fully removed

Kubernetes Probes

Kubernetes has three types of probes, each serving a distinct purpose:

apiVersion: v1
kind: Pod
metadata:
  name: payment-service
spec:
  containers:
  - name: payment
    image: payment-service:v2.1
    ports:
    - containerPort: 8080

    # LIVENESS PROBE — "Is the process alive?"
    # Failure: kubelet RESTARTS the container
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 15    # wait for app startup
      periodSeconds: 10          # check every 10s
      timeoutSeconds: 3
      failureThreshold: 3        # 3 consecutive failures = restart

    # READINESS PROBE — "Can it handle traffic?"
    # Failure: pod is removed from Service endpoints (no traffic)
    # Pod is NOT restarted — just stops receiving requests
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3

    # STARTUP PROBE (K8s 1.16+) — "Has it finished starting?"
    # Disables liveness/readiness until startup succeeds
    # Prevents slow-starting apps from being killed prematurely
    startupProbe:
      httpGet:
        path: /health/started
        port: 8080
      failureThreshold: 30       # 30 * 10s = 5 minutes to start
      periodSeconds: 10
Liveness vs. Readiness: A service that is live but not ready (e.g., warming caches, loading ML models) should pass the liveness probe but fail readiness. It stays running but receives no traffic until it’s ready. A service that fails liveness is assumed to be deadlocked or crashed and gets restarted.
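
In application code, one way to express that distinction is to back the two probe paths with different logic. The Flask sketch below is illustrative (the dependency check and warm-up flag are placeholders): liveness only confirms the process can respond, while readiness also requires that warm-up has finished and dependencies answer.

# Separate liveness and readiness endpoints (hedged Flask sketch)
from flask import Flask

app = Flask(__name__)
warmed_up = False            # flip to True once caches/models are loaded (placeholder flag)

def dependencies_ok():
    # Placeholder: check the database, caches, downstream services, etc.
    return True

@app.route("/health/live")
def live():
    # Liveness: the process is up and responding; failure makes the kubelet restart the pod
    return {"status": "UP"}, 200

@app.route("/health/ready")
def ready():
    # Readiness: accept traffic only after warm-up and while dependencies respond
    if warmed_up and dependencies_ok():
        return {"status": "UP"}, 200
    return {"status": "OUT_OF_SERVICE"}, 503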

DNS-Based Discovery

DNS is the oldest and most universal form of service discovery. Every programming language, every framework, and every operating system speaks DNS. Using DNS for service discovery means zero library dependencies for clients.

DNS SRV Records

Standard A/AAAA records only resolve hostnames to IP addresses. SRV records add port information and priority/weight for load balancing:

# SRV record format:
# _service._proto.name  TTL  class  SRV  priority  weight  port  target
_http._tcp.payment.prod.example.com. 30 IN SRV 10 60 8080 i-001.example.com.
_http._tcp.payment.prod.example.com. 30 IN SRV 10 30 8080 i-002.example.com.
_http._tcp.payment.prod.example.com. 30 IN SRV 20 10 8080 i-003.example.com.

# priority 10 instances are preferred over priority 20
# within same priority, weight determines distribution:
# i-001 gets ~67% traffic (60/90), i-002 gets ~33% (30/90)
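
With the dnspython library, a client can resolve the SRV record and honor priority and weight itself. The sketch below uses the record name from the example above; it keeps only the most preferred priority group and picks within it by weight.

# Resolve SRV records and pick an instance by priority, then weight (dnspython)
import random
import dns.resolver

answers = dns.resolver.resolve("_http._tcp.payment.prod.example.com", "SRV")

# Keep only the lowest (most preferred) priority group
best_priority = min(r.priority for r in answers)
candidates = [r for r in answers if r.priority == best_priority]

# Weighted random choice within the group (i-001 ~67%, i-002 ~33% in the example above)
chosen = random.choices(candidates, weights=[r.weight for r in candidates], k=1)[0]
url = f"http://{str(chosen.target).rstrip('.')}:{chosen.port}/api/pay"
print(url)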

CoreDNS in Kubernetes

CoreDNS is the default DNS server in Kubernetes clusters. It watches the Kubernetes API for Service and Endpoint changes and automatically serves DNS records:

# Every Kubernetes Service gets a DNS entry:
# <service-name>.<namespace>.svc.cluster.local

# ClusterIP Service — returns the virtual IP
$ nslookup payment-service.production.svc.cluster.local
Server:    10.96.0.10     # CoreDNS
Address:   10.96.0.10#53
Name:      payment-service.production.svc.cluster.local
Address:   10.100.45.123  # ClusterIP (virtual IP)

# Headless Service (clusterIP: None) — returns pod IPs directly
$ nslookup payment-headless.production.svc.cluster.local
Name:      payment-headless.production.svc.cluster.local
Address:   10.244.1.15    # Pod 1
Address:   10.244.2.23    # Pod 2
Address:   10.244.3.8     # Pod 3

# SRV records for named ports
$ dig _http._tcp.payment-headless.production.svc.cluster.local SRV
# Returns port and target for each pod

Pros and Cons of DNS-Based Discovery

Pros | Cons
Universal — every language/framework supports DNS | TTL caching causes stale results; clients may hit dead instances
No special client libraries needed | DNS doesn’t support sophisticated load balancing
Works across organizational boundaries | No built-in health checking (must be layered on top)
Low operational overhead | SRV records are not universally supported by all HTTP clients

The TTL Problem: If you set TTL too high, clients cache stale IPs. If you set it too low, you flood the DNS server with queries. Kubernetes sets TTL to 30 seconds by default. Some applications (especially JVM-based) cache DNS results indefinitely unless explicitly configured not to (networkaddress.cache.ttl=10 in Java).

Service Mesh

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It moves discovery, load balancing, encryption, observability, and retries out of the application code and into a network proxy (sidecar) that runs alongside every service instance.

Architecture: Data Plane vs Control Plane

Every service mesh has two components: the data plane (the sidecar proxies deployed next to each service instance, which carry the actual request traffic and apply routing, retries, and mTLS) and the control plane (which watches service discovery, pushes configuration to the proxies, and collects telemetry).

# Istio VirtualService — traffic management with service discovery
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service       # service discovery name
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: payment-service
        subset: v2           # canary version
      weight: 100
  - route:
    - destination:
        host: payment-service
        subset: v1           # stable version
      weight: 90
    - destination:
        host: payment-service
        subset: v2           # canary gets 10% traffic
      weight: 10

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 100
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    tls:
      mode: ISTIO_MUTUAL    # automatic mTLS
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

Major Service Meshes

Feature | Istio | Linkerd | Consul Connect
Sidecar proxy | Envoy | linkerd2-proxy (Rust) | Envoy (or built-in)
Complexity | High | Low | Medium
Performance overhead | ~3-5ms p99 latency added | ~1-2ms p99 latency added | ~2-4ms p99 latency added
mTLS | Yes (automatic) | Yes (automatic) | Yes (intentions)
Traffic splitting | Yes (VirtualService) | Yes (TrafficSplit) | Yes (service-splitter)
Multi-cluster | Yes | Yes | Yes (WAN federation)

When Is the Overhead Worth It?

Service mesh is not a replacement for service discovery. It builds on top of service discovery. Istio uses Kubernetes service discovery under the hood. The sidecar proxy consumes the same service registry data — it just handles the routing, retries, and encryption transparently.

Kubernetes Service Discovery

Kubernetes has the most comprehensive built-in service discovery of any orchestration platform. When you create a Service object, Kubernetes automatically creates DNS entries, manages endpoints, and load-balances traffic.

Service Types

ClusterIP (Default)

Creates a virtual IP (ClusterIP) accessible only within the cluster. kube-proxy programs iptables/IPVS rules to distribute traffic to backend pods.

apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
spec:
  type: ClusterIP              # default
  selector:
    app: payment               # matches pod labels
    version: v2
  ports:
  - name: http
    port: 80                   # service port (virtual)
    targetPort: 8080           # container port (actual)
    protocol: TCP
  - name: grpc
    port: 9090
    targetPort: 9090

# Access from any pod in the cluster:
# http://payment-service.production.svc.cluster.local:80
# or simply: http://payment-service:80 (within same namespace)

Headless Service (clusterIP: None)

No virtual IP is assigned. DNS returns individual pod IPs directly. Used for stateful workloads (databases, Kafka brokers) where clients need to connect to specific pods.

apiVersion: v1
kind: Service
metadata:
  name: cassandra
  namespace: data
spec:
  clusterIP: None              # headless!
  selector:
    app: cassandra
  ports:
  - port: 9042

# DNS returns all pod IPs:
# cassandra.data.svc.cluster.local → 10.244.1.5, 10.244.2.8, 10.244.3.12
# Individual pods are addressable:
# cassandra-0.cassandra.data.svc.cluster.local → 10.244.1.5
# cassandra-1.cassandra.data.svc.cluster.local → 10.244.2.8

NodePort

Exposes the service on a static port on every node’s IP. External traffic can reach the service via <NodeIP>:<NodePort>.

apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  type: NodePort
  selector:
    app: payment
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080           # accessible at <any-node-ip>:30080
    # range: 30000-32767

LoadBalancer

Provisions a cloud provider’s load balancer (AWS NLB/ALB, GCP LB, Azure LB) that routes external traffic to the service.

apiVersion: v1
kind: Service
metadata:
  name: payment-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: payment
  ports:
  - port: 443
    targetPort: 8080

Ingress

An Ingress is not a Service type but a separate resource that provides HTTP/HTTPS routing, TLS termination, and name-based virtual hosting. An Ingress Controller (NGINX, Traefik, HAProxy) watches Ingress resources and configures the actual routing.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /payments
        pathType: Prefix
        backend:
          service:
            name: payment-service
            port:
              number: 80
      - path: /orders
        pathType: Prefix
        backend:
          service:
            name: order-service
            port:
              number: 80
      - path: /inventory
        pathType: Prefix
        backend:
          service:
            name: inventory-service
            port:
              number: 80

How Pods Find Each Other

Kubernetes provides two mechanisms for pods to discover services:

1. Environment Variables

When a pod starts, kubelet injects environment variables for every active Service in the same namespace:

# Automatically injected into every pod:
PAYMENT_SERVICE_SERVICE_HOST=10.100.45.123
PAYMENT_SERVICE_SERVICE_PORT=80
PAYMENT_SERVICE_PORT=tcp://10.100.45.123:80
PAYMENT_SERVICE_PORT_80_TCP_ADDR=10.100.45.123
PAYMENT_SERVICE_PORT_80_TCP_PORT=80

# Limitation: only services that exist BEFORE the pod starts
# get injected. Services created after pod start are invisible.

2. DNS (Preferred)

CoreDNS dynamically updates as Services and Endpoints change. No restart needed. This is the recommended approach:

# In application code — just use the DNS name
import requests

# Same namespace — short name works
response = requests.get("http://payment-service:80/api/charge")

# Different namespace — use FQDN
response = requests.get(
    "http://payment-service.production.svc.cluster.local:80/api/charge"
)

# Headless service — get all pod IPs
import dns.resolver
answers = dns.resolver.resolve(
    'cassandra.data.svc.cluster.local', 'A'
)
pod_ips = [str(rdata) for rdata in answers]
# ['10.244.1.5', '10.244.2.8', '10.244.3.12']

The Endpoints and EndpointSlice Objects

Behind every Service is an Endpoints object (or EndpointSlice in modern K8s) that tracks which pod IPs are ready to receive traffic. When a pod fails its readiness probe, it is removed from the EndpointSlice, and kube-proxy/CoreDNS stop routing traffic to it.

# View endpoints for a service
$ kubectl get endpoints payment-service -n production
NAME              ENDPOINTS                                       AGE
payment-service   10.244.1.15:8080,10.244.2.23:8080,10.244.3.8:8080   5d

# EndpointSlice (K8s 1.21+, more scalable)
$ kubectl get endpointslice -l kubernetes.io/service-name=payment-service
NAME                      ADDRESSTYPE   PORTS   ENDPOINTS           AGE
payment-service-abc12     IPv4          8080    10.244.1.15 + 2...  5d

Summary

With service discovery mastered, we can explore the infrastructure that sits between clients and services: Proxies — Forward, Reverse, and Beyond.