High Level Design Series · Foundations · Part 2

Scaling: Vertical vs Horizontal

Why Scaling Matters

Every successful application starts small — a single server, a modest database, a handful of users. But success brings traffic. And traffic, if left unmanaged, breaks things. Scaling is the engineering discipline of ensuring your system can handle growing load without degrading the user experience.

Growth Scenarios

Consider the real-world triggers that demand scaling decisions: a launch that goes viral overnight, a seasonal spike like Black Friday, a marketing campaign that triples sign-ups in a week, or steady month-over-month growth that quietly outgrows the machine you started on. Each one forces the same choice — scale up, or scale out.

What Happens When a Single Server Can't Handle Load?

When a single server is overwhelmed, the failure cascade is predictable and devastating:

  1. Response times spike. A typical API call that takes 50ms under normal load suddenly takes 2–5 seconds as the CPU saturates and the request queue backs up.
  2. Timeouts and errors. Clients start seeing HTTP 503 (Service Unavailable) and 504 (Gateway Timeout) errors. Connection pools exhaust.
  3. Cascading failures. Backed-up requests consume memory. The OS starts swapping to disk. Database connections time out, causing the application layer to retry — amplifying the load further.
  4. Complete outage. The server becomes unresponsive. The OOM killer terminates processes. Users see blank pages.
Key insight: Scaling isn't just about handling more users — it's about maintaining performance guarantees as load increases. A system that technically handles 1 million requests but with 30-second response times is effectively broken.

How Startups Evolve: From Single Server to Distributed

Most successful startups follow a remarkably similar scaling trajectory:

  1. Phase 1 — Single box (0–1K users): Everything runs on one machine — app server, database, file storage. A $20/month VPS handles it fine. This is the right choice: shipping fast matters more than architecture.
  2. Phase 2 — Separate the database (1K–10K users): The database gets its own server. Now the app server has more CPU and RAM for handling requests, and the DB server has dedicated I/O bandwidth.
  3. Phase 3 — Add a load balancer (10K–100K users): A second app server sits behind a load balancer. Read replicas appear for the database. A Redis cache reduces DB hits by 70–80%.
  4. Phase 4 — Service decomposition (100K–1M users): The monolith splits into services. Background jobs move to worker queues. CDN handles static assets. Auto-scaling policies manage compute.
  5. Phase 5 — Full distributed system (1M+ users): Database sharding, multi-region deployment, event-driven architecture, dedicated teams per service. This is Instagram, Uber, Netflix territory.

Vertical Scaling (Scale Up)

Vertical scaling means upgrading the hardware of your existing machine — more CPU cores, more RAM, faster SSDs, better network cards. You're making one server more powerful rather than adding more servers.

How It Works

In practice, vertical scaling is the simplest scaling strategy: you keep the same architecture and move the workload onto a bigger machine — in the cloud, typically by stopping the instance, changing its instance type, and starting it again; on your own hardware, by adding RAM, CPUs, or faster disks.

Advantages

| Advantage | Why It Matters |
| --- | --- |
| Zero code changes | Your application doesn't need to know it's running on a bigger machine. No distributed system complexity. |
| No data partitioning | All data lives on one machine. No sharding logic, no cross-node joins, no distributed transactions. |
| ACID simplicity | Single-node databases give you strong consistency for free. No CAP theorem headaches. |
| Lower operational cost | One server to monitor, back up, patch, and secure. One point of failure, but also one point of management. |
| Inter-process communication | Function calls instead of network calls. Nanoseconds instead of milliseconds. |

Limitations

  1. Hardware ceiling. There is always a biggest machine you can buy — and as the instance table below shows, the ladder ends.
  2. Single point of failure. One machine means one crash, one bad deploy, or one hardware fault takes everything down.
  3. Cost climbs steeply. The top of the range is specialized hardware with specialized pricing — the table below ends at $200,000+ per month.
  4. Downtime to upgrade. Resizing usually means stopping the instance; there is no rolling upgrade for a single box.

When to use vertical scaling: Databases (especially relational), stateful services that are hard to distribute, early-stage startups (under 10K users), and situations where development speed trumps scalability architecture.

Real-World AWS Instance Progression

Here's a concrete scaling path using AWS EC2:

| Instance | vCPUs | RAM | ~Cost/month | Use Case |
| --- | --- | --- | --- | --- |
| t3.micro | 2 | 1 GB | $8 | Dev/test, personal projects |
| t3.large | 2 | 8 GB | $60 | Small production apps |
| m6i.xlarge | 4 | 16 GB | $140 | Medium traffic web apps |
| m6i.4xlarge | 16 | 64 GB | $560 | Databases, heavy compute |
| r6i.16xlarge | 64 | 512 GB | $3,200 | Large in-memory workloads |
| x2idn.24xlarge | 96 | 1.5 TB | $14,000 | SAP HANA, massive databases |
| u-24tb1.112xlarge | 448 | 24 TB | $200,000+ | The ceiling — you've run out of vertical |

Horizontal Scaling (Scale Out)

Horizontal scaling means adding more machines to your pool and distributing the workload across them. Instead of one powerful server, you run many smaller servers behind a load balancer.

How It Works

The fundamental idea is simple: if one server handles 1,000 requests per second (RPS), ten identical servers behind a load balancer handle ~10,000 RPS. In practice, it's more nuanced, but this linear-ish relationship is the core value proposition.

                    ┌─────────────┐
                    │   Clients   │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ Load Balancer│
                    └──┬───┬───┬──┘
                       │   │   │
              ┌────────┘   │   └────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │ Server 1 │ │ Server 2 │ │ Server 3 │
        └──────────┘ └──────────┘ └──────────┘
              │            │            │
              └────────┬───┴────────────┘
                       │
                ┌──────▼──────┐
                │  Database   │
                └─────────────┘

Advantages

| Advantage | Why It Matters |
| --- | --- |
| No single point of failure | If one server dies, the load balancer routes traffic to healthy servers. Users never notice. |
| Linear cost scaling | Need 2× capacity? Add 2× servers. Cost scales proportionally, not exponentially. |
| Theoretically unlimited | There's no hardware ceiling — you can always add more machines to the pool. |
| Rolling deployments | Update servers one at a time. Zero-downtime deployments become possible. |
| Geographic distribution | Place servers in multiple regions to reduce latency for global users. |

Challenges

  1. State. Any request can land on any server, so sessions, uploads, and local caches can no longer live in one machine's memory or on its disk.
  2. Load balancing. You need a new component in front of the pool — and it must not become a bottleneck or a single point of failure itself.
  3. Data consistency. Once state is shared across nodes, you inherit replication lag, race conditions, and coordination problems that single-node systems never see.
  4. Operational complexity. More machines means more deploys, more monitoring, and more ways for things to fail partially.

Session Management Strategies

When you scale horizontally, the question of "where does user state live?" becomes critical. Here are the main approaches:

1. Sticky Sessions (Session Affinity)

The load balancer remembers which server handles each user and always routes that user's requests to the same server. Typically implemented via cookies or IP hashing.
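
For example, session affinity with NGINX can be a one-line change to the upstream block — a minimal sketch, assuming three app servers reachable at the placeholder hostnames shown:

# NGINX: hash the client IP so each client always reaches the same backend
upstream app_pool {
    ip_hash;                       # session affinity via client IP
    server app1.internal:3000;
    server app2.internal:3000;
    server app3.internal:3000;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_pool;
    }
}

The trade-off: if a pinned server goes down, every session living on it is lost, and clients behind a shared NAT all hash to the same backend, skewing the load.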

2. External Session Store (Recommended)

Store sessions in a shared external store (Redis, Memcached, or a database) that all servers can access. Each server is stateless — any server can handle any request.

// Express.js with a Redis session store — any server can read any session
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

const app = express();

const redisClient = createClient({ url: 'redis://session-store:6379' });
redisClient.connect().catch(console.error);  // node-redis v4 clients must connect explicitly

app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
  cookie: { maxAge: 3600000 }  // 1 hour
}));

3. Client-Side Sessions (JWT)

Encode session data into a signed token (JWT) stored in the client's browser. The server is completely stateless — it just verifies the token signature.
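
A minimal sketch with the jsonwebtoken package — it reuses the Express app from the snippets above, and the payload fields, header handling, and JWT_SECRET variable are illustrative:

// Client-side sessions: all session data travels inside a signed token
const jwt = require('jsonwebtoken');

app.post('/login', (req, res) => {
  // Sign the session data; any server holding the secret can verify it later
  const token = jwt.sign(
    { userId: req.body.userId },
    process.env.JWT_SECRET,
    { expiresIn: '1h' }
  );
  res.json({ token });
});

app.get('/cart', (req, res) => {
  try {
    // Verify the signature — no session store lookup, no shared state
    const token = (req.headers.authorization || '').replace('Bearer ', '');
    const claims = jwt.verify(token, process.env.JWT_SECRET);
    res.json({ userId: claims.userId, cart: [] });  // cart itself would come from the database
  } catch (err) {
    res.status(401).json({ error: 'Invalid or expired token' });
  }
});

The catch: a token can't be revoked before it expires without reintroducing shared state (a deny-list), so keep lifetimes short and put only small, non-sensitive claims in the payload.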

▶ Scaling Visualization

Step through to see vertical and horizontal scaling in action.

Stateless vs Stateful Architecture

The distinction between stateless and stateful services is perhaps the most important concept in horizontal scaling. It determines how easily your system can grow.

The Problem with Stateful Services

A stateful service stores client-specific data in local memory between requests. A classic example: a web server that keeps user sessions in a local HashMap.

// Stateful — session stored in-process memory
// (assumes Express with the cookie-parser middleware; generateId() is any unique-ID helper)
const sessions = {};  // local to THIS server

app.post('/login', (req, res) => {
  const sessionId = generateId();
  sessions[sessionId] = { userId: req.body.userId, cart: [] };
  res.cookie('sid', sessionId);
  res.json({ ok: true });
});

app.get('/cart', (req, res) => {
  const session = sessions[req.cookies.sid]; // only works on THIS server!
  if (!session) return res.status(401).json({ error: 'Session not found' });
  res.json(session.cart);
});

This breaks the moment you add a second server. The user logs in on Server 1 (which stores their session), but the load balancer sends their next request to Server 2 (which has no knowledge of that session). Result: 401 Unauthorized.

Designing for Statelessness

A stateless service treats each request as independent. It carries no memory of previous interactions. All necessary state is either sent by the client with each request (tokens, IDs, parameters) or fetched from a shared backing store — a database or cache — that every server can reach:

// Stateless — session stored in Redis (shared across all servers)
const redis = require('redis').createClient({ url: 'redis://cache:6379' });
redis.connect().catch(console.error);  // node-redis v4 clients must connect before use

app.post('/login', async (req, res) => {
  const sessionId = generateId();
  await redis.hSet(`session:${sessionId}`, {
    userId: req.body.userId,
    cart: JSON.stringify([])
  });
  await redis.expire(`session:${sessionId}`, 3600);  // sessions expire after 1 hour
  res.cookie('sid', sessionId);
  res.json({ ok: true });
});

app.get('/cart', async (req, res) => {
  const session = await redis.hGetAll(`session:${req.cookies.sid}`);
  if (!session.userId) return res.status(401).json({ error: 'Session expired' });
  res.json(JSON.parse(session.cart));
});

Now any server in the pool can handle any request. The load balancer has complete freedom. Servers can be added, removed, or replaced without any coordination.

Shared-Nothing Architecture

The gold standard for horizontally scalable systems is shared-nothing architecture: each node is fully independent and self-sufficient. Nodes don't share memory, disk, or any other resource. They communicate only through well-defined network APIs.

This architecture gives you near-linear horizontal scaling (add a node, get more capacity), clean fault isolation (a dead node takes only its own in-flight requests with it), and simple operations (every node is interchangeable).

Google's original infrastructure, Amazon's Dynamo, and Cassandra all follow shared-nothing principles.

Rule of thumb: If you can restart any server in your cluster and users don't notice, congratulations — you've achieved statelessness. If restarting a server logs out users or loses data, you have hidden state to extract.

Database Scaling

While application servers are relatively easy to scale horizontally (make them stateless, add more behind a load balancer), databases are the real scaling bottleneck. They hold state by definition, and distributing state is hard.

Read Replicas

Most applications are read-heavy (80–95% reads). Read replicas let you distribute read traffic across multiple copies of your database while sending all writes to a single primary.

                  ┌───────────────┐
    Writes ──────►│  Primary (RW) │
                  └───────┬───────┘
                    │     │     │    Replication
              ┌─────┘     │     └─────┐
              ▼           ▼           ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
   Reads│ Replica 1│ │ Replica 2│ │ Replica 3│
        └──────────┘ └──────────┘ └──────────┘

Vertical Scaling for Writes

Write scaling is harder. In a single-primary setup, all writes funnel through one node. Your options:

  1. Scale the primary vertically — a bigger box with faster disks. Simple, and it buys a surprisingly long runway.
  2. Reduce write volume — batch inserts, buffer bursts through a queue, and move non-critical writes to asynchronous background jobs.
  3. Split writes by function — give each service its own database (the Y-axis of the Scale Cube below).
  4. Shard — partition the data itself across multiple primaries (the Z-axis), covered next.

Sharding (Horizontal Partitioning)

Sharding splits your data across multiple database servers, each holding a subset. Each shard is a fully independent database instance.

| Strategy | How It Works | Trade-off |
| --- | --- | --- |
| Range-based | Users A–M on Shard 1, N–Z on Shard 2 | Uneven distribution (some ranges are hotter) |
| Hash-based | shard = hash(userId) % num_shards | Even distribution, but resharding is painful |
| Directory-based | A lookup service maps each key to its shard | Flexible, but the directory is a bottleneck/SPOF |
| Consistent hashing | Hash ring minimizes redistribution on shard changes | Best for dynamic scaling; more complex to implement |

We'll cover sharding in deep detail in a dedicated post later in this series.

Connection Pooling

Each database connection consumes ~10 MB of memory on the server side. With 50 app servers each opening 20 connections, that's 1,000 connections × 10 MB = 10 GB just for connections. Connection poolers like PgBouncer (for PostgreSQL) or ProxySQL (for MySQL) sit between your app and database, multiplexing many app connections over fewer database connections.

# PgBouncer config snippet
[databases]
mydb = host=primary.db.internal port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction    # connections returned after each transaction
max_client_conn = 2000     # accept up to 2000 app connections
default_pool_size = 50     # but only use 50 actual DB connections

Caching Layer

The fastest database query is the one you never make. A caching layer (typically Redis or Memcached) can absorb 80–95% of read traffic, dramatically reducing database load.
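
The standard pattern is cache-aside: check the cache, fall back to the database on a miss, then populate the cache with a TTL. A sketch reusing the Redis client from earlier snippets plus an assumed pg Pool named db; the key format and TTL are illustrative:

// Cache-aside read path: Redis absorbs repeat reads, the database stays the source of truth
// (assumes a connected node-redis client `redis` and a pg Pool `db`, as in earlier snippets)
async function getProduct(productId) {
  const cacheKey = `product:${productId}`;

  // 1. Try the cache first
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // 2. Miss — fall back to the database
  const { rows } = await db.query('SELECT * FROM products WHERE id = $1', [productId]);
  if (rows.length === 0) return null;

  // 3. Populate the cache with a TTL so stale entries expire on their own
  await redis.set(cacheKey, JSON.stringify(rows[0]), { EX: 300 });  // 5 minutes
  return rows[0];
}

On writes, delete or overwrite the key so readers don't keep serving stale data until the TTL runs out.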

Auto-Scaling

Manual scaling doesn't work at internet scale. You need systems that automatically add or remove capacity based on real-time demand. This is auto-scaling — letting infrastructure respond to load as it happens.

Reactive vs Predictive Auto-Scaling

| Type | How It Works | Best For |
| --- | --- | --- |
| Reactive | Monitors metrics in real time. When CPU exceeds 70% for 5 minutes, add 2 instances. | Unpredictable traffic patterns, general-purpose scaling |
| Predictive | Uses ML on historical data to predict load. Pre-scales 15 minutes before the expected spike. | Predictable patterns (daily peaks, weekly cycles) |
| Scheduled | Manually configured time-based rules. Scale up at 8 AM, scale down at 10 PM. | Known events (business hours, batch jobs) |

Scaling Policies

AWS Auto Scaling supports several policy types:

  1. Target tracking — pick a metric and a target value (e.g., keep average CPU at 60%) and AWS adds or removes instances to hold it there. This is what the example below uses.
  2. Step scaling — add or remove a set number of instances as a metric crosses successive alarm thresholds.
  3. Simple scaling — a single adjustment per alarm, with a cool-down before the next action.
  4. Scheduled scaling — change capacity at fixed times (the "Scheduled" row in the table above).
  5. Predictive scaling — forecast load from history and scale out ahead of the expected spike.

Cool-Down Periods

Without cool-down, auto-scaling oscillates rapidly:

  1. CPU hits 80% → add 2 instances.
  2. New instances haven't warmed up yet → CPU still high → add 2 more.
  3. All instances warm up → CPU drops to 20% → remove 3 instances.
  4. Remaining instances overwhelmed → CPU spikes → add more instances.

Cool-down periods (typically 300 seconds) prevent this thrashing by blocking scaling actions after a recent change, giving the system time to stabilize.

AWS Auto Scaling Group Example

# CloudFormation snippet for an Auto Scaling Group
MyASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2           # Never fewer than 2 (high availability)
    MaxSize: 20          # Cost guardrail
    DesiredCapacity: 4   # Start with 4
    HealthCheckType: ELB
    HealthCheckGracePeriod: 120
    LaunchTemplate:
      LaunchTemplateId: !Ref MyLaunchTemplate
      Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
    TargetGroupARNs:
      - !Ref MyTargetGroup

CPUScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref MyASG
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60.0    # Keep CPU around 60%
    EstimatedInstanceWarmup: 120   # seconds before a new instance's metrics count toward the average

Cost Optimization with Spot Instances

AWS Spot Instances offer up to 90% discount over On-Demand pricing, but can be reclaimed with 2 minutes' notice. Smart auto-scaling strategies blend instance types: a small baseline of On-Demand (or Reserved) instances guarantees minimum capacity, while Spot instances absorb the traffic above it — so a reclaimed Spot node costs you headroom, not availability.
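
In CloudFormation, this blend is expressed as a mixed instances policy on the Auto Scaling Group — a sketch extending the ASG above, where the split percentages and instance types are illustrative:

# Mixed On-Demand + Spot capacity for the Auto Scaling Group
MyASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2
    MaxSize: 20
    MixedInstancesPolicy:
      InstancesDistribution:
        OnDemandBaseCapacity: 2                     # always keep 2 On-Demand instances
        OnDemandPercentageAboveBaseCapacity: 25     # above that, 25% On-Demand / 75% Spot
        SpotAllocationStrategy: capacity-optimized  # prefer Spot pools least likely to be reclaimed
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref MyLaunchTemplate
          Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
        Overrides:                                  # several equivalent types widen the Spot pool
          - InstanceType: m6i.xlarge
          - InstanceType: m5.xlarge
          - InstanceType: m5a.xlarge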

A typical SaaS application spending $50K/month on EC2 can reduce this to ~$20K with an aggressive Spot strategy.

Scaling Patterns: The Scale Cube

In 2009, Martin Abbott and Michael Fisher introduced the Scale Cube — a framework that categorizes all scaling approaches into three axes. Every scaling decision you make falls along one (or more) of these axes.

X-Axis: Horizontal Cloning

Run N identical copies of your application behind a load balancer. Each copy can handle any request. This is the most common and simplest scaling approach.

Y-Axis: Functional Decomposition

Split your monolith into smaller services, each responsible for a specific function. This is the microservices approach.

Z-Axis: Data Partitioning (Sharding)

Each server runs the same code but is responsible for only a subset of data. A router sends each request to the correct shard based on a partition key.
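
The router itself can be tiny. A hash-based sketch in the same Node style as the earlier snippets — the four shard pools are assumed to exist already:

// Z-axis routing: map a partition key (userId) to one of N shard databases
const crypto = require('crypto');

const shards = [shard0, shard1, shard2, shard3];  // one pg Pool per shard (placeholders)

function shardFor(userId) {
  // Hash the key so users spread evenly, then pick a shard by modulo
  const digest = crypto.createHash('md5').update(String(userId)).digest();
  return shards[digest.readUInt32BE(0) % shards.length];
}

async function getUser(userId) {
  return shardFor(userId).query('SELECT * FROM users WHERE id = $1', [userId]);
}

This is the hash-based strategy from the sharding table above; swapping the modulo for a hash ring gives you consistent hashing.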

| Axis | What It Scales | Mechanism | Complexity |
| --- | --- | --- | --- |
| X (Clone) | Request throughput | Identical replicas + load balancer | Low |
| Y (Decompose) | Feature complexity | Microservices | Medium-High |
| Z (Partition) | Data volume & writes | Sharding by partition key | High |
In practice: Most mature systems use all three axes simultaneously. Netflix uses X-axis (many replicas of each service), Y-axis (hundreds of microservices), and Z-axis (sharded data stores like Cassandra) — all at the same time.

▶ The Scale Cube

Step through each scaling axis: X (clone), Y (decompose), Z (partition).

Real-World Case Studies

Instagram: From 1 to 400M Users

When Instagram launched in October 2010, it had 2 engineers and a single server. Within 2 years, it had 100 million users. Here's how they scaled:

Instagram's #1 rule: "Do the simplest thing that will work." They vertically scaled PostgreSQL as far as it would go before sharding, and horizontally scaled stateless Django servers as the first scaling move.

Netflix: Monolith to Microservices Migration

Netflix's migration from a single-datacenter monolith to a cloud-native microservices architecture is the canonical scaling story. A 2008 database corruption incident halted DVD shipments for days and convinced the company it could no longer bet on scaling up its own data center; the move to AWS that followed took roughly seven years, wrapping up in early 2016.

Key scaling decisions:

  1. Go all-in on the public cloud — rebuild for AWS rather than lift-and-shift the monolith.
  2. Decompose into microservices — hundreds of services, each deployed and scaled independently (the Y-axis).
  3. Distribute the data tier — replicated stores such as Cassandra instead of one central database.
  4. Design for failure — Chaos Monkey and the rest of the Simian Army kill instances in production on purpose to prove the system survives.
  5. Push delivery to the edge — Open Connect, Netflix's own CDN, keeps video traffic off the application tier entirely.

Lessons Learned

| Lesson | Details |
| --- | --- |
| Start simple | Both Instagram and Netflix started with monoliths. Don't prematurely optimize for scale you don't have. |
| Scale the bottleneck | Profile first. If 90% of load is database reads, add read replicas and caching before adding more app servers. |
| Statelessness first | Making your application tier stateless is the highest-leverage scaling move. It unlocks horizontal scaling, auto-scaling, and rolling deploys. |
| Vertical before horizontal | Vertical scaling is simpler and cheaper at moderate scale. Instagram ran a single PostgreSQL instance for millions of users. |
| Automate everything | Manual scaling doesn't work at internet scale. Invest in auto-scaling, CI/CD, and infrastructure-as-code early. |
| Design for failure | At scale, hardware failure is not an exception — it's a constant. Netflix designed for it with Chaos Monkey; Instagram relied on AWS managed failover. |

Summary

Scaling is the fundamental challenge of system design. Here's the decision framework:

  1. Vertical first — it's simple, cheap, and gets you far. Most apps never outgrow a well-provisioned single server with a read replica.
  2. Make services stateless — this is the prerequisite for everything else. Extract state to Redis/databases.
  3. Horizontal for compute — stateless app servers behind a load balancer with auto-scaling.
  4. Cache aggressively — reduce database load by 80–95% before thinking about database scaling.
  5. Shard data as a last resort — sharding introduces permanent complexity. Exhaust all other options first.
  6. Use the Scale Cube — think about X (clone), Y (decompose), and Z (partition) as complementary strategies.

In the next post, we'll dive deep into Load Balancing — the critical piece that makes horizontal scaling work.