High Level Design Series · Foundations · Part 2

Scaling: Vertical vs Horizontal

Why Scaling Matters

Every successful application starts small — a single server, a modest database, a handful of users. But success brings traffic. And traffic, if left unmanaged, breaks things. Scaling is the engineering discipline of ensuring your system can handle growing load without degrading the user experience.

Growth Scenarios

Consider the real-world triggers that demand scaling decisions: a launch that goes viral overnight, a seasonal spike like Black Friday, a marketing campaign that triples sign-ups in a week, or steady month-over-month growth that quietly outgrows the machine you started on. Each one forces the same choice — scale up, or scale out.

What Happens When a Single Server Can't Handle Load?

When a single server is overwhelmed, the failure cascade is predictable and devastating:

  1. Response times spike. A typical API call that takes 50ms under normal load suddenly takes 2–5 seconds as the CPU saturates and the request queue backs up.
  2. Timeouts and errors. Clients start seeing HTTP 503 (Service Unavailable) and 504 (Gateway Timeout) errors. Connection pools exhaust.
  3. Cascading failures. Backed-up requests consume memory. The OS starts swapping to disk. Database connections time out, causing the application layer to retry — amplifying the load further.
  4. Complete outage. The server becomes unresponsive. The OOM killer terminates processes. Users see blank pages.
Key insight: Scaling isn't just about handling more users — it's about maintaining performance guarantees as load increases. A system that technically handles 1 million requests but with 30-second response times is effectively broken.

How Startups Evolve: From Single Server to Distributed

Most successful startups follow a remarkably similar scaling trajectory:

  1. Phase 1 — Single box (0–1K users): Everything runs on one machine — app server, database, file storage. A $20/month VPS handles it fine. This is the right choice: shipping fast matters more than architecture.
  2. Phase 2 — Separate the database (1K–10K users): The database gets its own server. Now the app server has more CPU and RAM for handling requests, and the DB server has dedicated I/O bandwidth.
  3. Phase 3 — Add a load balancer (10K–100K users): A second app server sits behind a load balancer. Read replicas appear for the database. A Redis cache reduces DB hits by 70–80%.
  4. Phase 4 — Service decomposition (100K–1M users): The monolith splits into services. Background jobs move to worker queues. CDN handles static assets. Auto-scaling policies manage compute.
  5. Phase 5 — Full distributed system (1M+ users): Database sharding, multi-region deployment, event-driven architecture, dedicated teams per service. This is Instagram, Uber, Netflix territory.

Vertical Scaling (Scale Up)

Vertical scaling means upgrading the hardware of your existing machine — more CPU cores, more RAM, faster SSDs, better network cards. You're making one server more powerful rather than adding more servers.

How It Works

In practice, vertical scaling is the simplest scaling strategy: you keep the same architecture and move the workload onto a bigger machine — in the cloud, typically by stopping the instance, changing its instance type, and starting it again; on your own hardware, by adding RAM, CPUs, or faster disks.

Advantages

| Advantage | Why It Matters |
| --- | --- |
| Zero code changes | Your application doesn't need to know it's running on a bigger machine. No distributed system complexity. |
| No data partitioning | All data lives on one machine. No sharding logic, no cross-node joins, no distributed transactions. |
| ACID simplicity | Single-node databases give you strong consistency for free. No CAP theorem headaches. |
| Lower operational cost | One server to monitor, back up, patch, and secure. One point of failure, but also one point of management. |
| Inter-process communication | Function calls instead of network calls. Nanoseconds instead of milliseconds. |

Limitations

  1. Hardware ceiling. There is always a biggest machine you can buy — and as the instance table below shows, the ladder ends.
  2. Single point of failure. One machine means one crash, one bad deploy, or one hardware fault takes everything down.
  3. Cost climbs steeply. The top of the range is specialized hardware with specialized pricing — the table below ends at $200,000+ per month.
  4. Downtime to upgrade. Resizing usually means stopping the instance; there is no rolling upgrade for a single box.

When to use vertical scaling: Databases (especially relational), stateful services that are hard to distribute, early-stage startups (under 10K users), and situations where development speed trumps scalability architecture.

Real-World AWS Instance Progression

Here's a concrete scaling path using AWS EC2:

| Instance | vCPUs | RAM | ~Cost/month | Use Case |
| --- | --- | --- | --- | --- |
| t3.micro | 2 | 1 GB | $8 | Dev/test, personal projects |
| t3.large | 2 | 8 GB | $60 | Small production apps |
| m6i.xlarge | 4 | 16 GB | $140 | Medium traffic web apps |
| m6i.4xlarge | 16 | 64 GB | $560 | Databases, heavy compute |
| r6i.16xlarge | 64 | 512 GB | $3,200 | Large in-memory workloads |
| x2idn.24xlarge | 96 | 1.5 TB | $14,000 | SAP HANA, massive databases |
| u-24tb1.112xlarge | 448 | 24 TB | $200,000+ | The ceiling — you've run out of vertical |

Horizontal Scaling (Scale Out)

Horizontal scaling means adding more machines to your pool and distributing the workload across them. Instead of one powerful server, you run many smaller servers behind a load balancer.

How It Works

The fundamental idea is simple: if one server handles 1,000 requests per second (RPS), ten identical servers behind a load balancer handle ~10,000 RPS. In practice, it's more nuanced, but this linear-ish relationship is the core value proposition.

                    ┌─────────────┐
                    │   Clients   │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ Load Balancer│
                    └──┬───┬───┬──┘
                       │   │   │
              ┌────────┘   │   └────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │ Server 1 │ │ Server 2 │ │ Server 3 │
        └──────────┘ └──────────┘ └──────────┘
              │            │            │
              └────────┬───┴────────────┘
                       │
                ┌──────▼──────┐
                │  Database   │
                └─────────────┘

Advantages

| Advantage | Why It Matters |
| --- | --- |
| No single point of failure | If one server dies, the load balancer routes traffic to healthy servers. Users never notice. |
| Linear cost scaling | Need 2× capacity? Add 2× servers. Cost scales proportionally, not exponentially. |
| Theoretically unlimited | There's no hardware ceiling — you can always add more machines to the pool. |
| Rolling deployments | Update servers one at a time. Zero-downtime deployments become possible. |
| Geographic distribution | Place servers in multiple regions to reduce latency for global users. |

Challenges

  1. State. Any request can land on any server, so sessions, uploads, and local caches can no longer live in one machine's memory or on its disk.
  2. Load balancing. You need a new component in front of the pool — and it must not become a bottleneck or a single point of failure itself.
  3. Data consistency. Once state is shared across nodes, you inherit replication lag, race conditions, and coordination problems that single-node systems never see.
  4. Operational complexity. More machines means more deploys, more monitoring, and more ways for things to fail partially.

Session Management Strategies

When you scale horizontally, the question of "where does user state live?" becomes critical. Here are the main approaches:

1. Sticky Sessions (Session Affinity)

The load balancer remembers which server handles each user and always routes that user's requests to the same server. Typically implemented via cookies or IP hashing.
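
For example, session affinity with NGINX can be a one-line change to the upstream block — a minimal sketch, assuming three app servers reachable at the placeholder hostnames shown:

# NGINX: hash the client IP so each client always reaches the same backend
upstream app_pool {
    ip_hash;                       # session affinity via client IP
    server app1.internal:3000;
    server app2.internal:3000;
    server app3.internal:3000;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_pool;
    }
}

The trade-off: if a pinned server goes down, every session living on it is lost, and clients behind a shared NAT all hash to the same backend, skewing the load.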

2. External Session Store (Recommended)

Store sessions in a shared external store (Redis, Memcached, or a database) that all servers can access. Each server is stateless — any server can handle any request.

// Express.js with a Redis session store — any server can read any session
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

const app = express();

const redisClient = createClient({ url: 'redis://session-store:6379' });
redisClient.connect().catch(console.error);  // node-redis v4 clients must connect explicitly

app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
  cookie: { maxAge: 3600000 }  // 1 hour
}));

3. Client-Side Sessions (JWT)

Encode session data into a signed token (JWT) stored in the client's browser. The server is completely stateless — it just verifies the token signature.
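
A minimal sketch with the jsonwebtoken package — it reuses the Express app from the snippets above, and the payload fields, header handling, and JWT_SECRET variable are illustrative:

// Client-side sessions: all session data travels inside a signed token
const jwt = require('jsonwebtoken');

app.post('/login', (req, res) => {
  // Sign the session data; any server holding the secret can verify it later
  const token = jwt.sign(
    { userId: req.body.userId },
    process.env.JWT_SECRET,
    { expiresIn: '1h' }
  );
  res.json({ token });
});

app.get('/cart', (req, res) => {
  try {
    // Verify the signature — no session store lookup, no shared state
    const token = (req.headers.authorization || '').replace('Bearer ', '');
    const claims = jwt.verify(token, process.env.JWT_SECRET);
    res.json({ userId: claims.userId, cart: [] });  // cart itself would come from the database
  } catch (err) {
    res.status(401).json({ error: 'Invalid or expired token' });
  }
});

The catch: a token can't be revoked before it expires without reintroducing shared state (a deny-list), so keep lifetimes short and put only small, non-sensitive claims in the payload.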

▶ Scaling Visualization

Step through to see vertical and horizontal scaling in action.

Stateless vs Stateful Architecture

The distinction between stateless and stateful services is perhaps the most important concept in horizontal scaling. It determines how easily your system can grow.

The Problem with Stateful Services

A stateful service stores client-specific data in local memory between requests. A classic example: a web server that keeps user sessions in a local HashMap.

// Stateful — session stored in-process memory
// (assumes Express with the cookie-parser middleware; generateId() is any unique-ID helper)
const sessions = {};  // local to THIS server

app.post('/login', (req, res) => {
  const sessionId = generateId();
  sessions[sessionId] = { userId: req.body.userId, cart: [] };
  res.cookie('sid', sessionId);
  res.json({ ok: true });
});

app.get('/cart', (req, res) => {
  const session = sessions[req.cookies.sid]; // only works on THIS server!
  if (!session) return res.status(401).json({ error: 'Session not found' });
  res.json(session.cart);
});

This breaks the moment you add a second server. The user logs in on Server 1 (which stores their session), but the load balancer sends their next request to Server 2 (which has no knowledge of that session). Result: 401 Unauthorized.

Designing for Statelessness

A stateless service treats each request as independent. It carries no memory of previous interactions. All necessary state is either sent by the client with each request (tokens, IDs, parameters) or fetched from a shared backing store — a database or cache — that every server can reach:

// Stateless — session stored in Redis (shared across all servers)
const redis = require('redis').createClient({ url: 'redis://cache:6379' });
redis.connect().catch(console.error);  // node-redis v4 clients must connect before use

app.post('/login', async (req, res) => {
  const sessionId = generateId();
  await redis.hSet(`session:${sessionId}`, {
    userId: req.body.userId,
    cart: JSON.stringify([])
  });
  await redis.expire(`session:${sessionId}`, 3600);  // sessions expire after 1 hour
  res.cookie('sid', sessionId);
  res.json({ ok: true });
});

app.get('/cart', async (req, res) => {
  const session = await redis.hGetAll(`session:${req.cookies.sid}`);
  if (!session.userId) return res.status(401).json({ error: 'Session expired' });
  res.json(JSON.parse(session.cart));
});

Now any server in the pool can handle any request. The load balancer has complete freedom. Servers can be added, removed, or replaced without any coordination.

Shared-Nothing Architecture

The gold standard for horizontally scalable systems is shared-nothing architecture: each node is fully independent and self-sufficient. Nodes don't share memory, disk, or any other resource. They communicate only through well-defined network APIs.

This architecture gives you near-linear horizontal scaling (add a node, get more capacity), clean fault isolation (a dead node takes only its own in-flight requests with it), and simple operations (every node is interchangeable).

Google's original infrastructure, Amazon's Dynamo, and Cassandra all follow shared-nothing principles.

Rule of thumb: If you can restart any server in your cluster and users don't notice, congratulations — you've achieved statelessness. If restarting a server logs out users or loses data, you have hidden state to extract.

Database Scaling

While application servers are relatively easy to scale horizontally (make them stateless, add more behind a load balancer), databases are the real scaling bottleneck. They hold state by definition, and distributing state is hard.

Read Replicas

Most applications are read-heavy (80–95% reads). Read replicas let you distribute read traffic across multiple copies of your database while sending all writes to a single primary.

                  ┌───────────────┐
    Writes ──────►│  Primary (RW) │
                  └───────┬───────┘
                    │     │     │    Replication
              ┌─────┘     │     └─────┐
              ▼           ▼           ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
   Reads│ Replica 1│ │ Replica 2│ │ Replica 3│
        └──────────┘ └──────────┘ └──────────┘

Vertical Scaling for Writes

Write scaling is harder. In a single-primary setup, all writes funnel through one node. Your options:

  1. Scale the primary vertically — a bigger box with faster disks. Simple, and it buys a surprisingly long runway.
  2. Reduce write volume — batch inserts, buffer bursts through a queue, and move non-critical writes to asynchronous background jobs.
  3. Split writes by function — give each service its own database (the Y-axis of the Scale Cube below).
  4. Shard — partition the data itself across multiple primaries (the Z-axis), covered next.

Sharding (Horizontal Partitioning)

Sharding splits your data across multiple database servers, each holding a subset. Each shard is a fully independent database instance.

| Strategy | How It Works | Trade-off |
| --- | --- | --- |
| Range-based | Users A–M on Shard 1, N–Z on Shard 2 | Uneven distribution (some ranges are hotter) |
| Hash-based | shard = hash(userId) % num_shards | Even distribution, but resharding is painful |
| Directory-based | A lookup service maps each key to its shard | Flexible, but the directory is a bottleneck/SPOF |
| Consistent hashing | Hash ring minimizes redistribution on shard changes | Best for dynamic scaling; more complex to implement |

We'll cover sharding in deep detail in a dedicated post later in this series.

Connection Pooling

Each database connection consumes ~10 MB of memory on the server side. With 50 app servers each opening 20 connections, that's 1,000 connections × 10 MB = 10 GB just for connections. Connection poolers like PgBouncer (for PostgreSQL) or ProxySQL (for MySQL) sit between your app and database, multiplexing many app connections over fewer database connections.

# PgBouncer config snippet
[databases]
mydb = host=primary.db.internal port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction    # connections returned after each transaction
max_client_conn = 2000     # accept up to 2000 app connections
default_pool_size = 50     # but only use 50 actual DB connections

Caching Layer

The fastest database query is the one you never make. A caching layer (typically Redis or Memcached) can absorb 80–95% of read traffic, dramatically reducing database load.
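
The standard pattern is cache-aside: check the cache, fall back to the database on a miss, then populate the cache with a TTL. A sketch reusing the Redis client from earlier snippets plus an assumed pg Pool named db; the key format and TTL are illustrative:

// Cache-aside read path: Redis absorbs repeat reads, the database stays the source of truth
// (assumes a connected node-redis client `redis` and a pg Pool `db`, as in earlier snippets)
async function getProduct(productId) {
  const cacheKey = `product:${productId}`;

  // 1. Try the cache first
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // 2. Miss — fall back to the database
  const { rows } = await db.query('SELECT * FROM products WHERE id = $1', [productId]);
  if (rows.length === 0) return null;

  // 3. Populate the cache with a TTL so stale entries expire on their own
  await redis.set(cacheKey, JSON.stringify(rows[0]), { EX: 300 });  // 5 minutes
  return rows[0];
}

On writes, delete or overwrite the key so readers don't keep serving stale data until the TTL runs out.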

Auto-Scaling

Manual scaling doesn't work at internet scale. You need systems that automatically add or remove capacity based on real-time demand. This is auto-scaling — letting infrastructure respond to load as it happens.

Reactive vs Predictive Auto-Scaling

| Type | How It Works | Best For |
| --- | --- | --- |
| Reactive | Monitors metrics in real time. When CPU exceeds 70% for 5 minutes, add 2 instances. | Unpredictable traffic patterns, general-purpose scaling |
| Predictive | Uses ML on historical data to predict load. Pre-scales 15 minutes before the expected spike. | Predictable patterns (daily peaks, weekly cycles) |
| Scheduled | Manually configured time-based rules. Scale up at 8 AM, scale down at 10 PM. | Known events (business hours, batch jobs) |

Scaling Policies

AWS Auto Scaling supports several policy types:

  1. Target tracking — pick a metric and a target value (e.g., keep average CPU at 60%) and AWS adds or removes instances to hold it there. This is what the example below uses.
  2. Step scaling — add or remove a set number of instances as a metric crosses successive alarm thresholds.
  3. Simple scaling — a single adjustment per alarm, with a cool-down before the next action.
  4. Scheduled scaling — change capacity at fixed times (the "Scheduled" row in the table above).
  5. Predictive scaling — forecast load from history and scale out ahead of the expected spike.

Cool-Down Periods

Without cool-down, auto-scaling oscillates rapidly:

  1. CPU hits 80% → add 2 instances.
  2. New instances haven't warmed up yet → CPU still high → add 2 more.
  3. All instances warm up → CPU drops to 20% → remove 3 instances.
  4. Remaining instances overwhelmed → CPU spikes → add more instances.

Cool-down periods (typically 300 seconds) prevent this thrashing by blocking scaling actions after a recent change, giving the system time to stabilize.

AWS Auto Scaling Group Example

# CloudFormation snippet for an Auto Scaling Group
MyASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2           # Never fewer than 2 (high availability)
    MaxSize: 20          # Cost guardrail
    DesiredCapacity: 4   # Start with 4
    HealthCheckType: ELB
    HealthCheckGracePeriod: 120
    LaunchTemplate:
      LaunchTemplateId: !Ref MyLaunchTemplate
      Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
    TargetGroupARNs:
      - !Ref MyTargetGroup

CPUScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref MyASG
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60.0    # Keep CPU around 60%
    EstimatedInstanceWarmup: 120   # seconds before a new instance's metrics count toward the average

Cost Optimization with Spot Instances

AWS Spot Instances offer up to 90% discount over On-Demand pricing, but can be reclaimed with 2 minutes' notice. Smart auto-scaling strategies blend instance types: a small baseline of On-Demand (or Reserved) instances guarantees minimum capacity, while Spot instances absorb the traffic above it — so a reclaimed Spot node costs you headroom, not availability.
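
In CloudFormation, this blend is expressed as a mixed instances policy on the Auto Scaling Group — a sketch extending the ASG above, where the split percentages and instance types are illustrative:

# Mixed On-Demand + Spot capacity for the Auto Scaling Group
MyASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2
    MaxSize: 20
    MixedInstancesPolicy:
      InstancesDistribution:
        OnDemandBaseCapacity: 2                     # always keep 2 On-Demand instances
        OnDemandPercentageAboveBaseCapacity: 25     # above that, 25% On-Demand / 75% Spot
        SpotAllocationStrategy: capacity-optimized  # prefer Spot pools least likely to be reclaimed
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref MyLaunchTemplate
          Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
        Overrides:                                  # several equivalent types widen the Spot pool
          - InstanceType: m6i.xlarge
          - InstanceType: m5.xlarge
          - InstanceType: m5a.xlarge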

A typical SaaS application spending $50K/month on EC2 can reduce this to ~$20K with an aggressive Spot strategy.

Scaling Patterns: The Scale Cube

In 2009, Martin Abbott and Michael Fisher introduced the Scale Cube — a framework that categorizes all scaling approaches into three axes. Every scaling decision you make falls along one (or more) of these axes.

X-Axis: Horizontal Cloning

Run N identical copies of your application behind a load balancer. Each copy can handle any request. This is the most common and simplest scaling approach.

Y-Axis: Functional Decomposition

Split your monolith into smaller services, each responsible for a specific function. This is the microservices approach.

Z-Axis: Data Partitioning (Sharding)

Each server runs the same code but is responsible for only a subset of data. A router sends each request to the correct shard based on a partition key.
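
The router itself can be tiny. A hash-based sketch in the same Node style as the earlier snippets — the four shard pools are assumed to exist already:

// Z-axis routing: map a partition key (userId) to one of N shard databases
const crypto = require('crypto');

const shards = [shard0, shard1, shard2, shard3];  // one pg Pool per shard (placeholders)

function shardFor(userId) {
  // Hash the key so users spread evenly, then pick a shard by modulo
  const digest = crypto.createHash('md5').update(String(userId)).digest();
  return shards[digest.readUInt32BE(0) % shards.length];
}

async function getUser(userId) {
  return shardFor(userId).query('SELECT * FROM users WHERE id = $1', [userId]);
}

This is the hash-based strategy from the sharding table above; swapping the modulo for a hash ring gives you consistent hashing.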

| Axis | What It Scales | Mechanism | Complexity |
| --- | --- | --- | --- |
| X (Clone) | Request throughput | Identical replicas + load balancer | Low |
| Y (Decompose) | Feature complexity | Microservices | Medium-High |
| Z (Partition) | Data volume & writes | Sharding by partition key | High |
In practice: Most mature systems use all three axes simultaneously. Netflix uses X-axis (many replicas of each service), Y-axis (hundreds of microservices), and Z-axis (sharded data stores like Cassandra) — all at the same time.

▶ The Scale Cube

Step through each scaling axis: X (clone), Y (decompose), Z (partition).

Real-World Case Studies

Instagram: From 1 to 400M Users

When Instagram launched in October 2010, it had 2 engineers and a single server. Within 2 years, it had 100 million users. Here's how they scaled:

Instagram's #1 rule: "Do the simplest thing that will work." They vertically scaled PostgreSQL as far as it would go before sharding, and horizontally scaled stateless Django servers as the first scaling move.

Netflix: Monolith to Microservices Migration

Netflix's migration from a single-datacenter monolith to a cloud-native microservices architecture is the canonical scaling story. A 2008 database corruption incident halted DVD shipments for days and convinced the company it could no longer bet on scaling up its own data center; the move to AWS that followed took roughly seven years, wrapping up in early 2016.

Key scaling decisions:

  1. Go all-in on the public cloud — rebuild for AWS rather than lift-and-shift the monolith.
  2. Decompose into microservices — hundreds of services, each deployed and scaled independently (the Y-axis).
  3. Distribute the data tier — replicated stores such as Cassandra instead of one central database.
  4. Design for failure — Chaos Monkey and the rest of the Simian Army kill instances in production on purpose to prove the system survives.
  5. Push delivery to the edge — Open Connect, Netflix's own CDN, keeps video traffic off the application tier entirely.

Lessons Learned

| Lesson | Details |
| --- | --- |
| Start simple | Both Instagram and Netflix started with monoliths. Don't prematurely optimize for scale you don't have. |
| Scale the bottleneck | Profile first. If 90% of load is database reads, add read replicas and caching before adding more app servers. |
| Statelessness first | Making your application tier stateless is the highest-leverage scaling move. It unlocks horizontal scaling, auto-scaling, and rolling deploys. |
| Vertical before horizontal | Vertical scaling is simpler and cheaper at moderate scale. Instagram ran a single PostgreSQL instance for millions of users. |
| Automate everything | Manual scaling doesn't work at internet scale. Invest in auto-scaling, CI/CD, and infrastructure-as-code early. |
| Design for failure | At scale, hardware failure is not an exception — it's a constant. Netflix designed for it with Chaos Monkey; Instagram relied on AWS managed failover. |

Summary

Scaling is the fundamental challenge of system design. Here's the decision framework:

  1. Vertical first — it's simple, cheap, and gets you far. Most apps never outgrow a well-provisioned single server with a read replica.
  2. Make services stateless — this is the prerequisite for everything else. Extract state to Redis/databases.
  3. Horizontal for compute — stateless app servers behind a load balancer with auto-scaling.
  4. Cache aggressively — reduce database load by 80–95% before thinking about database scaling.
  5. Shard data as a last resort — sharding introduces permanent complexity. Exhaust all other options first.
  6. Use the Scale Cube — think about X (clone), Y (decompose), and Z (partition) as complementary strategies.

In the next post, we'll dive deep into Load Balancing — the critical piece that makes horizontal scaling work.