Scaling: Vertical vs Horizontal
Why Scaling Matters
Every successful application starts small — a single server, a modest database, a handful of users. But success brings traffic. And traffic, if left unmanaged, breaks things. Scaling is the engineering discipline of ensuring your system can handle growing load without degrading the user experience.
Growth Scenarios
Consider these real-world triggers that demand scaling decisions:
- Organic growth: Your user base doubles every quarter. What handles 10,000 daily active users today needs to serve 160,000 within a year.
- Traffic spikes: A product launch, a viral tweet, a Black Friday sale — sudden 10–50× surges that can bring down an unprepared system in minutes.
- Data growth: A social media platform generates 500 TB of new data per day. Storage and query performance can't remain static.
- Geographic expansion: Serving users across continents introduces latency that a single-region deployment can't solve.
What Happens When a Single Server Can't Handle Load?
When a single server is overwhelmed, the failure cascade is predictable and devastating:
- Response times spike. A typical API call that takes 50ms under normal load suddenly takes 2–5 seconds as the CPU saturates and the request queue backs up.
- Timeouts and errors. Clients start seeing HTTP 503 (Service Unavailable) and 504 (Gateway Timeout) errors. Connection pools exhaust.
- Cascading failures. Backed-up requests consume memory. The OS starts swapping to disk. Database connections time out, causing the application layer to retry — amplifying the load further (a client-side mitigation is sketched after this list).
- Complete outage. The server becomes unresponsive. The OOM killer terminates processes. Users see blank pages.
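One standard client-side mitigation for that retry amplification is capped exponential backoff with jitter, so a fleet of clients doesn't hammer an already-overloaded server in lockstep. A minimal sketch, assuming Node 18+ for the global fetch:
// Capped exponential backoff with full jitter (illustrative sketch, Node 18+)
async function fetchWithBackoff(url, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch(url);
      if (res.status < 500) return res; // success, or a non-retryable client error
    } catch (err) {
      if (attempt === maxRetries) throw err; // network failure, out of retries
    }
    if (attempt === maxRetries) throw new Error('Server still overloaded; giving up');
    const cap = Math.min(1000 * 2 ** attempt, 30000); // 1s, 2s, 4s... capped at 30s
    await new Promise(resolve => setTimeout(resolve, Math.random() * cap)); // full jitter
  }
}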
How Startups Evolve: From Single Server to Distributed
Most successful startups follow a remarkably similar scaling trajectory:
- Phase 1 — Single box (0–1K users): Everything runs on one machine — app server, database, file storage. A $20/month VPS handles it fine. This is the right choice: shipping fast matters more than architecture.
- Phase 2 — Separate the database (1K–10K users): The database gets its own server. Now the app server has more CPU and RAM for handling requests, and the DB server has dedicated I/O bandwidth.
- Phase 3 — Add a load balancer (10K–100K users): A second app server sits behind a load balancer. Read replicas appear for the database. A Redis cache reduces DB hits by 70–80%.
- Phase 4 — Service decomposition (100K–1M users): The monolith splits into services. Background jobs move to worker queues. CDN handles static assets. Auto-scaling policies manage compute.
- Phase 5 — Full distributed system (1M+ users): Database sharding, multi-region deployment, event-driven architecture, dedicated teams per service. This is Instagram, Uber, Netflix territory.
Vertical Scaling (Scale Up)
Vertical scaling means upgrading the hardware of your existing machine — more CPU cores, more RAM, faster SSDs, better network cards. You're making one server more powerful rather than adding more servers.
How It Works
In practice, vertical scaling is the simplest scaling strategy:
- Your PostgreSQL database is struggling with query latency? Upgrade from 16 GB RAM to 64 GB — now the entire working set fits in memory.
- CPU-bound workloads maxing out 4 cores? Move to a 16-core machine.
- Disk I/O bottleneck? Replace HDDs with NVMe SSDs that deliver 500K+ IOPS instead of a spinning disk's ~200.
Advantages
| Advantage | Why It Matters |
|---|---|
| Zero code changes | Your application doesn't need to know it's running on a bigger machine. No distributed system complexity. |
| No data partitioning | All data lives on one machine. No sharding logic, no cross-node joins, no distributed transactions. |
| ACID simplicity | Single-node databases give you strong consistency for free. No CAP theorem headaches. |
| Lower operational cost | One server to monitor, backup, patch, and secure. One point of failure, but also one point of management. |
| In-process communication | Function calls instead of network calls. Nanoseconds instead of milliseconds. |
Limitations
- Hardware ceiling: The largest cloud instances cap out. AWS's u-24tb1.112xlarge has 448 vCPUs and 24 TB RAM — impressive, but finite. You can't add a 449th vCPU.
- Single point of failure: If that one powerful server goes down, everything goes down. No redundancy by default.
- Superlinear cost at the top end: Within an instance family, cloud pricing is roughly linear, but the largest machines are specialty hardware with premium pricing and limited availability. A single big box also bills for peak capacity around the clock, even when it sits mostly idle.
- Downtime during upgrades: Resizing a server typically requires a reboot — minutes of downtime per upgrade event.
- Diminishing returns: Beyond a certain point, the bottleneck shifts. Doubling RAM won't help if the problem is CPU-bound.
Real-World AWS Instance Progression
Here's a concrete scaling path using AWS EC2:
| Instance | vCPUs | RAM | ~Cost/month | Use Case |
|---|---|---|---|---|
| t3.micro | 2 | 1 GB | $8 | Dev/test, personal projects |
| t3.large | 2 | 8 GB | $60 | Small production apps |
| m6i.xlarge | 4 | 16 GB | $140 | Medium traffic web apps |
| m6i.4xlarge | 16 | 64 GB | $560 | Databases, heavy compute |
| r6i.16xlarge | 64 | 512 GB | $3,200 | Large in-memory workloads |
| x2idn.24xlarge | 96 | 1.5 TB | $14,000 | SAP HANA, massive databases |
| u-24tb1.112xlarge | 448 | 24 TB | $200,000+ | The ceiling — you've run out of vertical |
Horizontal Scaling (Scale Out)
Horizontal scaling means adding more machines to your pool and distributing the workload across them. Instead of one powerful server, you run many smaller servers behind a load balancer.
How It Works
The fundamental idea is simple: if one server handles 1,000 requests per second (RPS), ten identical servers behind a load balancer handle ~10,000 RPS. In practice, it's more nuanced, but this linear-ish relationship is the core value proposition (a toy routing sketch follows the diagram).
┌─────────────┐
│ Clients │
└──────┬──────┘
│
┌──────▼──────┐
│Load Balancer│
└──┬───┬───┬──┘
│ │ │
┌────────┘ │ └────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Server 1 │ │ Server 2 │ │ Server 3 │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└────────┬───┴────────────┘
│
┌──────▼──────┐
│ Database │
└─────────────┘
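Conceptually, the balancer's routing job fits in a few lines. A toy round-robin picker (illustrative only; real load balancers layer on health checks, weights, and connection draining):
// Toy round-robin server picker (illustrative only)
const servers = ['10.0.1.10:3000', '10.0.1.11:3000', '10.0.1.12:3000'];
let next = 0;
function pickServer() {
  const server = servers[next];
  next = (next + 1) % servers.length; // cycle through the pool
  return server;
}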
Advantages
| Advantage | Why It Matters |
|---|---|
| No single point of failure | If one server dies, the load balancer routes traffic to healthy servers. Users never notice. |
| Linear cost scaling | Need 2× capacity? Add 2× servers. Cost scales proportionally, not exponentially. |
| Theoretically unlimited | There's no hardware ceiling — you can always add more machines to the pool. |
| Rolling deployments | Update servers one at a time. Zero downtime deployments become possible. |
| Geographic distribution | Place servers in multiple regions to reduce latency for global users. |
Challenges
- State management: If Server 1 stores a user's session in local memory and the next request goes to Server 2, the session is lost. This is the fundamental challenge of horizontal scaling.
- Data consistency: With multiple servers reading and writing shared data, you face race conditions, stale reads, and the need for distributed locking or eventual consistency models.
- Network overhead: What was a local function call becomes a network round-trip (1–5ms within a datacenter, 50–150ms cross-region). This adds up fast.
- Operational complexity: Monitoring, deploying, debugging, and securing 50 servers is fundamentally harder than managing one.
- Data partitioning: At scale, your database itself needs to be split across machines (sharding), introducing significant complexity.
Session Management Strategies
When you scale horizontally, the question of "where does user state live?" becomes critical. Here are the main approaches:
1. Sticky Sessions (Session Affinity)
The load balancer remembers which server handles each user and always routes that user's requests to the same server. Typically implemented via cookies or IP hashing, as in the NGINX sketch below.
- Pros: Simple to implement, works with stateful apps without code changes.
- Cons: Uneven load distribution (one server might have all the heavy users), server failure loses all its sessions, can't auto-scale gracefully.
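NGINX, for example, implements affinity with its ip_hash directive. A minimal sketch, with hypothetical hostnames:
# Sticky sessions in NGINX via IP hashing (minimal sketch; hostnames are hypothetical)
upstream app_pool {
    ip_hash;                      # same client IP maps to the same server
    server app1.internal:3000;
    server app2.internal:3000;
    server app3.internal:3000;
}
server {
    listen 80;
    location / {
        proxy_pass http://app_pool;
    }
}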
2. External Session Store (Recommended)
Store sessions in a shared external store (Redis, Memcached, or a database) that all servers can access. Each server is stateless — any server can handle any request.
// Express.js with Redis session store
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

const app = express();
const redisClient = createClient({ url: 'redis://session-store:6379' });
redisClient.connect().catch(console.error); // CommonJS has no top-level await

app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,              // don't rewrite unchanged sessions
  saveUninitialized: false,   // don't store empty sessions
  cookie: { maxAge: 3600000 } // 1 hour
}));
3. Client-Side Sessions (JWT)
Encode session data into a signed token (JWT) stored in the client's browser. The server is completely stateless — it just verifies the token signature (sketched below).
- Pros: Zero server-side storage, scales infinitely.
- Cons: Can't revoke tokens without a blacklist (which reintroduces server state), token size limits, sensitive data exposure risk.
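A minimal sketch using the jsonwebtoken package; the Express app object is assumed from the earlier examples, and loadCart is a hypothetical helper:
// Client-side sessions with JWT (minimal sketch; loadCart is a hypothetical helper)
const jwt = require('jsonwebtoken');
const SECRET = process.env.JWT_SECRET;

app.post('/login', (req, res) => {
  // ...credentials verified above, then issue a short-lived signed token
  const token = jwt.sign({ userId: req.body.userId }, SECRET, { expiresIn: '1h' });
  res.json({ token });
});

app.get('/cart', (req, res) => {
  try {
    const claims = jwt.verify(req.headers.authorization.split(' ')[1], SECRET);
    res.json(loadCart(claims.userId)); // any server can verify; no shared state
  } catch {
    res.status(401).json({ error: 'Invalid or expired token' });
  }
});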
Stateless vs Stateful Architecture
The distinction between stateless and stateful services is perhaps the most important concept in horizontal scaling. It determines how easily your system can grow.
The Problem with Stateful Services
A stateful service stores client-specific data in local memory between requests. A classic example: a web server that keeps user sessions in an in-memory map local to the process.
// Stateful — session stored in-process memory
const express = require('express');
const cookieParser = require('cookie-parser');
const { randomUUID: generateId } = require('crypto');

const app = express();
app.use(express.json());
app.use(cookieParser()); // needed to read req.cookies

const sessions = {}; // local to THIS server

app.post('/login', (req, res) => {
  const sessionId = generateId();
  sessions[sessionId] = { userId: req.body.userId, cart: [] };
  res.cookie('sid', sessionId);
  res.json({ ok: true });
});

app.get('/cart', (req, res) => {
  const session = sessions[req.cookies.sid]; // only works on THIS server!
  if (!session) return res.status(401).json({ error: 'Session not found' });
  res.json(session.cart);
});
This breaks the moment you add a second server. The user logs in on Server 1 (which stores their session), but the load balancer sends their next request to Server 2 (which has no knowledge of that session). Result: 401 Unauthorized.
Designing for Statelessness
A stateless service treats each request as independent. It carries no memory of previous interactions. All necessary state is either:
- Included in the request itself (JWT tokens, request parameters).
- Stored in an external shared system (Redis, database, S3).
// Stateless — session stored in Redis (shared across all servers)
// app, generateId, and cookie parsing set up as in the stateful example above
const redis = require('redis').createClient({ url: 'redis://cache:6379' });
redis.connect().catch(console.error); // node-redis v4 requires an explicit connect

app.post('/login', async (req, res) => {
  const sessionId = generateId();
  await redis.hSet(`session:${sessionId}`, {
    userId: req.body.userId,
    cart: JSON.stringify([])
  });
  await redis.expire(`session:${sessionId}`, 3600); // 1-hour TTL
  res.cookie('sid', sessionId);
  res.json({ ok: true });
});

app.get('/cart', async (req, res) => {
  const session = await redis.hGetAll(`session:${req.cookies.sid}`);
  if (!session.userId) return res.status(401).json({ error: 'Session expired' });
  res.json(JSON.parse(session.cart));
});
Now any server in the pool can handle any request. The load balancer has complete freedom. Servers can be added, removed, or replaced without any coordination.
Shared-Nothing Architecture
The gold standard for horizontally scalable systems is shared-nothing architecture: each node is fully independent and self-sufficient. Nodes don't share memory, disk, or any other resource. They communicate only through well-defined network APIs.
This architecture gives you:
- Fault isolation: A crash in one node doesn't affect others.
- Linear scalability: Adding a node adds proportional capacity.
- Independent deployment: Update nodes one at a time.
Google's original infrastructure, Amazon's Dynamo, and Cassandra all follow shared-nothing principles.
Database Scaling
While application servers are relatively easy to scale horizontally (make them stateless, add more behind a load balancer), databases are the real scaling bottleneck. They hold state by definition, and distributing state is hard.
Read Replicas
Most applications are read-heavy (80–95% reads). Read replicas let you distribute read traffic across multiple copies of your database while sending all writes to a single primary.
┌───────────────┐
Writes ──────►│ Primary (RW) │
└───────┬───────┘
│ │ │ Replication
┌─────┘ │ └─────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
Reads│ Replica 1│ │ Replica 2│ │ Replica 3│
└──────────┘ └──────────┘ └──────────┘
- Replication lag: Replicas can be milliseconds to seconds behind the primary. This is fine for displaying a news feed, but not for showing a user the item they just added to their cart. Route read-after-write scenarios to the primary (see the routing sketch after this list).
- AWS RDS supports up to 15 read replicas for MySQL/PostgreSQL, with automatic failover promotion.
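A minimal sketch of that routing rule using node-postgres; the hostnames and cart_items table are hypothetical, and credentials come from the standard PG* environment variables:
// Read/write splitting (minimal sketch; hostnames are hypothetical)
const { Pool } = require('pg');
const primary = new Pool({ host: 'primary.db.internal', database: 'mydb' });
const replica = new Pool({ host: 'replica.db.internal', database: 'mydb' });

// Reads default to the replica; read-after-write goes to the primary
function db({ afterWrite = false } = {}) {
  return afterWrite ? primary : replica;
}

async function addToCart(userId, itemId) {
  await primary.query(
    'INSERT INTO cart_items (user_id, item_id) VALUES ($1, $2)', [userId, itemId]);
  // A replica may lag behind this insert, so read back through the primary
  const { rows } = await db({ afterWrite: true }).query(
    'SELECT item_id FROM cart_items WHERE user_id = $1', [userId]);
  return rows;
}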
Vertical Scaling for Writes
Write scaling is harder. In a single-primary setup, all writes funnel through one node. Your options:
- Upgrade the primary to a bigger instance (vertical scaling).
- Optimize write paths — batch inserts (sketched after this list), reduce indexes, use append-only tables.
- Move to a multi-primary setup (complex, conflict resolution needed).
- Shard the database (next section).
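Batching is often the cheapest write optimization: one round-trip and one transaction instead of N. A minimal sketch with node-postgres, assuming a hypothetical events table:
// Fold N single-row INSERTs into one statement (minimal sketch)
async function insertEvents(pool, events) {
  // events: [{ userId, type }, ...] — shape is hypothetical
  const placeholders = [];
  const params = [];
  events.forEach((e, i) => {
    placeholders.push(`($${2 * i + 1}, $${2 * i + 2})`); // ($1, $2), ($3, $4), ...
    params.push(e.userId, e.type);
  });
  await pool.query(
    `INSERT INTO events (user_id, type) VALUES ${placeholders.join(', ')}`, params);
}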
Sharding (Horizontal Partitioning)
Sharding splits your data across multiple database servers, each holding a subset. Each shard is a fully independent database instance.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Range-based | Users A–M on Shard 1, N–Z on Shard 2 | Uneven distribution (some ranges are hotter) |
| Hash-based | shard = hash(userId) % num_shards | Even distribution, but resharding is painful |
| Directory-based | A lookup service maps each key to its shard | Flexible, but the directory is a bottleneck/SPOF |
| Consistent hashing | Hash ring minimizes redistribution on shard changes | Best for dynamic scaling; more complex to implement |
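To make hash-based routing concrete, here is a minimal shard-router sketch; the hostnames and the choice of MD5 are illustrative:
// Hash-based shard routing (minimal sketch)
const crypto = require('crypto');
const SHARDS = [
  'shard-0.db.internal', 'shard-1.db.internal',
  'shard-2.db.internal', 'shard-3.db.internal'
];
function shardFor(userId) {
  // Hash the key so sequential IDs spread evenly across shards
  const digest = crypto.createHash('md5').update(String(userId)).digest();
  return SHARDS[digest.readUInt32BE(0) % SHARDS.length];
}
Note how this bakes in the table's trade-off: adding a fifth shard remaps almost every key, which is exactly the resharding pain that consistent hashing exists to soften.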
We'll cover sharding in deep detail in a dedicated post later in this series.
Connection Pooling
Each database connection consumes ~10 MB of memory on the server side. With 50 app servers each opening 20 connections, that's 1,000 connections × 10 MB = 10 GB just for connections. Connection poolers like PgBouncer (for PostgreSQL) or ProxySQL (for MySQL) sit between your app and database, multiplexing many app connections over fewer database connections.
# PgBouncer config snippet (PgBouncer's ini parser only supports whole-line comments)
[databases]
mydb = host=primary.db.internal port=5432 dbname=mydb

[pgbouncer]
# Return server connections to the pool after each transaction
pool_mode = transaction
# Accept up to 2000 client connections from app servers...
max_client_conn = 2000
# ...but multiplex them over only 50 actual DB connections
default_pool_size = 50
Caching Layer
The fastest database query is the one you never make. A caching layer (typically Redis or Memcached) can absorb 80–95% of read traffic, dramatically reducing database load.
- Cache-aside pattern: App checks cache first. On miss, reads from DB, stores in cache, returns. On write, invalidate the cache entry (sketched after this list).
- Cache hit ratio: If 90% of reads hit cache and your cache serves 100K RPS, only 10K RPS reach the database. This is the difference between needing a massive DB cluster and a modest one.
- TTL strategy: Set expiration times based on data freshness requirements. User profiles: 5 min. Product catalog: 1 hour. Configuration: 24 hours.
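A minimal cache-aside sketch with node-redis v4; getUserFromDb and updateUserInDb are hypothetical data-access helpers, and redis is the connected client from earlier:
// Cache-aside (minimal sketch)
async function getUser(id) {
  const key = `user:${id}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);  // cache hit
  const user = await getUserFromDb(id);   // miss: fall through to the DB
  await redis.set(key, JSON.stringify(user), { EX: 300 }); // 5-minute TTL
  return user;
}

async function updateUser(id, fields) {
  await updateUserInDb(id, fields);
  await redis.del(`user:${id}`); // invalidate so the next read repopulates
}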
Auto-Scaling
Manual scaling doesn't work at internet scale. You need systems that automatically add or remove capacity based on real-time demand. This is auto-scaling — letting infrastructure respond to load as it happens.
Reactive vs Predictive Auto-Scaling
| Type | How It Works | Best For |
|---|---|---|
| Reactive | Monitors metrics in real-time. When CPU exceeds 70% for 5 minutes, add 2 instances. | Unpredictable traffic patterns, general-purpose scaling |
| Predictive | Uses ML on historical data to predict load. Pre-scales 15 minutes before the expected spike. | Predictable patterns (daily peaks, weekly cycles) |
| Scheduled | Manually configured time-based rules. Scale up at 8 AM, scale down at 10 PM. | Known events (business hours, batch jobs) |
Scaling Policies
AWS Auto Scaling supports several policy types:
- Target Tracking: "Keep average CPU at 60%." AWS automatically adjusts instance count. This is the simplest and most recommended approach.
- Step Scaling: "If CPU > 70%, add 2 instances. If CPU > 90%, add 5 instances." Graduated response to different load levels.
- Simple Scaling: "If CPU > 70%, add 1 instance." Waits for cool-down before any further action. Simplest but slowest to respond.
- Custom Metrics: Scale based on application-specific metrics like queue depth, request latency, or active WebSocket connections.
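For instance, an application can publish queue depth to CloudWatch and scale on that. A minimal sketch using AWS SDK v3; getQueueDepth is a hypothetical helper:
// Publish a custom scaling metric (minimal sketch, AWS SDK v3)
const { CloudWatchClient, PutMetricDataCommand } = require('@aws-sdk/client-cloudwatch');
const cw = new CloudWatchClient({ region: 'us-east-1' });

async function reportQueueDepth() {
  const depth = await getQueueDepth(); // hypothetical: e.g., LLEN on a Redis job queue
  await cw.send(new PutMetricDataCommand({
    Namespace: 'MyApp',
    MetricData: [{ MetricName: 'QueueDepth', Value: depth, Unit: 'Count' }]
  }));
}
setInterval(reportQueueDepth, 60_000); // report once a minute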
Cool-Down Periods
Without cool-down, auto-scaling oscillates rapidly:
- CPU hits 80% → add 2 instances.
- New instances haven't warmed up yet → CPU still high → add 2 more.
- All instances warm up → CPU drops to 20% → remove 3 instances.
- Remaining instances overwhelmed → CPU spikes → add more instances.
Cool-down periods (typically 300 seconds) prevent this thrashing by blocking scaling actions after a recent change, giving the system time to stabilize.
AWS Auto Scaling Group Example
# CloudFormation snippet for an Auto Scaling Group
MyASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2            # Never fewer than 2 (high availability)
    MaxSize: 20           # Cost guardrail
    DesiredCapacity: 4    # Start with 4
    HealthCheckType: ELB
    HealthCheckGracePeriod: 120
    LaunchTemplate:
      LaunchTemplateId: !Ref MyLaunchTemplate
      Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
    TargetGroupARNs:
      - !Ref MyTargetGroup

CPUScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref MyASG
    PolicyType: TargetTrackingScaling
    # Target tracking manages its own scale-in/scale-out timing; the warmup
    # tells it how long new instances take before they contribute capacity
    EstimatedInstanceWarmup: 120
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60.0   # Keep CPU around 60%
Cost Optimization with Spot Instances
AWS Spot Instances offer up to 90% discount over On-Demand pricing, but can be reclaimed with 2-minute notice. Smart auto-scaling strategies blend instance types:
- Base capacity: 2–4 On-Demand instances (always available, minimum viable capacity).
- Burst capacity: Spot instances for the remaining capacity. Use multiple instance types (m5.xlarge, m5a.xlarge, m6i.xlarge) to reduce interruption risk.
- Fallback: If Spot isn't available, fall back to On-Demand. AWS Mixed Instances Policy handles this automatically.
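In CloudFormation, this blend is declared as a MixedInstancesPolicy on the Auto Scaling group. A sketch extending the earlier example:
# Blend On-Demand base capacity with Spot burst capacity
MyASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2
    MaxSize: 20
    MixedInstancesPolicy:
      InstancesDistribution:
        OnDemandBaseCapacity: 2                  # always 2 On-Demand instances
        OnDemandPercentageAboveBaseCapacity: 0   # everything above base is Spot
        SpotAllocationStrategy: capacity-optimized
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref MyLaunchTemplate
          Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
        Overrides:        # multiple instance types reduce interruption risk
          - InstanceType: m5.xlarge
          - InstanceType: m5a.xlarge
          - InstanceType: m6i.xlarge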
A typical SaaS application spending $50K/month on EC2 can reduce this to ~$20K with an aggressive Spot strategy.
Scaling Patterns: The Scale Cube
In 2009, Martin Abbott and Michael Fisher introduced the Scale Cube — a framework that categorizes all scaling approaches into three axes. Every scaling decision you make falls along one (or more) of these axes.
X-Axis: Horizontal Cloning
Run N identical copies of your application behind a load balancer. Each copy can handle any request. This is the most common and simplest scaling approach.
- Example: 10 identical web servers behind an NGINX load balancer.
- Pros: Simple, no code changes, works for most apps.
- Cons: Every server needs access to all data. Doesn't help with data scaling.
- Scales: Compute capacity (CPU, memory).
Y-Axis: Functional Decomposition
Split your monolith into smaller services, each responsible for a specific function. This is the microservices approach.
- Example: Separate services for User Auth, Product Catalog, Order Processing, Payments, and Notifications.
- Pros: Each service scales independently. The payment service can scale to 50 instances during checkout surges while the auth service stays at 5.
- Cons: Network complexity, distributed transactions, deployment orchestration.
- Scales: Development velocity (independent teams), operational flexibility.
Z-Axis: Data Partitioning (Sharding)
Each server runs the same code but is responsible for only a subset of data. A router sends each request to the correct shard based on a partition key.
- Example: User data split across 16 shards by hash(userId) % 16. User #12345 always goes to shard 7.
- Pros: Scales data storage and write throughput. Each shard's dataset fits in memory.
- Cons: Cross-shard queries are expensive. Rebalancing shards is painful. Application needs routing logic.
- Scales: Data volume, write throughput.
| Axis | What It Scales | Mechanism | Complexity |
|---|---|---|---|
| X (Clone) | Request throughput | Identical replicas + load balancer | Low |
| Y (Decompose) | Feature complexity | Microservices | Medium-High |
| Z (Partition) | Data volume & writes | Sharding by partition key | High |
Real-World Case Studies
Instagram: From 1 to 400M Users
When Instagram launched in October 2010, it had 2 engineers and a single server. In under three years, it had 100 million users. Here's how they scaled:
- Phase 1 (Launch): 1 Django app server + 1 PostgreSQL database on a single machine at a hosting provider. Total cost: ~$100/month.
- Phase 2 (1M users, 2 months in): Migrated to AWS. 3 NGINX + Django app servers behind an ELB. PostgreSQL on a dedicated db.m1.xlarge (vertical scaling). Added Redis for caching and Memcached for view counts.
- Phase 3 (10M users): Read replicas for PostgreSQL. Moved media uploads to S3 + CloudFront CDN. Task queue (Celery + RabbitMQ) for background jobs like push notifications and feed fan-out.
- Phase 4 (100M+ users): Horizontal PostgreSQL sharding using a custom shard-aware ORM. 12 PostgreSQL shards, each on an r3.8xlarge. Cassandra for feed storage. ~25 Django app servers.
- Key principle: Instagram kept their stack simple (Python/Django, PostgreSQL, Redis, Memcached) and scaled operations rather than chasing trendy technologies. They famously did not adopt microservices until years after the Facebook acquisition.
Netflix: Monolith to Microservices Migration
Netflix's migration from a single-datacenter monolith to a cloud-native microservices architecture is the canonical scaling story:
- 2007–2008: Netflix ran a monolithic Java application on its own hardware. In August 2008, a database corruption event halted DVD shipping for 3 days — the catalyst for change.
- 2008–2009: Began migrating to AWS. Started with non-critical services (encoding, analytics) to build confidence.
- 2010–2011: Decomposed the monolith into microservices. Each service owned its own data store. Built custom infrastructure tools: Eureka (service discovery), Zuul (API gateway), Hystrix (circuit breaker).
- 2016: Completed the migration. By January 2016, 100% of streaming traffic was served from AWS and the last datacenter was shut down.
- Today: 700+ microservices, serving 200+ million subscribers across 190 countries. Each service is independently deployed (thousands of deployments per day). They handle peak traffic of ~400 Gbps during evening hours.
Key scaling decisions:
- Y-axis scaling (microservices) for development velocity — 2,000+ engineers working independently.
- X-axis scaling (auto-scaling groups) for handling traffic peaks.
- Z-axis scaling (Cassandra with consistent hashing) for data volume.
- Chaos engineering (Chaos Monkey, Chaos Kong) to ensure resilience at scale.
- Multi-region deployment (us-east-1, us-west-2, eu-west-1) for latency and disaster recovery.
Lessons Learned
| Lesson | Details |
|---|---|
| Start simple | Both Instagram and Netflix started with monoliths. Don't prematurely optimize for scale you don't have. |
| Scale the bottleneck | Profile first. If 90% of load is database reads, add read replicas and caching before adding more app servers. |
| Statelessness first | Making your application tier stateless is the highest-leverage scaling move. It unlocks horizontal scaling, auto-scaling, and rolling deploys. |
| Vertical before horizontal | Vertical scaling is simpler and cheaper at moderate scale. Instagram ran a single PostgreSQL instance for millions of users. |
| Automate everything | Manual scaling doesn't work at internet scale. Invest in auto-scaling, CI/CD, and infrastructure-as-code early. |
| Design for failure | At scale, hardware failure is not an exception — it's a constant. Netflix designed for it with Chaos Monkey; Instagram relied on AWS managed failover. |
Summary
Scaling is the fundamental challenge of system design. Here's the decision framework:
- Vertical first — it's simple, cheap, and gets you far. Most apps never outgrow a well-provisioned single server with a read replica.
- Make services stateless — this is the prerequisite for everything else. Extract state to Redis/databases.
- Horizontal for compute — stateless app servers behind a load balancer with auto-scaling.
- Cache aggressively — reduce database load by 80–95% before thinking about database scaling.
- Shard data as a last resort — sharding introduces permanent complexity. Exhaust all other options first.
- Use the Scale Cube — think about X (clone), Y (decompose), and Z (partition) as complementary strategies.
In the next post, we'll dive deep into Load Balancing — the critical piece that makes horizontal scaling work.