Case Study: Slack’s Architecture
Slack at Scale
Slack is the workplace messaging platform used by over 20 million daily active users across 750,000+ organizations. At peak, Slack handles millions of WebSocket connections simultaneously, delivers messages in under 500 ms globally, and stores trillions of messages across enterprise workspaces. Every second, hundreds of thousands of messages flow through its infrastructure — each triggering writes, notifications, indexing, and real-time delivery.
What makes Slack architecturally fascinating is the sheer number of hard problems it solves at once:
- Real-time delivery — messages must appear for all channel members within milliseconds, not seconds.
- Enterprise-grade reliability — banks, governments, and Fortune 500 companies depend on Slack for mission-critical communication. SLA targets of 99.99% uptime are expected.
- Multi-tenancy at extreme scale — workspaces range from 5-person startups to 500,000-person enterprises, all sharing infrastructure.
- Search over trillions of messages — users expect sub-second full-text search across years of conversation history.
- Compliance and security — data retention policies, encryption key management, audit logs, and compliance exports for regulated industries.
Tech Stack Evolution
Slack’s technology stack has undergone significant evolution since its founding in 2013. Understanding this journey illuminates why certain architectural decisions were made.
The Early Days: PHP & Hack
Slack was originally built as a monolithic PHP application — a pragmatic choice for rapid prototyping when it began as an internal tool at Tiny Speck (the game company that pivoted to create Slack). The initial stack:
Early Slack Stack (2013–2016):
├── Language: PHP (later migrated to Hack/HHVM)
├── Web Server: Apache → Nginx + PHP-FPM
├── Database: MySQL (single primary, multiple replicas)
├── Cache: Memcached (session state, query results)
├── Queue: Redis-based job queue
├── Search: Apache Solr
├── Real-time: Custom WebSocket server (Node.js)
├── File Storage: Amazon S3
└── CDN: CloudFront for static assets and file delivery
PHP served Slack well up to ~1M DAU, but several pain points emerged:
- Monolith coupling — a change in message sending could break notification delivery, search indexing, or file previews.
- Type safety — dynamic typing in PHP made refactoring risky at scale, leading to adoption of Hack (Facebook’s typed PHP superset running on HHVM).
- Performance ceilings — the request-per-process model couldn’t efficiently handle long-lived WebSocket connections or CPU-intensive operations like message rendering.
- Database bottlenecks — a single MySQL primary became the choke point well before 1M workspaces.
The Modern Stack: Java Services
Starting around 2017, Slack began decomposing the monolith into Java-based microservices. The migration was incremental — the PHP monolith still handles some traffic today, but critical paths run on the Java service mesh:
Modern Slack Stack (2020+):
├── API Layer
│ ├── Hack (HHVM): Legacy API endpoints, web app rendering
│ └── Java (JDK 17+): New API services, high-throughput paths
│
├── Service Mesh
│ ├── gRPC for inter-service communication
│ ├── Envoy sidecar proxies for load balancing, observability
│ └── Service discovery via Consul
│
├── Data Layer
│ ├── MySQL: Primary data store (messages, channels, users)
│ ├── Vitess: MySQL sharding proxy (workspace-level sharding)
│ ├── Memcached: Hot data caching (billions of gets/day)
│ └── Redis: Ephemeral state, presence, rate limiting
│
├── Async Processing
│ ├── Kafka: Event streaming backbone
│ ├── Custom Job Queue: Async task execution
│ └── Redis Streams: Lightweight pub/sub for presence
│
├── Search & Analytics
│ ├── Solr/Elasticsearch: Message search (per-workspace indexes)
│ └── Presto + Hive: Analytics, data warehouse
│
├── Real-time
│ ├── WebSocket Gateway (Java): Persistent connections
│ ├── Channel-based pub/sub fan-out
│ └── Flannel: Edge caching layer
│
└── Infrastructure
├── AWS (primary cloud)
├── Terraform for infrastructure-as-code
├── Kubernetes for container orchestration
└── Datadog + PagerDuty for monitoring/alerting
Vitess for MySQL Sharding
Vitess is the cornerstone of Slack’s data tier. Originally developed at YouTube and later donated to CNCF, Vitess provides horizontal sharding for MySQL while preserving the MySQL wire protocol — applications connect to Vitess as if it were a regular MySQL server.
Why Slack Chose Vitess
By 2016, Slack’s single-primary MySQL setup was hitting hard limits. The options were:
| Option | Pros | Cons |
|---|---|---|
| Vertical scaling | No code changes | Hard ceiling, exponential cost |
| NoSQL migration | Built-in sharding | Massive rewrite, lose ACID, risky at scale |
| Application-level sharding | Full control | Every service needs shard-aware code |
| Vitess | MySQL compatible, proven at YouTube, transparent sharding | Operational complexity, learning curve |
Slack chose Vitess because it allowed them to shard MySQL horizontally without rewriting application code. Existing MySQL queries continued to work — Vitess transparently routes them to the correct shard.
Vitess Architecture at Slack
┌─────────────────────────┐
│ Application │
│ (Java / Hack services) │
└────────────┬──────────────┘
│ MySQL protocol
▼
┌─────────────────────────┐
│ VTGate │
│ (Stateless proxy) │
│ • SQL parsing │
│ • Query routing │
│ • Scatter-gather │
│ • Connection pooling │
└────────────┬──────────────┘
│ gRPC
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ VTTablet │ │ VTTablet │ │ VTTablet │
│ (Shard 0) │ │ (Shard 1) │ │ (Shard N) │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ MySQL │ │ │ │ MySQL │ │ │ │ MySQL │ │
│ │ Primary + │ │ │ │ Primary +│ │ │ │ Primary +│ │
│ │ Replicas │ │ │ │ Replicas │ │ │ │ Replicas │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
└──────────────┘ └──────────────┘ └──────────────┘
Topology Server (etcd): Stores shard map, tablet health,
serving state, and schema versions for all keyspaces.
Key Vitess Components
- VTGate — Stateless proxy that receives MySQL connections from applications. It parses SQL, resolves the target shard(s) using the VSchema (Vitess schema mapping), and forwards queries via gRPC to the appropriate VTTablet. For cross-shard queries, VTGate performs scatter-gather: it sends the query to all relevant shards and merges results.
- VTTablet — Runs alongside each MySQL instance. It manages the MySQL process, enforces query restrictions (e.g., blocking full table scans), handles connection pooling to MySQL, and reports health to the topology server. Each tablet knows if it’s a primary or replica and its serving state.
- Topology Server (etcd) — Stores the global cluster state: which shards exist, which tablets serve which shards, their health status, and the VSchema. VTGate reads this at startup and watches for changes.
- VSchema — A JSON configuration that defines the sharding strategy for each table. It tells Vitess how to map rows to shards.
Workspace-Level Sharding
Slack shards primarily by workspace ID. This is the natural partition boundary because:
- Most queries are workspace-scoped (messages in a channel, members of a workspace, search within a workspace).
- Cross-workspace queries are rare (shared channels are handled separately).
- Workspace-level sharding provides strong data locality — all data for a workspace lives on the same shard.
// Vitess VSchema — workspace-based sharding
{
"sharded": true,
"vindexes": {
"workspace_hash": {
"type": "xxhash" // Consistent hashing on workspace_id
}
},
"tables": {
"messages": {
"column_vindexes": [
{ "column": "workspace_id", "name": "workspace_hash" }
]
},
"channels": {
"column_vindexes": [
{ "column": "workspace_id", "name": "workspace_hash" }
]
},
"users_workspace": {
"column_vindexes": [
{ "column": "workspace_id", "name": "workspace_hash" }
]
},
"files_metadata": {
"column_vindexes": [
{ "column": "workspace_id", "name": "workspace_hash" }
]
}
}
}
// Query routing example:
// App sends: SELECT * FROM messages WHERE workspace_id = 'T12345' AND channel_id = 'C98765'
// VTGate:
// 1. Extracts workspace_id = 'T12345'
// 2. Computes xxhash('T12345') → keyspace ID lands in shard range 40-60
// 3. Routes to VTTablet for shard 40-60
// 4. VTTablet executes query on local MySQL
// 5. Returns result to VTGate → Application
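To make the routing concrete, here is a minimal Java sketch of hash-range shard selection in the style of VTGate's vindex lookup. It is illustrative, not Vitess code: SHA-256 stands in for xxhash (which is not in the JDK), and shard ranges are expressed over the first byte of the hash, mirroring Vitess's hex keyspace ID ranges.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Illustrative stand-in for VTGate's vindex-based routing (not Vitess code).
// Vitess hashes the sharding column to a keyspace ID and routes to the shard
// whose range covers it; ranges here are over the first hash byte.
public class ShardRouter {
    record Shard(String name, int startInclusive, int endExclusive) {}

    // A 4-shard keyspace: "-40", "40-80", "80-c0", "c0-" in Vitess notation.
    static final List<Shard> SHARDS = List.of(
            new Shard("-40",   0x00, 0x40),
            new Shard("40-80", 0x40, 0x80),
            new Shard("80-c0", 0x80, 0xc0),
            new Shard("c0-",   0xc0, 0x100));

    // SHA-256 stands in for the xxhash vindex (xxhash is not in the JDK).
    static int keyspaceIdFirstByte(String workspaceId) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(workspaceId.getBytes(StandardCharsets.UTF_8));
        return digest[0] & 0xFF; // unsigned first byte of the keyspace ID
    }

    static Shard route(String workspaceId) throws Exception {
        int b = keyspaceIdFirstByte(workspaceId);
        return SHARDS.stream()
                .filter(s -> b >= s.startInclusive() && b < s.endExclusive())
                .findFirst().orElseThrow();
    }

    public static void main(String[] args) throws Exception {
        // Every query carrying workspace_id = 'T12345' lands on the same shard.
        System.out.println("T12345 -> shard " + route("T12345").name());
    }
}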
Online Resharding
One of Vitess’s killer features is online resharding — splitting or merging shards without downtime. When a shard becomes hot (e.g., a massive enterprise workspace dominates a shard), Slack can split it:
Resharding Process (splitting shard 40-60 → 40-50, 50-60):
Step 1: Create new VTTablets for shards 40-50 and 50-60
Step 2: Start VReplication — stream row changes from old shard to new shards
(VReplication filters rows by their vindex hash value)
Step 3: Catch-up phase — new shards lag behind by seconds, then milliseconds
Step 4: Cut-over:
a. Stop writes to old shard (brief ~100ms pause)
b. Verify new shards are fully caught up
c. Update topology server: new shards are now serving
d. Resume writes — now routed to new shards
Step 5: Old shard decommissioned after verification period
Total downtime: <1 second (during cut-over)
Application awareness: Zero — VTGate handles the routing change transparently
Job Queue System
Not everything in Slack happens synchronously. Sending a message triggers a cascade of async work — and Slack’s job queue system is the backbone of this asynchronous processing.
Async Task Categories
When a message is sent, the API server writes it to MySQL and returns immediately. Everything else is async:
Message Post → API Server (synchronous: ~50ms)
│
├── Enqueue: "index_message" (Search indexing)
├── Enqueue: "send_push_notifications" (iOS/Android push)
├── Enqueue: "send_email_notification" (for offline users)
├── Enqueue: "execute_workflows" (Workflow Builder triggers)
├── Enqueue: "update_unread_counts" (Badge counts)
├── Enqueue: "run_app_event_subscriptions" (Bot events)
├── Enqueue: "generate_link_previews" (URL unfurling)
├── Enqueue: "process_file_attachments" (Thumbnails, virus scan)
└── Enqueue: "update_analytics" (Message volume metrics)
Queue Architecture
Slack’s job queue evolved from a simple Redis-backed queue to a more sophisticated system:
Job Queue Architecture:
┌─────────────┐ ┌───────────────────────────────────────┐
│ Producers │ │ Kafka Topic │
│ (API servers)│───▶│ "async-jobs" (partitioned by type) │
└─────────────┘ └───────────────┬───────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Worker Pool│ │ Worker Pool│ │ Worker Pool│
│ "search" │ │ "notify" │ │ "webhooks" │
│ 200 workers│ │ 150 workers│ │ 100 workers│
└────────────┘ └────────────┘ └────────────┘
Job Priorities:
P0 (Critical): Push notifications, real-time events — <1s
P1 (High): Search indexing, unread counts — <5s
P2 (Normal): Link previews, file processing — <30s
P3 (Low): Analytics, compliance logging — <5min
Retry Policy:
• Exponential backoff: 1s → 2s → 4s → 8s → ... → max 5min
• Max retries: 5 (P0), 10 (P1), 20 (P2/P3)
• Dead letter queue for permanently failed jobs
• Idempotency keys prevent duplicate execution
Delivery Guarantees
Slack’s job queue provides at-least-once delivery with idempotency:
- At-least-once — Jobs are retried on failure, ensuring no work is silently dropped. Workers must be idempotent because the same job may execute more than once.
- Idempotency keys — Each job carries a unique key (e.g., `msg:{message_ts}:index`). Workers check this key before executing; if the key is already marked complete, the duplicate job is discarded. A minimal sketch follows this list.
- Poison pill protection — Jobs that crash workers repeatedly are automatically routed to the dead letter queue after max retries, preventing a single bad message from blocking an entire worker pool.
- Back-pressure — When worker pools are saturated, Kafka consumer lag increases. Monitoring triggers auto-scaling of worker instances and alerts the on-call team if lag exceeds thresholds.
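A minimal sketch of these guarantees, with an in-memory map standing in for whatever shared store actually tracks completed idempotency keys: at-least-once execution with dedup, capped exponential backoff, and a dead letter queue for poison pills.

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of at-least-once job execution with idempotency keys, capped
// exponential backoff, and a dead letter queue. The completion store is an
// in-memory map here; the real system would use a shared store.
public class JobWorker {
    record Job(String idempotencyKey, Runnable task) {}

    private final Map<String, Boolean> completed = new ConcurrentHashMap<>();
    private final Queue<Job> deadLetterQueue = new ConcurrentLinkedQueue<>();

    void execute(Job job, int maxRetries) throws InterruptedException {
        // Dedup: a redelivered job whose key is already complete is discarded.
        if (completed.containsKey(job.idempotencyKey())) return;

        long backoffMs = 1_000; // 1s -> 2s -> 4s -> ... capped at 5min
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                job.task().run();
                completed.put(job.idempotencyKey(), true); // mark complete
                return;
            } catch (RuntimeException e) {
                if (attempt == maxRetries) {
                    deadLetterQueue.add(job); // poison pill protection
                    return;
                }
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 300_000);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        JobWorker worker = new JobWorker();
        Job index = new Job("msg:1714567890.000100:index",
                () -> System.out.println("indexing message"));
        worker.execute(index, 10); // runs once
        worker.execute(index, 10); // duplicate delivery: discarded
    }
}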
Real-Time Messaging
Real-time message delivery is Slack’s core product experience. When Alice sends a message in #engineering, every member of that channel must see it appear within milliseconds — even if there are 10,000 members distributed across 50 countries.
WebSocket Connection Layer
Each Slack client (desktop, mobile, browser) maintains a persistent WebSocket connection to Slack’s WebSocket Gateway:
WebSocket Gateway Architecture:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Client App │ │ Client App │ │ Client App │
│ (Desktop) │ │ (Mobile) │ │ (Browser) │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ WSS │ WSS │ WSS
▼ ▼ ▼
┌──────────────────────────────────────────────────┐
│ Load Balancer (L4/L7) │
│ (HAProxy / AWS NLB) │
│ • Sticky sessions by connection ID │
│ • Health check WebSocket gateways │
│ • TLS termination │
└──────────────────────┬───────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌────────────┐┌────────────┐┌────────────┐
│ WS Gateway ││ WS Gateway ││ WS Gateway │
│ Server 1 ││ Server 2 ││ Server N │
│ ││ ││ │
│ ~100K conn ││ ~100K conn ││ ~100K conn │
│ ││ ││ │
│ In-memory: ││ In-memory: ││ In-memory: │
│ user→conn ││ user→conn ││ user→conn │
│ channel→ ││ channel→ ││ channel→ │
│ users ││ users ││ users │
└─────┬──────┘└─────┬──────┘└─────┬──────┘
│ │ │
└─────────────┼─────────────┘
│ Subscribe
▼
┌──────────────────┐
│ Message Bus │
│ (Redis Pub/Sub │
│ or Kafka) │
└──────────────────┘
Channel-Based Pub/Sub Fan-Out
When a message is posted to a channel, the fan-out process delivers it to every connected member:
Fan-out Process:
1. API server writes message to MySQL (via Vitess)
2. API server publishes event to Message Bus:
{
"type": "message",
"channel": "C98765",
"workspace": "T12345",
"user": "U111",
"text": "Deploying v2.3.1 to production",
"ts": "1714567890.000100"
}
3. All WS Gateway servers subscribed to workspace T12345 receive the event
4. Each WS Gateway:
a. Looks up local channel membership: "Which of MY connected users are in C98765?"
b. For each matching user, serializes the event and writes to their WebSocket
c. For users with multiple connected devices, sends to ALL connections
Fan-out complexity:
• Small channel (10 members, 1 gateway): 10 WebSocket writes
• Large channel (10,000 members, 50 gateways):
Each gateway sends to its local subset — average 200 writes each
Total: 10,000 WebSocket writes across 50 servers in parallel
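Steps 4a–4c are the gateway-local part of the fan-out. A simplified sketch, assuming the in-memory channel→users and user→connections maps shown in the gateway diagram, with the actual WebSocket write stubbed out:

import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified sketch of steps 4a-4c: each gateway fans an event out only to
// the subset of channel members connected to *this* gateway, on all devices.
public class GatewayFanout {
    interface WebSocketConn { void send(String json); }

    // In-memory routing state, as in the diagram: channel -> members,
    // user -> open connections (one per device).
    private final Map<String, Set<String>> channelMembers;
    private final Map<String, List<WebSocketConn>> userConnections;

    GatewayFanout(Map<String, Set<String>> channelMembers,
                  Map<String, List<WebSocketConn>> userConnections) {
        this.channelMembers = channelMembers;
        this.userConnections = userConnections;
    }

    void onMessageEvent(String channelId, String eventJson) {
        // 4a. Which of MY connected users are in this channel?
        for (String userId : channelMembers.getOrDefault(channelId, Set.of())) {
            // 4b/4c. Write the serialized event to every device connection.
            for (WebSocketConn conn : userConnections.getOrDefault(userId, List.of())) {
                conn.send(eventJson);
            }
        }
    }

    public static void main(String[] args) {
        WebSocketConn u111Desktop = json -> System.out.println("U111 <- " + json);
        WebSocketConn u222Desktop = json -> System.out.println("U222 desktop <- " + json);
        WebSocketConn u222Mobile  = json -> System.out.println("U222 mobile  <- " + json);
        GatewayFanout fanout = new GatewayFanout(
                Map.of("C98765", Set.of("U111", "U222")),
                Map.of("U111", List.of(u111Desktop),
                       "U222", List.of(u222Desktop, u222Mobile)));
        fanout.onMessageEvent("C98765",
                "{\"type\":\"message\",\"channel\":\"C98765\",\"text\":\"Deploying v2.3.1\"}");
    }
}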
Message Ordering Guarantees
Slack uses Lamport-style timestamps (ts field) for message ordering within a channel:
- The `ts` (timestamp) is a unique, monotonically increasing value per channel, formatted as `epoch.sequence` (e.g., `1714567890.000100`); a sketch of such a generator follows this list.
- The API server holds a per-channel lock (via Redis) when generating timestamps to prevent conflicts.
- Clients sort messages by `ts` — even if messages arrive out of order via WebSocket, the display order is correct.
- Edits and deletions reference the original `ts`, ensuring they modify the correct message.
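A hypothetical generator for this `epoch.sequence` format might look like the following — a per-channel critical section (standing in for the Redis lock) guarantees uniqueness and monotonicity, bumping the sequence when two messages land in the same second.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-channel "ts" generator in the epoch.sequence format.
// Synchronizing on a per-channel lock object stands in for the Redis lock.
public class TsGenerator {
    private record Last(long epochSec, long seq) {}
    private final Map<String, Last> lastByChannel = new ConcurrentHashMap<>();

    String next(String channelId) {
        synchronized (channelId.intern()) { // per-channel critical section (sketch shortcut)
            long now = System.currentTimeMillis() / 1000;
            Last last = lastByChannel.get(channelId);
            long seq = 0;
            if (last != null && last.epochSec() >= now) {
                now = last.epochSec();  // never move backwards
                seq = last.seq() + 1;   // same second: bump the sequence
            }
            lastByChannel.put(channelId, new Last(now, seq));
            return "%d.%06d".formatted(now, seq); // epoch.sequence, e.g. "1714567890.000001"
        }
    }

    public static void main(String[] args) {
        TsGenerator gen = new TsGenerator();
        // Two messages in the same second get distinct, ordered ts values.
        System.out.println(gen.next("C98765"));
        System.out.println(gen.next("C98765"));
    }
}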
Flannel: Slack’s Edge Caching Layer
Flannel is Slack’s custom client-aware caching layer that sits between the API servers and the clients. It is one of Slack’s most innovative architectural components — purpose-built to solve the boot problem.
The Boot Problem
When a Slack client opens, it needs to load everything about the user’s workspace:
Client boot payload ("rtm.start" or "client.boot"):
├── Workspace metadata (name, icon, settings, plan)
├── User list (every member of the workspace: id, name, avatar, status)
├── Channel list (every channel: name, purpose, membership, unread counts)
├── IM/Group list (direct messages, multi-party DMs)
├── Emoji list (custom emoji: name → URL mappings)
├── User groups (handle groups, their members)
├── Saved items, reminders, preferences
└── Bot users and app configurations
For a 50,000-person workspace, this payload is 5–20 MB of JSON.
Without Flannel, every client boot required the API server to query dozens of MySQL tables, then join and serialize massive result sets. At Slack’s scale, this created devastating load — especially after outages, when every client reconnects simultaneously (the “thundering herd” problem).
Flannel Architecture
Flannel Architecture:
┌──────────────┐
│ Slack Client │
└──────┬───────┘
│ "Give me everything for workspace T12345"
▼
┌──────────────────────────────────────────────────┐
│ Flannel Server │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ In-Memory Cache (per workspace) │ │
│ │ │ │
│ │ T12345 → { │ │
│ │ users: [50,000 user objects], │ │
│ │ channels: [8,000 channel objects], │ │
│ │ emoji: [2,000 custom emoji], │ │
│ │ ... │ │
│ │ version: 4,827,301 │ │
│ │ } │ │
│ │ │ │
│ │ Cache hit? → Return instantly (~5ms) │ │
│ │ Cache miss? → Fetch from API, cache, return │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Change Stream Consumer (Kafka) │ │
│ │ │ │
│ │ Listens for workspace mutation events: │ │
│ │ • user_joined, user_left, profile_updated │ │
│ │ • channel_created, channel_archived │ │
│ │ • emoji_added, emoji_removed │ │
│ │ • preferences_changed │ │
│ │ │ │
│ │ On event → Apply incremental update to │ │
│ │ cached workspace state, bump version │ │
│ └─────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Key Design Decisions:
1. Per-workspace caching: Each workspace's full state is a single cache entry
2. Incremental updates: Flannel never re-fetches the full workspace — it
applies events as diffs (user added → append to users list)
3. Versioned state: Every mutation bumps a version counter. Clients can
request "give me changes since version X" for efficient reconnection
4. Sharded by workspace: Flannel servers are assigned workspace ranges,
ensuring each workspace's state lives on exactly one Flannel server
Delta Updates and Reconnection
Flannel’s versioned state enables delta reconnection — when a client reconnects, it sends its last known version and receives only the changes:
// Client reconnects after brief disconnection:
GET /api/flannel.delta?workspace=T12345&since_version=4827290
// Flannel returns only the 11 changes since that version:
{
"version": 4827301,
"changes": [
{ "type": "user_status_changed", "user": "U555", "status": "🏖️ PTO" },
{ "type": "channel_created", "channel": { "id": "C99999", "name": "new-project" } },
{ "type": "user_joined_channel", "user": "U123", "channel": "C98765" },
// ... 8 more incremental changes
]
}
// Instead of re-downloading 15 MB, the client receives ~2 KB of deltas.
// This reduces reconnection load by 99.9% during thundering herd scenarios.
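The versioned-state mechanism can be sketched as follows — an illustrative class, not Slack’s code: every mutation bumps a version counter and lands in a bounded change log, so a reconnecting client either receives a small delta or is told to re-fetch the full boot payload.

import java.util.ArrayList;
import java.util.List;

// Illustrative versioned workspace cache in the spirit of Flannel: mutations
// are applied as diffs and logged, so reconnecting clients can ask for
// "everything since version X" instead of re-downloading the full payload.
public class WorkspaceState {
    record Change(long version, String eventJson) {}

    private long version = 0;
    private final List<Change> changeLog = new ArrayList<>();
    private static final int MAX_LOG = 10_000; // bound the delta window

    synchronized long apply(String eventJson) {
        version++; // every mutation bumps the version counter
        changeLog.add(new Change(version, eventJson));
        if (changeLog.size() > MAX_LOG) changeLog.remove(0);
        // (A real implementation would also mutate the cached users/channels
        //  lists here; only the versioning machinery is shown.)
        return version;
    }

    // Returns the delta since `sinceVersion`, or null if the client is so far
    // behind that it must re-fetch the full boot payload instead.
    synchronized List<Change> delta(long sinceVersion) {
        long oldestServable = version - changeLog.size(); // one before oldest logged
        if (sinceVersion < oldestServable) return null;   // gap: full re-sync
        return changeLog.stream()
                .filter(c -> c.version() > sinceVersion)
                .toList();
    }

    public static void main(String[] args) {
        WorkspaceState t12345 = new WorkspaceState();
        t12345.apply("{\"type\":\"channel_created\",\"channel\":\"C99999\"}");
        t12345.apply("{\"type\":\"user_joined_channel\",\"user\":\"U123\"}");
        System.out.println(t12345.delta(1)); // only the second change
    }
}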
Search Infrastructure
Slack search allows users to find any message, file, or conversation across their workspace’s entire history. At scale, this means searching across trillions of messages with sub-second latency.
Search Stack
Slack uses a combination of Solr and Elasticsearch for its search infrastructure, with the stack evolving over time:
Search Architecture:
┌────────────┐ ┌──────────────────────────────────────┐
│ User types │ │ Search API Service │
│ query │───▶│ │
└────────────┘ │ 1. Parse query (operators, filters) │
│ 2. Resolve workspace shard │
│ 3. Build Solr/ES query DSL │
│ 4. Execute against workspace index │
│ 5. Re-rank results │
│ 6. Hydrate (load full messages) │
│ 7. Apply access control filtering │
│ 8. Return results │
└───────────────┬──────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Search Cluster (per workspace) │
│ │
│ Index per workspace: │
│ ┌───────────────────────────────┐ │
│ │ messages: { │ │
│ │ ts, text, user_id, │ │
│ │ channel_id, channel_type, │ │
│ │ has_file, has_link, │ │
│ │ reactions, thread_ts, │ │
│ │ workspace_id │ │
│ │ } │ │
│ │ files: { │ │
│ │ name, content_text, │ │
│ │ file_type, user_id, │ │
│ │ channel_id, upload_ts │ │
│ │ } │ │
│ └───────────────────────────────┘ │
└──────────────────────────────────────┘
Indexing Pipeline
Messages are indexed asynchronously via the job queue system. The indexing pipeline:
Indexing Pipeline:
1. Message written to MySQL (via Vitess)
2. "index_message" job enqueued to Kafka
3. Search Indexer Worker picks up job:
a. Fetches full message from MySQL (with context: channel name, user info)
b. Tokenizes text content
c. Extracts entities (mentions, links, emoji, code blocks)
d. Generates search document:
{
"ts": "1714567890.000100",
"text": "Deploying v2.3.1 to production @oncall",
"text_analyzed": ["deploy", "v2.3.1", "production", "oncall"],
"user_id": "U111",
"user_name": "alice",
"channel_id": "C98765",
"channel_name": "engineering",
"channel_type": "public",
"has_mention": true,
"mentioned_users": ["U222"],
"has_link": false,
"has_file": false,
"reactions": [],
"thread_ts": null,
"workspace_id": "T12345"
}
e. Sends document to Solr/ES for the workspace's index
f. Acknowledges job completion
Indexing Latency: P50 = 2s, P99 = 8s (messages become searchable within seconds)
Throughput: Hundreds of thousands of messages indexed per second globally
Access Control in Search
Search results must respect channel permissions. A user cannot find messages from private channels they’re not a member of:
- Public channels — searchable by anyone in the workspace.
- Private channels — results filtered to current members only. The search service queries the channel membership cache at query time.
- DMs/Group DMs — only participants can search.
- Shared channels — searchable by members from both workspaces, with each workspace’s compliance policies applied.
Slack also supports search modifiers such as `from:@alice`, `in:#engineering`, `has:link`, `before:2024-01-01`, and `during:March`. These are parsed by the Search API into structured filters applied at the Solr/ES level, avoiding full-text scans for filtered queries.
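A toy parser for modifiers of this shape might look like this (illustrative only, not the Search API’s actual grammar): tokens with a known prefix become structured filters, and everything else remains full-text.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy query parser: splits "from:@alice in:#engineering deploy" into
// structured filters (applied as index-level filters) plus free text.
public class SearchQueryParser {
    static final Set<String> MODIFIERS =
            Set.of("from", "in", "has", "before", "after", "during");

    record ParsedQuery(Map<String, String> filters, String text) {}

    static ParsedQuery parse(String raw) {
        Map<String, String> filters = new HashMap<>();
        List<String> textTerms = new ArrayList<>();
        for (String token : raw.trim().split("\\s+")) {
            int colon = token.indexOf(':');
            if (colon > 0 && MODIFIERS.contains(token.substring(0, colon))) {
                filters.put(token.substring(0, colon), token.substring(colon + 1));
            } else {
                textTerms.add(token); // anything else is full-text
            }
        }
        return new ParsedQuery(filters, String.join(" ", textTerms));
    }

    public static void main(String[] args) {
        ParsedQuery q = parse("from:@alice in:#engineering has:link deploy v2.3.1");
        System.out.println(q.filters()); // {from=@alice, in=#engineering, has=link}
        System.out.println(q.text());    // "deploy v2.3.1"
    }
}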
Channel & DM Architecture
Channels are the fundamental abstraction in Slack — every conversation happens in a channel, whether it’s a public team channel, a private group, or a direct message.
Channel Types
| Type | Prefix | Max Members | Visibility |
|---|---|---|---|
| Public Channel | C | Unlimited | Discoverable, joinable by anyone |
| Private Channel | G | Unlimited | Invite-only, hidden from non-members |
| DM (1:1) | D | 2 | Private between two users |
| Group DM (MPDM) | G | 9 | Private group conversation |
| Shared Channel | C | Unlimited | Spans multiple workspaces (Slack Connect) |
Message Data Model
-- Core message schema (simplified from Slack's actual schema)
CREATE TABLE messages (
workspace_id VARCHAR(12) NOT NULL, -- Shard key (Vitess vindex)
channel_id VARCHAR(12) NOT NULL,
ts VARCHAR(20) NOT NULL, -- "1714567890.000100" (unique per channel)
user_id VARCHAR(12) NOT NULL,
text TEXT,
thread_ts VARCHAR(20), -- Parent message ts (NULL if not in thread)
subtype VARCHAR(30), -- 'bot_message', 'file_share', etc.
edited_ts VARCHAR(20), -- Timestamp of last edit
is_deleted BOOLEAN DEFAULT FALSE,
reactions JSON, -- [{"name":"thumbsup","users":["U111","U222"]}]
files JSON, -- Attached file metadata
blocks JSON, -- Block Kit structured content
  PRIMARY KEY (workspace_id, channel_id, ts),
  INDEX idx_thread (workspace_id, channel_id, thread_ts) -- For thread lookups
) ENGINE=InnoDB;
-- Channel membership (denormalized for fast lookups)
CREATE TABLE channel_members (
workspace_id VARCHAR(12) NOT NULL,
channel_id VARCHAR(12) NOT NULL,
user_id VARCHAR(12) NOT NULL,
joined_at BIGINT NOT NULL,
last_read_ts VARCHAR(20), -- For unread tracking
is_muted BOOLEAN DEFAULT FALSE,
notification_pref VARCHAR(20) DEFAULT 'default',
PRIMARY KEY (workspace_id, channel_id, user_id),
INDEX idx_user_channels (workspace_id, user_id)
) ENGINE=InnoDB;
Thread Support
Threads are implemented as messages with a non-null `thread_ts` that references the parent message:
- Parent message — `thread_ts = NULL`; appears in the main channel timeline.
- Reply — `thread_ts = parent.ts`; appears in the thread view. Optionally also broadcasts to the channel (“Also send to #channel”).
- Thread summary — The parent message includes a `reply_count` and `latest_reply` for UI preview without loading the full thread.
- Thread followers — Users who have replied or explicitly followed the thread receive notifications for new replies.
-- Thread query pattern: load all replies in a thread
SELECT * FROM messages
WHERE workspace_id = 'T12345'
  AND channel_id = 'C98765'
  AND thread_ts = '1714567890.000100'
ORDER BY ts ASC;

-- This query is efficient: the (workspace_id, channel_id) prefix of the
-- primary key narrows the scan, and idx_thread covers the thread_ts filter.
File Handling
Slack processes billions of file uploads — images, PDFs, code snippets, videos. The file pipeline must handle upload, processing, storage, and delivery at scale.
File Pipeline
File Upload Pipeline:
Client → Upload API (multipart/form-data)
│
├── 1. Validate (file size, type, workspace quota)
├── 2. Generate unique file ID and S3 key
├── 3. Stream to Amazon S3 (multi-part upload for large files)
├── 4. Write metadata to MySQL:
│ { file_id, workspace_id, user_id, channel_id,
│ filename, mimetype, size_bytes, s3_key,
│ upload_ts, is_public }
├── 5. Return file URL to client (via CDN)
│
└── 6. Enqueue async processing jobs:
├── "generate_thumbnail" (images/videos → multiple sizes)
├── "virus_scan" (ClamAV scan)
├── "extract_text" (PDFs/docs → plain text for search)
├── "generate_preview" (code syntax highlighting)
└── "transcode_video" (if video, convert to streamable format)
Storage Tiers:
• Hot (S3 Standard): Recent files, frequently accessed
• Warm (S3 IA): Files older than 90 days
• Cold (S3 Glacier): Enterprise retention archive
• CDN (CloudFront): Thumbnails, previews — cached at edge
URL Pattern:
https://files.slack.com/{workspace_id}/{file_id}/{filename}
↓ CloudFront resolves → S3 origin with signed URL (time-limited access)
File Access Security
- Signed URLs — File URLs contain time-limited signatures. Even if a URL is shared externally, it expires within hours.
- Channel-scoped access — The file service checks whether the requesting user is a member of the channel where the file was shared before generating a signed URL.
- Enterprise Key Management (EKM) — For EKM customers, files are encrypted with customer-managed keys. Slack cannot decrypt without the customer’s key.
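The mechanics of a time-limited signature can be sketched generically: sign the path plus an expiry with a server-side secret, then recompute and compare on access. This is an illustrative HMAC scheme, not Slack’s or CloudFront’s actual signing format.

import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Generic sketch of time-limited signed URLs: sign path + expiry with an
// HMAC; verifiers recompute the signature and reject expired or tampered URLs.
public class SignedUrls {
    private final byte[] secret;

    SignedUrls(byte[] secret) { this.secret = secret; }

    String sign(String path, long ttlSeconds) throws Exception {
        long expires = Instant.now().getEpochSecond() + ttlSeconds;
        return path + "?expires=" + expires + "&sig=" + hmac(path + "|" + expires);
    }

    boolean verify(String path, long expires, String sig) throws Exception {
        if (Instant.now().getEpochSecond() > expires) return false; // expired
        // (Production code would use a constant-time comparison here.)
        return hmac(path + "|" + expires).equals(sig);              // tampered?
    }

    private String hmac(String data) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        return HexFormat.of().formatHex(mac.doFinal(data.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        SignedUrls signer = new SignedUrls("demo-secret".getBytes(StandardCharsets.UTF_8));
        // The channel-membership access check would happen before signing.
        System.out.println(signer.sign("/T12345/F555/design.pdf", 3600));
    }
}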
Enterprise Features
Enterprise Grid is Slack’s tier for the largest organizations. It adds critical features for compliance, security, and administration at scale.
Data Retention Policies
Retention Policy Engine:
┌──────────────────────────────────────────────────┐
│ Retention Policy Configuration │
│ │
│ Workspace-level defaults: │
│ • Keep all messages: forever │
│ • Delete messages older than: 1y / 2y / custom │
│ • Delete files older than: 90d / 1y / custom │
│ │
│ Channel-level overrides: │
│ • #legal-hold: retain forever (override) │
│ • #temp-project: delete after 30 days │
│ │
│ User-level (DMs): │
│ • Follow workspace default or custom policy │
└──────────────────────┬───────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ Retention Worker (daily cron) │
│ │
│ For each workspace with retention policies: │
│ 1. Query messages older than retention period │
│ 2. Check legal hold status (skip if held) │
│ 3. Soft-delete (mark is_deleted = true) │
│ 4. Remove from search index │
│ 5. After grace period, hard-delete from MySQL │
│ 6. Remove files from S3 │
│ 7. Log deletion for compliance audit trail │
└──────────────────────────────────────────────────┘
Compliance Exports
Regulated organizations (finance, healthcare, government) need to export all Slack data for legal discovery. Slack’s compliance export system:
- Full export — All messages, files, and metadata for a workspace or date range. Used for e-discovery and regulatory audits.
- Incremental export — Only data created/modified since the last export. Enables daily compliance pipelines.
- DLP integration — Data Loss Prevention tools can scan messages in real-time and flag or block messages containing sensitive data (credit card numbers, SSNs, proprietary code).
- Legal hold — Administrators can place specific users or channels on legal hold, preventing data deletion regardless of retention policies.
Enterprise Key Management (EKM)
EKM gives customers control over their encryption keys, stored in their own AWS KMS account. This means:
EKM Architecture:
┌──────────────────────────────────────────────────────────────┐
│ Customer's AWS Account │
│ │
│ ┌───────────────────────────┐ │
│ │ AWS KMS │ │
│ │ Customer Master Key │◀── Customer controls access │
│ │ (CMK) │ via IAM policies │
│ └────────────┬──────────────┘ │
└────────────────┼─────────────────────────────────────────────┘
│ Slack requests key access
▼
┌──────────────────────────────────────────────────────────────┐
│ Slack's Infrastructure │
│ │
│ Message/File encryption flow: │
│ 1. Slack generates a data encryption key (DEK) per object │
│ 2. DEK encrypts the message/file content │
│ 3. DEK itself is encrypted by the customer's CMK (KMS) │
│ 4. Encrypted DEK stored alongside the encrypted content │
│ │
│ Decryption flow: │
│ 1. Slack retrieves the encrypted DEK │
│ 2. Calls customer's KMS to decrypt the DEK │
│ 3. Uses decrypted DEK to decrypt the message/file │
│ 4. DEK is never stored in plaintext, only held in memory │
│ │
│ Customer kill switch: │
│ • Revoke Slack's IAM access to KMS │
│ • All data becomes immediately unreadable │
│ • Slack cannot decrypt any messages or files │
│ • Used as emergency data access revocation │
└──────────────────────────────────────────────────────────────┘
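This DEK-per-object design is standard envelope encryption. Below is a self-contained sketch using the JDK’s AES-GCM, with a locally generated AES key standing in for the customer’s KMS-held CMK (real EKM would call KMS to wrap and unwrap the DEK; no KMS client is shown here):

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Envelope encryption sketch: a fresh DEK encrypts each object, and the DEK
// itself is wrapped by a master key. A locally generated AES key stands in
// for the customer's KMS-held CMK; real EKM calls KMS to wrap/unwrap DEKs.
public class EnvelopeEncryption {
    record Encrypted(byte[] ciphertext, byte[] iv, byte[] wrappedDek, byte[] dekIv) {}

    static final SecureRandom RNG = new SecureRandom();

    static Encrypted encrypt(SecretKey cmk, byte[] plaintext) throws Exception {
        SecretKey dek = newAesKey();                       // per-object DEK
        byte[] iv = randomIv();
        byte[] ciphertext = gcm(Cipher.ENCRYPT_MODE, dek, iv, plaintext);
        byte[] dekIv = randomIv();
        // Wrap the DEK under the CMK; only the wrapped form is ever stored.
        byte[] wrappedDek = gcm(Cipher.ENCRYPT_MODE, cmk, dekIv, dek.getEncoded());
        return new Encrypted(ciphertext, iv, wrappedDek, dekIv);
    }

    static byte[] decrypt(SecretKey cmk, Encrypted e) throws Exception {
        // If the customer revokes CMK access, this unwrap fails: the kill switch.
        byte[] dekBytes = gcm(Cipher.DECRYPT_MODE, cmk, e.dekIv(), e.wrappedDek());
        SecretKey dek = new SecretKeySpec(dekBytes, "AES");
        return gcm(Cipher.DECRYPT_MODE, dek, e.iv(), e.ciphertext());
    }

    static byte[] gcm(int mode, SecretKey key, byte[] iv, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(mode, key, new GCMParameterSpec(128, iv));
        return c.doFinal(data);
    }

    static SecretKey newAesKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        return kg.generateKey();
    }

    static byte[] randomIv() { byte[] iv = new byte[12]; RNG.nextBytes(iv); return iv; }

    public static void main(String[] args) throws Exception {
        SecretKey cmk = newAesKey(); // stand-in for the customer master key
        Encrypted e = encrypt(cmk, "Deploying v2.3.1 to production".getBytes());
        System.out.println(new String(decrypt(cmk, e)));
    }
}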
Incident Management
Slack has experienced several high-profile outages, and its incident management and postmortem culture are considered industry-leading.
Incident Response Process
Slack Incident Lifecycle:
Detection (automated):
• Datadog monitors → latency spikes, error rate increases
• Synthetic canary tests → health check failures
• Customer reports → status page auto-correlation
│
▼
Triage (0–5 min):
• PagerDuty alerts on-call engineer
• Incident Commander (IC) role assigned
• Severity classified: SEV-1 (full outage) → SEV-4 (minor degradation)
• War room opened (ironically, in Slack — or backup: Zoom bridge)
│
▼
Mitigation (5–60 min):
• IC coordinates parallel investigation streams
• Common mitigations attempted:
- Rollback last deployment (if correlated)
- Drain traffic from unhealthy region
- Disable feature flag that may be causing issues
- Scale up capacity if load-related
- Failover to replica database if primary is unhealthy
│
▼
Resolution:
• Root cause identified and fixed
• Monitoring confirms metrics return to normal
• Customer-facing status page updated
│
▼
Postmortem (within 72 hours):
• Blameless postmortem document written
• Timeline reconstruction (minute-by-minute)
• Root cause analysis (5 Whys)
• Action items with owners and due dates
• Shared across engineering org for learning
Postmortem Culture
Slack’s postmortem process is blameless — the focus is on systemic improvements, not individual blame. Key principles:
- Every SEV-1 and SEV-2 gets a postmortem — no exceptions. The postmortem is a first-class artifact, as important as the fix.
- Action items are tracked to completion — postmortem action items go into the engineering backlog with assigned owners and SLA for completion.
- Patterns are identified — if the same category of issue recurs (e.g., “deploy caused cascading failure”), systemic solutions are prioritized (e.g., improved canary deployment).
- Chaos engineering — Slack proactively tests failure modes: database failovers, region outages, dependency failures. The goal is to discover weaknesses before customers do.
Deployment Strategy
Slack deploys to production multiple times per day with a sophisticated deployment pipeline designed to minimize blast radius.
Canary Deployments
Canary Deployment Pipeline:
┌─────────┐ ┌───────────┐ ┌──────────────┐ ┌───────────┐
│ Build │───▶│ Staging │───▶│ Canary │───▶│ Full │
│ & Test │ │ Deploy │ │ (1% traffic) │ │ Rollout │
└─────────┘ └───────────┘ └──────────────┘ └───────────┘
│
▼
┌──────────────┐
│ Automated │
│ Health Check │
│ │
│ • Error rate │
│ • Latency │
│ • CPU/Memory │
│ • Business │
│ metrics │
└──────┬───────┘
│
Pass? ──┤── Fail?
│ │
▼ ▼
Expand Auto-rollback
to 5% (revert to
→ 25% previous
→ 100% version)
Feature Flags
Slack uses feature flags extensively to decouple deployment from release:
Feature Flag System:
Flag Definition:
{
"flag_name": "new_message_composer_v2",
"default": false,
"rules": [
// Slack employees: always on (dogfood)
{ "workspace_ids": ["T012SLACK"], "value": true },
// Beta testers: on
{ "user_segment": "beta_testers", "value": true },
// Enterprise tier: gradual rollout 25%
{ "plan": "enterprise", "percentage": 25 },
// Everyone else: off
{ "default": false }
],
"kill_switch": true,
"owner": "team-messaging",
"created": "2026-03-15"
}
Evaluation at Runtime:
1. Request hits API server with user context (workspace, user, plan)
2. Flag evaluator checks rules top-to-bottom
3. First matching rule determines flag value
4. Kill switch: set to false for ALL users instantly (no deploy needed)
Flag Categories:
• Release flags: Gate new features during gradual rollout
• Ops flags: Circuit breakers, capacity controls
• Experiment flags: A/B tests with metric tracking
• Permission flags: Feature entitlements by plan tier
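A minimal evaluator for rules of this shape might look like the following — a simplified model, not Slack’s actual flag service. Note that the percentage rollout hashes the workspace ID, so a given workspace gets a stable decision rather than flapping on every request.

import java.util.List;
import java.util.Set;

// Sketch of top-to-bottom rule evaluation (simplified model, not Slack's real
// flag service). Percentage rollouts hash the workspace id so each workspace
// gets a stable decision across requests.
public class FlagEvaluator {
    record Context(String workspaceId, String userSegment, String plan) {}
    record Rule(Set<String> workspaceIds, String userSegment, String plan,
                Integer percentage, boolean value) {}
    record Flag(boolean killSwitchEngaged, List<Rule> rules, boolean defaultValue) {}

    static boolean evaluate(Flag flag, Context ctx) {
        if (flag.killSwitchEngaged()) return false; // off for ALL users, no deploy
        for (Rule r : flag.rules()) {
            if (r.workspaceIds() != null && !r.workspaceIds().contains(ctx.workspaceId())) continue;
            if (r.userSegment() != null && !r.userSegment().equals(ctx.userSegment())) continue;
            if (r.plan() != null && !r.plan().equals(ctx.plan())) continue;
            if (r.percentage() != null
                    && Math.floorMod(ctx.workspaceId().hashCode(), 100) >= r.percentage()) {
                continue; // workspace falls outside the rollout bucket
            }
            return r.value(); // first matching rule determines the flag value
        }
        return flag.defaultValue();
    }

    public static void main(String[] args) {
        Flag composerV2 = new Flag(false, List.of(
                new Rule(Set.of("T012SLACK"), null, null, null, true), // dogfood
                new Rule(null, "beta_testers", null, null, true),      // beta testers
                new Rule(null, null, "enterprise", 25, true)           // 25% of enterprise
        ), false);
        System.out.println(evaluate(composerV2,
                new Context("T012SLACK", null, "enterprise"))); // true via dogfood rule
    }
}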
Gradual Rollout Strategy
New features follow a multi-stage rollout:
- Stage 0 — Dogfood: Slack’s own workspace gets the feature first. Internal teams use it daily and report issues.
- Stage 1 — Beta: Opt-in workspaces (developer partners, power users) test the feature. Feedback loop is tight.
- Stage 2 — Gradual: 1% → 5% → 25% → 50% → 100% rollout. Each stage has a bake period (24–72 hours) where metrics are monitored.
- Stage 3 — GA: Feature is generally available. Feature flag is cleaned up (removed from code) within 30 days.
Key Takeaways
- Vitess for sharding — transparent MySQL sharding preserves existing code while enabling horizontal scaling. Workspace-level sharding provides natural data locality.
- Flannel solves the boot problem — caching full workspace state at the edge with incremental delta updates reduces reconnection load by 99.9%.
- Async everything — only the MySQL write is synchronous. Notifications, search indexing, link previews, and analytics are all asynchronous via the job queue.
- WebSocket fan-out — channel-based pub/sub across a fleet of WebSocket gateways enables real-time delivery to millions of concurrent users.
- Search per workspace — per-workspace search indexes enable fast queries and isolate tenant data naturally.
- Feature flags decouple deploy from release — canary deployments catch issues at 1% traffic; feature flags enable instant rollback without redeployment.
- EKM gives enterprises control — customer-managed encryption keys provide a kill switch for data access, enabling adoption in regulated industries.
- Blameless postmortems — treating incidents as systemic learning opportunities builds a culture of reliability and continuous improvement.