High Level Design Series · Case Studies · Post 69 of 70

Case Study: Slack’s Architecture

Slack at Scale

Slack is the workplace messaging platform used by over 20 million daily active users across 750,000+ organizations. At peak, Slack handles millions of WebSocket connections simultaneously, delivers messages in under 500 ms globally, and stores trillions of messages across enterprise workspaces. Every second, hundreds of thousands of messages flow through its infrastructure — each triggering writes, notifications, indexing, and real-time delivery.

What makes Slack architecturally fascinating is the sheer number of hard problems it solves at once: real-time delivery over millions of concurrent WebSockets, horizontal sharding of a relational store, sub-second search across trillions of messages, edge caching of multi-megabyte boot payloads, and enterprise-grade compliance, retention, and key management.

Key Slack Stats (circa 2024–2025): 20M+ DAU · 750K+ paid workspaces · 1.5B+ messages sent per week · 2,600+ apps in the App Directory · 99.99% uptime target · Messages delivered in <500ms P99

Tech Stack Evolution

Slack’s technology stack has undergone significant evolution since its founding in 2013. Understanding this journey illuminates why certain architectural decisions were made.

The Early Days: PHP & Hack

Slack was originally built as a monolithic PHP application — a pragmatic choice for rapid prototyping during its origin as an internal tool at Tiny Speck (the game company that pivoted to create Slack). The initial stack:

Early Slack Stack (2013–2016):
├── Language: PHP (later migrated to Hack/HHVM)
├── Web Server: Apache → Nginx + PHP-FPM
├── Database: MySQL (single primary, multiple replicas)
├── Cache: Memcached (session state, query results)
├── Queue: Redis-based job queue
├── Search: Apache Solr
├── Real-time: Custom WebSocket server (Node.js)
├── File Storage: Amazon S3
└── CDN: CloudFront for static assets and file delivery

PHP served Slack well to ~1M DAU, but several pain points emerged: a weak type system for a fast-growing codebase, a process-per-request execution model that made highly concurrent work expensive, and comparatively thin profiling and debugging tooling. These are exactly the gaps the later moves to Hack and then Java were meant to close.

The Modern Stack: Java Services

Starting around 2017, Slack began decomposing the monolith into Java-based microservices. The migration was incremental — the PHP monolith still handles some traffic today, but critical paths run on the Java service mesh:

Modern Slack Stack (2020+):
├── API Layer
│   ├── Hack (HHVM): Legacy API endpoints, web app rendering
│   └── Java (JDK 17+): New API services, high-throughput paths
│
├── Service Mesh
│   ├── gRPC for inter-service communication
│   ├── Envoy sidecar proxies for load balancing, observability
│   └── Service discovery via Consul
│
├── Data Layer
│   ├── MySQL: Primary data store (messages, channels, users)
│   ├── Vitess: MySQL sharding proxy (workspace-level sharding)
│   ├── Memcached: Hot data caching (billions of gets/day)
│   └── Redis: Ephemeral state, presence, rate limiting
│
├── Async Processing
│   ├── Kafka: Event streaming backbone
│   ├── Custom Job Queue: Async task execution
│   └── Redis Streams: Lightweight pub/sub for presence
│
├── Search & Analytics
│   ├── Solr/Elasticsearch: Message search (per-workspace indexes)
│   └── Presto + Hive: Analytics, data warehouse
│
├── Real-time
│   ├── WebSocket Gateway (Java): Persistent connections
│   ├── Channel-based pub/sub fan-out
│   └── Flannel: Edge caching layer
│
└── Infrastructure
    ├── AWS (primary cloud)
    ├── Terraform for infrastructure-as-code
    ├── Kubernetes for container orchestration
    └── Datadog + PagerDuty for monitoring/alerting
Why Java? Slack chose Java for new services due to: (1) mature concurrency primitives (Netty's async networking, plus virtual threads on JDK 21+), (2) excellent profiling/debugging tooling, (3) a strong type system for large-team collaboration, (4) a massive ecosystem of battle-tested libraries, and (5) predictable GC performance with ZGC/Shenandoah.

Vitess for MySQL Sharding

Vitess is the cornerstone of Slack’s data tier. Originally developed at YouTube and later donated to CNCF, Vitess provides horizontal sharding for MySQL while preserving the MySQL wire protocol — applications connect to Vitess as if it were a regular MySQL server.
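
Because VTGate speaks the MySQL wire protocol, the application side of this is almost invisible. A minimal sketch using PyMySQL, assuming a VTGate listening on its conventional MySQL port (host, credentials, and the query are illustrative):

# Sketch: connecting to Vitess as if it were a single MySQL server
import pymysql  # pip install pymysql

conn = pymysql.connect(host="vtgate.internal", port=15306,
                       user="app", password="app-secret", database="slack")
with conn.cursor() as cur:
    # VTGate parses the SQL, extracts the shard key (workspace_id),
    # and routes the query to the one shard that owns this workspace.
    cur.execute(
        "SELECT ts, user_id, text FROM messages "
        "WHERE workspace_id = %s AND channel_id = %s "
        "ORDER BY ts DESC LIMIT 50",
        ("T12345", "C98765"))
    recent_messages = cur.fetchall()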

Why Slack Chose Vitess

By 2016, Slack’s single-primary MySQL setup was hitting hard limits. The options were:

| Option | Pros | Cons |
|---|---|---|
| Vertical scaling | No code changes | Hard ceiling, exponential cost |
| NoSQL migration | Built-in sharding | Massive rewrite, lose ACID, risky at scale |
| Application-level sharding | Full control | Every service needs shard-aware code |
| Vitess | MySQL compatible, proven at YouTube, transparent sharding | Operational complexity, learning curve |

Slack chose Vitess because it allowed them to shard MySQL horizontally without rewriting application code. Existing MySQL queries continued to work — Vitess transparently routes them to the correct shard.

Vitess Architecture at Slack

                         ┌─────────────────────────┐
                         │      Application         │
                         │   (Java / Hack services)  │
                         └────────────┬──────────────┘
                                      │ MySQL protocol
                                      ▼
                         ┌─────────────────────────┐
                         │        VTGate            │
                         │   (Stateless proxy)      │
                         │   • SQL parsing          │
                         │   • Query routing        │
                         │   • Scatter-gather       │
                         │   • Connection pooling   │
                         └────────────┬──────────────┘
                                      │ gRPC
                    ┌─────────────────┼─────────────────┐
                    ▼                 ▼                  ▼
          ┌──────────────┐ ┌──────────────┐  ┌──────────────┐
          │   VTTablet   │ │   VTTablet   │  │   VTTablet   │
          │   (Shard 0)  │ │   (Shard 1)  │  │   (Shard N)  │
          │              │ │              │  │              │
          │ ┌──────────┐ │ │ ┌──────────┐ │  │ ┌──────────┐ │
          │ │  MySQL    │ │ │ │  MySQL   │ │  │ │  MySQL   │ │
          │ │ Primary + │ │ │ │ Primary +│ │  │ │ Primary +│ │
          │ │ Replicas  │ │ │ │ Replicas │ │  │ │ Replicas │ │
          │ └──────────┘ │ │ └──────────┘ │  │ └──────────┘ │
          └──────────────┘ └──────────────┘  └──────────────┘

          Topology Server (etcd): Stores shard map, tablet health,
          serving state, and schema versions for all keyspaces.

Key Vitess Components

• VTGate: a stateless proxy speaking the MySQL wire protocol; it parses SQL, routes each query to the right shard, scatter-gathers cross-shard results, and pools connections
• VTTablet: a per-shard management daemon fronting a MySQL primary and its replicas, enforcing query limits and reporting health
• Topology Server (etcd): the source of truth for the shard map, tablet health, serving state, and schema versions

Workspace-Level Sharding

Slack shards primarily by workspace ID. This is the natural partition boundary because nearly every query (messages, channels, memberships, file metadata) is scoped to a single workspace, so routing on workspace_id sends each query to exactly one shard. The VSchema below shows the pattern:

// Vitess VSchema — workspace-based sharding
{
  "sharded": true,
  "vindexes": {
    "workspace_hash": {
      "type": "xxhash"  // Consistent hashing on workspace_id
    }
  },
  "tables": {
    "messages": {
      "column_vindexes": [
        { "column": "workspace_id", "name": "workspace_hash" }
      ]
    },
    "channels": {
      "column_vindexes": [
        { "column": "workspace_id", "name": "workspace_hash" }
      ]
    },
    "users_workspace": {
      "column_vindexes": [
        { "column": "workspace_id", "name": "workspace_hash" }
      ]
    },
    "files_metadata": {
      "column_vindexes": [
        { "column": "workspace_id", "name": "workspace_hash" }
      ]
    }
  }
}

// Query routing example:
// App sends: SELECT * FROM messages WHERE workspace_id = 'T12345' AND channel_id = 'C98765'
// VTGate:
//   1. Extracts workspace_id = 'T12345'
//   2. Computes xxhash('T12345') → shard range 40-60
//   3. Routes to VTTablet for shard 40-60
//   4. VTTablet executes query on local MySQL
//   5. Returns result to VTGate → Application
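
To make step 2 concrete, here is a hypothetical sketch of how a hash vindex maps a key onto a shard range (the xxhash library is real; the shard boundaries and helper are illustrative, not Vitess internals):

# Sketch: hash-vindex routing from workspace_id to keyspace ID to shard
import bisect
import xxhash  # pip install xxhash

def shard_for(workspace_id: str, shard_upper_bounds: list[int]) -> int:
    # The xxhash vindex hashes the column value into a 64-bit keyspace ID...
    keyspace_id = xxhash.xxh64(workspace_id.encode()).intdigest()
    # ...and the shard whose range contains that ID serves the query.
    return bisect.bisect_right(shard_upper_bounds, keyspace_id)

# Four equal shards over the 64-bit keyspace: -40, 40-80, 80-c0, c0-
bounds = [0x40 << 56, 0x80 << 56, 0xC0 << 56]
print(shard_for("T12345", bounds))  # same workspace always lands on same shard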

Online Resharding

One of Vitess’s killer features is online resharding — splitting or merging shards without downtime. When a shard becomes hot (e.g., a massive enterprise workspace dominates a shard), Slack can split it:

Resharding Process (splitting shard 40-60 → 40-50, 50-60):

Step 1: Create new VTTablets for shards 40-50 and 50-60
Step 2: Start VReplication — stream row changes from old shard to new shards
        (VReplication filters rows by their vindex hash value)
Step 3: Catch-up phase — new shards lag behind by seconds, then milliseconds
Step 4: Cut-over:
        a. Stop writes to old shard (brief ~100ms pause)
        b. Verify new shards are fully caught up
        c. Update topology server: new shards are now serving
        d. Resume writes — now routed to new shards
Step 5: Old shard decommissioned after verification period

Total downtime: <1 second (during cut-over)
Application awareness: Zero — VTGate handles the routing change transparently

Vitess at Slack’s scale: Slack operates hundreds of MySQL shards managed by Vitess, handling billions of queries per day. Each shard runs MySQL 8.0 with one primary and two replicas (one for failover, one for analytics reads).

Job Queue System

Not everything in Slack happens synchronously. Sending a message triggers a cascade of async work — and Slack’s job queue system is the backbone of this asynchronous processing.

Async Task Categories

When a message is sent, the API server writes it to MySQL and returns immediately. Everything else is async:

Message Post → API Server (synchronous: ~50ms)
  │
  ├── Enqueue: "index_message" (Search indexing)
  ├── Enqueue: "send_push_notifications" (iOS/Android push)
  ├── Enqueue: "send_email_notification" (for offline users)
  ├── Enqueue: "execute_workflows" (Workflow Builder triggers)
  ├── Enqueue: "update_unread_counts" (Badge counts)
  ├── Enqueue: "run_app_event_subscriptions" (Bot events)
  ├── Enqueue: "generate_link_previews" (URL unfurling)
  ├── Enqueue: "process_file_attachments" (Thumbnails, virus scan)
  └── Enqueue: "update_analytics" (Message volume metrics)

Queue Architecture

Slack’s job queue evolved from a simple Redis-backed queue to a more sophisticated system:

Job Queue Architecture:

┌─────────────┐    ┌───────────────────────────────────────┐
│  Producers   │    │           Kafka Topic                 │
│ (API servers)│───▶│  "async-jobs" (partitioned by type)   │
└─────────────┘    └───────────────┬───────────────────────┘
                                   │
                   ┌───────────────┼───────────────┐
                   ▼               ▼               ▼
            ┌────────────┐ ┌────────────┐ ┌────────────┐
            │ Worker Pool│ │ Worker Pool│ │ Worker Pool│
            │ "search"   │ │ "notify"   │ │ "webhooks" │
            │ 200 workers│ │ 150 workers│ │ 100 workers│
            └────────────┘ └────────────┘ └────────────┘

Job Priorities:
  P0 (Critical): Push notifications, real-time events — <1s
  P1 (High):     Search indexing, unread counts — <5s
  P2 (Normal):   Link previews, file processing — <30s
  P3 (Low):      Analytics, compliance logging — <5min

Retry Policy:
  • Exponential backoff: 1s → 2s → 4s → 8s → ... → max 5min
  • Max retries: 5 (P0), 10 (P1), 20 (P2/P3)
  • Dead letter queue for permanently failed jobs
  • Idempotency keys prevent duplicate execution

Delivery Guarantees

Slack’s job queue provides at-least-once delivery with idempotency: Kafka redelivers any job whose worker crashes or times out before acknowledging, so a job can run more than once; every job therefore carries an idempotency key, and workers record completed keys so that redeliveries become no-ops. The result is effectively-once side effects without the cost of true exactly-once delivery.
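
A minimal worker-loop sketch under these rules, assuming kafka-python and a Redis set of completed idempotency keys (topic, group, and field names are illustrative, not Slack's actual code):

# Sketch: at-least-once consumer made effectively-once via idempotency keys
import json
import time
import redis                      # pip install redis kafka-python
from kafka import KafkaConsumer

r = redis.Redis()
consumer = KafkaConsumer("async-jobs", group_id="notify-workers",
                         enable_auto_commit=False)

def handle(job: dict) -> None:
    ...  # e.g., send the push notification / index the message

for record in consumer:
    job = json.loads(record.value)
    done_key = f"done:{job['idempotency_key']}"
    if r.exists(done_key):        # redelivered duplicate: skip, just ack
        consumer.commit()
        continue
    delay, attempts = 1, 0
    while True:
        try:
            handle(job)
            break
        except Exception:
            attempts += 1
            if attempts >= job.get("max_retries", 5):
                r.rpush("dead-letter", record.value)  # park for inspection
                break
            time.sleep(delay)
            delay = min(delay * 2, 300)  # 1s -> 2s -> 4s -> ... -> max 5 min
    r.set(done_key, 1, ex=86_400)  # remember completion
    consumer.commit()              # ack only after the work is done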

Real-Time Messaging

Real-time message delivery is Slack’s core product experience. When Alice sends a message in #engineering, every member of that channel must see it appear within milliseconds — even if there are 10,000 members distributed across 50 countries.

WebSocket Connection Layer

Each Slack client (desktop, mobile, browser) maintains a persistent WebSocket connection to Slack’s WebSocket Gateway:

WebSocket Gateway Architecture:

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Client App  │  │  Client App  │  │  Client App  │
│  (Desktop)   │  │  (Mobile)    │  │  (Browser)   │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │ WSS             │ WSS             │ WSS
       ▼                 ▼                 ▼
┌──────────────────────────────────────────────────┐
│             Load Balancer (L4/L7)                 │
│         (HAProxy / AWS NLB)                       │
│  • Sticky sessions by connection ID               │
│  • Health check WebSocket gateways                │
│  • TLS termination                                │
└──────────────────────┬───────────────────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
   ┌────────────┐┌────────────┐┌────────────┐
   │ WS Gateway ││ WS Gateway ││ WS Gateway │
   │  Server 1  ││  Server 2  ││  Server N  │
   │            ││            ││            │
   │ ~100K conn ││ ~100K conn ││ ~100K conn │
   │            ││            ││            │
   │ In-memory: ││ In-memory: ││ In-memory: │
   │ user→conn  ││ user→conn  ││ user→conn  │
   │ channel→   ││ channel→   ││ channel→   │
   │   users    ││   users    ││   users    │
   └─────┬──────┘└─────┬──────┘└─────┬──────┘
         │             │             │
         └─────────────┼─────────────┘
                       │ Subscribe
                       ▼
              ┌──────────────────┐
              │   Message Bus    │
              │ (Redis Pub/Sub   │
              │  or Kafka)       │
              └──────────────────┘

Channel-Based Pub/Sub Fan-Out

When a message is posted to a channel, the fan-out process delivers it to every connected member:

Fan-out Process:

1. API server writes message to MySQL (via Vitess)
2. API server publishes event to Message Bus:
   {
     "type": "message",
     "channel": "C98765",
     "workspace": "T12345",
     "user": "U111",
     "text": "Deploying v2.3.1 to production",
     "ts": "1714567890.000100"
   }

3. All WS Gateway servers subscribed to workspace T12345 receive the event
4. Each WS Gateway:
   a. Looks up local channel membership: "Which of MY connected users are in C98765?"
   b. For each matching user, serializes the event and writes to their WebSocket
   c. For users with multiple connected devices, sends to ALL connections

Fan-out complexity:
  • Small channel (10 members, 1 gateway): 10 WebSocket writes
  • Large channel (10,000 members, 50 gateways):
    Each gateway sends to its local subset — average 200 writes each
    Total: 10,000 WebSocket writes across 50 servers in parallel
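
On each gateway, the local half of this fan-out is two in-memory lookups plus parallel socket writes. A simplified sketch (the maps mirror the gateway diagram above; everything else is illustrative):

# Sketch: a gateway's local fan-out for one message-bus event
import asyncio
import json

conns: dict[str, set] = {}         # user_id -> this gateway's WebSockets
members: dict[str, set[str]] = {}  # channel_id -> locally connected members

async def on_bus_event(event: dict) -> None:
    payload = json.dumps(event)
    local_users = members.get(event["channel"], set())
    writes = [ws.send(payload)
              for user_id in local_users
              for ws in conns.get(user_id, ())]  # every device gets a copy
    # Fan out in parallel so one slow client cannot stall the rest.
    await asyncio.gather(*writes, return_exceptions=True)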

Message Ordering Guarantees

Slack uses Lamport-style timestamps (the ts field) for message ordering within a channel: each ts combines epoch seconds with a per-channel sequence suffix (e.g., 1714567890.000100), is unique within its channel, and increases monotonically, so every client that sorts by ts converges on the same order.
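
A toy allocator illustrating the scheme (an illustration of the idea, not Slack's implementation):

# Sketch: per-channel monotonic ts = "<epoch seconds>.<6-digit sequence>"
import time
from collections import defaultdict

_last = defaultdict(lambda: (0, 0))  # channel_id -> (seconds, sequence)

def next_ts(channel_id: str) -> str:
    now = int(time.time())
    sec, seq = _last[channel_id]
    if now > sec:
        sec, seq = now, 0   # new second: reset the sequence
    else:
        seq += 1            # same second (or clock went back): bump sequence
    _last[channel_id] = (sec, seq)
    # Fixed-width fraction keeps string sort order == delivery order.
    return f"{sec}.{seq:06d}"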


Flannel: Slack’s Edge Caching Layer

Flannel is Slack’s custom client-aware caching layer that sits between the API servers and the clients. It is one of Slack’s most innovative architectural components — purpose-built to solve the boot problem.

The Boot Problem

When a Slack client opens, it needs to load everything about the user’s workspace:

Client boot payload ("rtm.start" or "client.boot"):
├── Workspace metadata (name, icon, settings, plan)
├── User list (every member of the workspace: id, name, avatar, status)
├── Channel list (every channel: name, purpose, membership, unread counts)
├── IM/Group list (direct messages, multi-party DMs)
├── Emoji list (custom emoji: name → URL mappings)
├── User groups (handle groups, their members)
├── Saved items, reminders, preferences
└── Bot users and app configurations

For a 50,000-person workspace, this payload is 5–20 MB of JSON.

Without Flannel, every client boot required the API server to query dozens of MySQL tables, then join and serialize massive result sets. At Slack’s scale this created devastating load, especially after outages, when every client reconnects simultaneously (the “thundering herd” problem).

Flannel Architecture

Flannel Architecture:

┌──────────────┐
│  Slack Client │
└──────┬───────┘
       │ "Give me everything for workspace T12345"
       ▼
┌──────────────────────────────────────────────────┐
│                  Flannel Server                   │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │          In-Memory Cache (per workspace)     │  │
│  │                                              │  │
│  │  T12345 → {                                  │  │
│  │    users: [50,000 user objects],             │  │
│  │    channels: [8,000 channel objects],        │  │
│  │    emoji: [2,000 custom emoji],              │  │
│  │    ...                                       │  │
│  │    version: 4,827,301                        │  │
│  │  }                                           │  │
│  │                                              │  │
│  │  Cache hit? → Return instantly (~5ms)        │  │
│  │  Cache miss? → Fetch from API, cache, return │  │
│  └─────────────────────────────────────────────┘  │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │       Change Stream Consumer (Kafka)         │  │
│  │                                              │  │
│  │  Listens for workspace mutation events:      │  │
│  │  • user_joined, user_left, profile_updated   │  │
│  │  • channel_created, channel_archived         │  │
│  │  • emoji_added, emoji_removed                │  │
│  │  • preferences_changed                       │  │
│  │                                              │  │
│  │  On event → Apply incremental update to      │  │
│  │  cached workspace state, bump version        │  │
│  └─────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

Key Design Decisions:
  1. Per-workspace caching: Each workspace's full state is a single cache entry
  2. Incremental updates: Flannel never re-fetches the full workspace — it
     applies events as diffs (user added → append to users list)
  3. Versioned state: Every mutation bumps a version counter. Clients can
     request "give me changes since version X" for efficient reconnection
  4. Sharded by workspace: Flannel servers are assigned workspace ranges,
     ensuring each workspace's state lives on exactly one Flannel server
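
The change-stream consumer's core loop is a small event-to-diff dispatch. A hypothetical sketch of decisions 2 and 3 (incremental updates plus the version bump):

# Sketch: applying workspace mutation events as diffs to the cached state
cache = {
    "T12345": {"users": {}, "channels": {}, "emoji": {}, "version": 4_827_301},
}

def apply_event(event: dict) -> None:
    ws = cache[event["workspace"]]
    kind = event["type"]
    if kind == "user_joined":
        ws["users"][event["user"]["id"]] = event["user"]
    elif kind == "user_left":
        ws["users"].pop(event["user_id"], None)
    elif kind == "channel_created":
        ws["channels"][event["channel"]["id"]] = event["channel"]
    elif kind == "emoji_added":
        ws["emoji"][event["name"]] = event["url"]
    # ...one small handler per mutation type; never a full re-fetch
    ws["version"] += 1  # clients can now request deltas "since version X"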

Delta Updates and Reconnection

Flannel’s versioned state enables delta reconnection — when a client reconnects, it sends its last known version and receives only the changes:

// Client reconnects after brief disconnection:
GET /api/flannel.delta?workspace=T12345&since_version=4827290

// Flannel returns only the 11 changes since that version:
{
  "version": 4827301,
  "changes": [
    { "type": "user_status_changed", "user": "U555", "status": "🏖️ PTO" },
    { "type": "channel_created", "channel": { "id": "C99999", "name": "new-project" } },
    { "type": "user_joined_channel", "user": "U123", "channel": "C98765" },
    // ... 8 more incremental changes
  ]
}

// Instead of re-downloading 15 MB, the client receives ~2 KB of deltas.
// This reduces reconnection load by 99.9% during thundering herd scenarios.

Search Infrastructure

Slack search allows users to find any message, file, or conversation across their workspace’s entire history. At scale, this means searching across trillions of messages with sub-second latency.

Search Stack

Slack uses a combination of Solr and Elasticsearch for its search infrastructure, with the stack evolving over time:

Search Architecture:

┌────────────┐    ┌──────────────────────────────────────┐
│  User types │    │          Search API Service          │
│  query      │───▶│                                      │
└────────────┘    │  1. Parse query (operators, filters)  │
                  │  2. Resolve workspace shard            │
                  │  3. Build Solr/ES query DSL            │
                  │  4. Execute against workspace index    │
                  │  5. Re-rank results                    │
                  │  6. Hydrate (load full messages)       │
                  │  7. Apply access control filtering     │
                  │  8. Return results                     │
                  └───────────────┬──────────────────────┘
                                  │
                                  ▼
                  ┌──────────────────────────────────────┐
                  │     Search Cluster (per workspace)    │
                  │                                       │
                  │  Index per workspace:                  │
                  │  ┌───────────────────────────────┐    │
                  │  │ messages: {                    │    │
                  │  │   ts, text, user_id,           │    │
                  │  │   channel_id, channel_type,    │    │
                  │  │   has_file, has_link,           │    │
                  │  │   reactions, thread_ts,         │    │
                  │  │   workspace_id                  │    │
                  │  │ }                               │    │
                  │  │ files: {                        │    │
                  │  │   name, content_text,           │    │
                  │  │   file_type, user_id,           │    │
                  │  │   channel_id, upload_ts         │    │
                  │  │ }                               │    │
                  │  └───────────────────────────────┘    │
                  └──────────────────────────────────────┘

Indexing Pipeline

Messages are indexed asynchronously via the job queue system. The indexing pipeline:

Indexing Pipeline:

1. Message written to MySQL (via Vitess)
2. "index_message" job enqueued to Kafka
3. Search Indexer Worker picks up job:
   a. Fetches full message from MySQL (with context: channel name, user info)
   b. Tokenizes text content
   c. Extracts entities (mentions, links, emoji, code blocks)
   d. Generates search document:
      {
        "ts": "1714567890.000100",
        "text": "Deploying v2.3.1 to production @oncall",
        "text_analyzed": ["deploy", "v2.3.1", "production", "oncall"],
        "user_id": "U111",
        "user_name": "alice",
        "channel_id": "C98765",
        "channel_name": "engineering",
        "channel_type": "public",
        "has_mention": true,
        "mentioned_users": ["U222"],
        "has_link": false,
        "has_file": false,
        "reactions": [],
        "thread_ts": null,
        "workspace_id": "T12345"
      }
   e. Sends document to Solr/ES for the workspace's index
   f. Acknowledges job completion

Indexing Latency: P50 = 2s, P99 = 8s (messages become searchable within seconds)
Throughput: Hundreds of thousands of messages indexed per second globally

Access Control in Search

Search results must respect channel permissions. A user cannot find messages from private channels they’re not a member of: at query time, the Search API constrains results to the channels the user can see (public channels in the workspace, plus the private channels and DMs in their membership list) by attaching a channel_id filter to every query, with a final post-filter pass as defense in depth (step 7 in the diagram above).
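
In Elasticsearch terms, the guard can be a filter clause attached to every query. A hypothetical sketch (field names follow the search document shown earlier):

# Sketch: scoping a search query to channels the user can actually see
def build_search_query(workspace_id: str, text: str,
                       visible_channels: list[str]) -> dict:
    return {
        "query": {
            "bool": {
                "must": [{"match": {"text": text}}],
                "filter": [
                    {"term": {"workspace_id": workspace_id}},
                    # public channels + the user's private channels and DMs
                    {"terms": {"channel_id": visible_channels}},
                ],
            }
        }
    }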

Search operators: Slack supports a rich query syntax (from:@alice, in:#engineering, has:link, before:2024-01-01, during:March), parsed by the Search API into structured filters applied at the Solr/ES level, avoiding full-text scans for filtered queries.

Channel & DM Architecture

Channels are the fundamental abstraction in Slack — every conversation happens in a channel, whether it’s a public team channel, a private group, or a direct message.

Channel Types

| Type | Prefix | Max Members | Visibility |
|---|---|---|---|
| Public Channel | C | Unlimited | Discoverable, joinable by anyone |
| Private Channel | G | Unlimited | Invite-only, hidden from non-members |
| DM (1:1) | D | 2 | Private between two users |
| Group DM (MPDM) | G | 9 | Private group conversation |
| Shared Channel | C | Unlimited | Spans multiple workspaces (Slack Connect) |

Message Data Model

-- Core message schema (simplified from Slack's actual schema)
CREATE TABLE messages (
    workspace_id   VARCHAR(12) NOT NULL,   -- Shard key (Vitess vindex)
    channel_id     VARCHAR(12) NOT NULL,
    ts             VARCHAR(20) NOT NULL,   -- "1714567890.000100" (unique per channel)
    user_id        VARCHAR(12) NOT NULL,
    text           TEXT,
    thread_ts      VARCHAR(20),            -- Parent message ts (NULL if not in thread)
    subtype        VARCHAR(30),            -- 'bot_message', 'file_share', etc.
    edited_ts      VARCHAR(20),            -- Timestamp of last edit
    is_deleted     BOOLEAN DEFAULT FALSE,
    reactions      JSON,                   -- [{"name":"thumbsup","users":["U111","U222"]}]
    files          JSON,                   -- Attached file metadata
    blocks         JSON,                   -- Block Kit structured content
    PRIMARY KEY (workspace_id, channel_id, ts),
    INDEX idx_thread (workspace_id, channel_id, thread_ts)  -- thread lookups
) ENGINE=InnoDB;

-- Channel membership (denormalized for fast lookups)
CREATE TABLE channel_members (
    workspace_id   VARCHAR(12) NOT NULL,
    channel_id     VARCHAR(12) NOT NULL,
    user_id        VARCHAR(12) NOT NULL,
    joined_at      BIGINT NOT NULL,
    last_read_ts   VARCHAR(20),           -- For unread tracking
    is_muted       BOOLEAN DEFAULT FALSE,
    notification_pref VARCHAR(20) DEFAULT 'default',
    PRIMARY KEY (workspace_id, channel_id, user_id),
    INDEX idx_user_channels (workspace_id, user_id)
) ENGINE=InnoDB;

Thread Support

Threads are implemented as messages with a non-null thread_ts that references the parent message:

-- Thread query pattern:
-- Load all replies in a thread
SELECT * FROM messages
WHERE workspace_id = 'T12345'
  AND channel_id = 'C98765'
  AND thread_ts = '1714567890.000100'
ORDER BY ts ASC;

-- This query is efficient because (workspace_id, channel_id, ts)
-- is the primary key, and thread_ts is indexed (idx_thread).

File Handling

Slack processes billions of file uploads — images, PDFs, code snippets, videos. The file pipeline must handle upload, processing, storage, and delivery at scale.

File Pipeline

File Upload Pipeline:

Client → Upload API (multipart/form-data)
           │
           ├── 1. Validate (file size, type, workspace quota)
           ├── 2. Generate unique file ID and S3 key
           ├── 3. Stream to Amazon S3 (multi-part upload for large files)
           ├── 4. Write metadata to MySQL:
           │      { file_id, workspace_id, user_id, channel_id,
           │        filename, mimetype, size_bytes, s3_key,
           │        upload_ts, is_public }
           ├── 5. Return file URL to client (via CDN)
           │
           └── 6. Enqueue async processing jobs:
                  ├── "generate_thumbnail" (images/videos → multiple sizes)
                  ├── "virus_scan" (ClamAV scan)
                  ├── "extract_text" (PDFs/docs → plain text for search)
                  ├── "generate_preview" (code syntax highlighting)
                  └── "transcode_video" (if video, convert to streamable format)

Storage Tiers:
  • Hot (S3 Standard): Recent files, frequently accessed
  • Warm (S3 IA): Files older than 90 days
  • Cold (S3 Glacier): Enterprise retention archive
  • CDN (CloudFront): Thumbnails, previews — cached at edge

URL Pattern:
  https://files.slack.com/{workspace_id}/{file_id}/{filename}
  ↓ CloudFront resolves → S3 origin with signed URL (time-limited access)

File Access Security

File URLs are never bare object links: CloudFront only serves a file with a valid, time-limited signed URL, which Slack issues after verifying the requester’s session and channel access (a file shared in a private channel is downloadable only by that channel’s members), so copied links expire instead of leaking.
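
The time-limited access in the URL pattern above maps naturally onto presigned URLs. A sketch using boto3 (the bucket name and key layout are assumptions based on the pattern shown):

# Sketch: issuing a short-lived file URL after access checks pass
import boto3  # pip install boto3

s3 = boto3.client("s3")

def file_download_url(workspace_id: str, file_id: str, filename: str) -> str:
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "slack-files",
                "Key": f"{workspace_id}/{file_id}/{filename}"},
        ExpiresIn=300,  # the link stops working after 5 minutes
    )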

Enterprise Features

Enterprise Grid is Slack’s tier for the largest organizations. It adds critical features for compliance, security, and administration at scale.

Data Retention Policies

Retention Policy Engine:

┌──────────────────────────────────────────────────┐
│           Retention Policy Configuration          │
│                                                   │
│  Workspace-level defaults:                        │
│    • Keep all messages: forever                   │
│    • Delete messages older than: 1y / 2y / custom │
│    • Delete files older than: 90d / 1y / custom   │
│                                                   │
│  Channel-level overrides:                         │
│    • #legal-hold: retain forever (override)       │
│    • #temp-project: delete after 30 days          │
│                                                   │
│  User-level (DMs):                                │
│    • Follow workspace default or custom policy    │
└──────────────────────┬───────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────┐
│          Retention Worker (daily cron)            │
│                                                   │
│  For each workspace with retention policies:      │
│    1. Query messages older than retention period   │
│    2. Check legal hold status (skip if held)       │
│    3. Soft-delete (mark is_deleted = true)         │
│    4. Remove from search index                     │
│    5. After grace period, hard-delete from MySQL   │
│    6. Remove files from S3                         │
│    7. Log deletion for compliance audit trail      │
└──────────────────────────────────────────────────┘
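
A condensed sketch of the worker's soft-delete pass (steps 1-3; table and column names follow the message schema shown earlier, and the legal-hold lookup is illustrative):

# Sketch: daily retention pass, soft-deleting while skipping legal holds
import time

def retention_pass(conn, workspace_id: str, retention_days: int,
                   held_channels: set[str]) -> None:
    cutoff = f"{time.time() - retention_days * 86_400:.6f}"  # ts-format string
    cur = conn.cursor()
    cur.execute(
        "SELECT channel_id, ts FROM messages "
        "WHERE workspace_id = %s AND ts < %s AND is_deleted = FALSE",
        (workspace_id, cutoff))
    for channel_id, ts in cur.fetchall():
        if channel_id in held_channels:  # legal hold overrides retention
            continue
        cur.execute(
            "UPDATE messages SET is_deleted = TRUE "
            "WHERE workspace_id = %s AND channel_id = %s AND ts = %s",
            (workspace_id, channel_id, ts))
    conn.commit()
    # Steps 4-7 (de-index, hard delete after grace, S3 cleanup, audit log)
    # would be enqueued here as async jobs.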

Compliance Exports

Regulated organizations (finance, healthcare, government) need to export all Slack data for legal discovery. Slack’s compliance export system runs these exports asynchronously: an org admin schedules an export scope, a background job walks every in-scope conversation (on Enterprise Grid this can include private channels and DMs), and the result is packaged as a downloadable archive whose creation is recorded in the audit log.

Enterprise Key Management (EKM)

EKM gives customers control over their encryption keys, which live in the customer’s own AWS KMS account. This means Slack can encrypt and decrypt customer content only for as long as the customer permits it:

EKM Architecture:

┌──────────────────────────────────────────────────────────────┐
│                     Customer's AWS Account                    │
│                                                               │
│   ┌───────────────────────────┐                               │
│   │    AWS KMS                │                               │
│   │    Customer Master Key    │◀── Customer controls access   │
│   │    (CMK)                  │    via IAM policies            │
│   └────────────┬──────────────┘                               │
└────────────────┼─────────────────────────────────────────────┘
                 │ Slack requests key access
                 ▼
┌──────────────────────────────────────────────────────────────┐
│                     Slack's Infrastructure                    │
│                                                               │
│   Message/File encryption flow:                               │
│   1. Slack generates a data encryption key (DEK) per object   │
│   2. DEK encrypts the message/file content                    │
│   3. DEK itself is encrypted by the customer's CMK (KMS)      │
│   4. Encrypted DEK stored alongside the encrypted content     │
│                                                               │
│   Decryption flow:                                            │
│   1. Slack retrieves the encrypted DEK                        │
│   2. Calls customer's KMS to decrypt the DEK                  │
│   3. Uses decrypted DEK to decrypt the message/file           │
│   4. DEK is never stored in plaintext, only held in memory    │
│                                                               │
│   Customer kill switch:                                       │
│   • Revoke Slack's IAM access to KMS                          │
│   • All data becomes immediately unreadable                   │
│   • Slack cannot decrypt any messages or files                │
│   • Used as emergency data access revocation                  │
└──────────────────────────────────────────────────────────────┘
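
The DEK/CMK dance above is standard envelope encryption. A sketch with boto3's KMS client and AES-GCM (the key ARN and storage shape are illustrative):

# Sketch: envelope encryption with a customer-held KMS key
import os
import boto3                                              # pip install boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # cryptography

kms = boto3.client("kms")
CMK_ARN = "arn:aws:kms:us-east-1:111122223333:key/customer-cmk"  # illustrative

def encrypt_object(plaintext: bytes) -> dict:
    # Steps 1-2: fresh DEK per object; encrypt content locally with it.
    dek = kms.generate_data_key(KeyId=CMK_ARN, KeySpec="AES_256")
    nonce = os.urandom(12)
    ciphertext = AESGCM(dek["Plaintext"]).encrypt(nonce, plaintext, None)
    # Steps 3-4: persist only the KMS-encrypted DEK next to the ciphertext.
    return {"ciphertext": ciphertext, "nonce": nonce,
            "encrypted_dek": dek["CiphertextBlob"]}

def decrypt_object(obj: dict) -> bytes:
    # If the customer revokes Slack's IAM access to KMS, this call fails
    # and the data stays unreadable: that is the kill switch.
    dek = kms.decrypt(CiphertextBlob=obj["encrypted_dek"])["Plaintext"]
    return AESGCM(dek).decrypt(obj["nonce"], obj["ciphertext"], None)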

Incident Management

Slack has experienced several high-profile outages, and its incident management and postmortem culture is considered industry-leading.

Incident Response Process

Slack Incident Lifecycle:

Detection (automated):
  • Datadog monitors → latency spikes, error rate increases
  • Synthetic canary tests → health check failures
  • Customer reports → status page auto-correlation
      │
      ▼
Triage (0–5 min):
  • PagerDuty alerts on-call engineer
  • Incident Commander (IC) role assigned
  • Severity classified: SEV-1 (full outage) → SEV-4 (minor degradation)
  • War room opened (ironically, in Slack — or backup: Zoom bridge)
      │
      ▼
Mitigation (5–60 min):
  • IC coordinates parallel investigation streams
  • Common mitigations attempted:
    - Rollback last deployment (if correlated)
    - Drain traffic from unhealthy region
    - Disable feature flag that may be causing issues
    - Scale up capacity if load-related
    - Failover to replica database if primary is unhealthy
      │
      ▼
Resolution:
  • Root cause identified and fixed
  • Monitoring confirms metrics return to normal
  • Customer-facing status page updated
      │
      ▼
Postmortem (within 72 hours):
  • Blameless postmortem document written
  • Timeline reconstruction (minute-by-minute)
  • Root cause analysis (5 Whys)
  • Action items with owners and due dates
  • Shared across engineering org for learning

Postmortem Culture

Slack’s postmortem process is blameless: the focus is on systemic improvements, not individual blame. Key principles: engineers are assumed to have acted in good faith with the information they had; the question is always “what allowed this to happen?”, never “who caused it?”; and every postmortem produces action items with owners and due dates, as in the lifecycle above.

Deployment Strategy

Slack deploys to production multiple times per day with a sophisticated deployment pipeline designed to minimize blast radius.

Canary Deployments

Canary Deployment Pipeline:

┌─────────┐    ┌───────────┐    ┌──────────────┐    ┌───────────┐
│  Build   │───▶│  Staging   │───▶│    Canary     │───▶│  Full     │
│  & Test  │    │  Deploy    │    │  (1% traffic) │    │  Rollout  │
└─────────┘    └───────────┘    └──────────────┘    └───────────┘
                                       │
                                       ▼
                                ┌──────────────┐
                                │  Automated    │
                                │  Health Check │
                                │              │
                                │  • Error rate │
                                │  • Latency    │
                                │  • CPU/Memory │
                                │  • Business   │
                                │    metrics    │
                                └──────┬───────┘
                                       │
                              Pass?  ──┤── Fail?
                              │        │
                              ▼        ▼
                         Expand   Auto-rollback
                         to 5%    (revert to
                         → 25%    previous
                         → 100%   version)
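
The pass/fail gate compares the canary's vitals against the stable baseline. A simplified sketch of such a check (the thresholds are illustrative):

# Sketch: canary gate deciding between "expand" and "auto-rollback"
def canary_healthy(canary: dict, baseline: dict) -> bool:
    return (
        canary["error_rate"] <= baseline["error_rate"] * 1.5    # no error spike
        and canary["p99_latency_ms"] <= baseline["p99_latency_ms"] * 1.2
        and canary["cpu_util"] < 0.85                           # headroom left
    )

canary = {"error_rate": 0.004, "p99_latency_ms": 210, "cpu_util": 0.62}
baseline = {"error_rate": 0.003, "p99_latency_ms": 190}
action = "expand" if canary_healthy(canary, baseline) else "auto-rollback"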

Feature Flags

Slack uses feature flags extensively to decouple deployment from release:

Feature Flag System:

Flag Definition:
{
  "flag_name": "new_message_composer_v2",
  "default": false,
  "rules": [
    // Slack employees: always on (dogfood)
    { "workspace_ids": ["T012SLACK"], "value": true },
    // Beta testers: on
    { "user_segment": "beta_testers", "value": true },
    // Enterprise tier: gradual rollout 25%
    { "plan": "enterprise", "percentage": 25 },
    // Everyone else: off
    { "default": false }
  ],
  "kill_switch": true,
  "owner": "team-messaging",
  "created": "2026-03-15"
}

Evaluation at Runtime:
  1. Request hits API server with user context (workspace, user, plan)
  2. Flag evaluator checks rules top-to-bottom
  3. First matching rule determines flag value
  4. Kill switch: set to false for ALL users instantly (no deploy needed)

Flag Categories:
  • Release flags: Gate new features during gradual rollout
  • Ops flags: Circuit breakers, capacity controls
  • Experiment flags: A/B tests with metric tracking
  • Permission flags: Feature entitlements by plan tier
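
A first-match evaluator for a rule list like the one above could look like this sketch (the rule shapes follow the JSON; hashing makes percentage rollouts sticky per workspace):

# Sketch: runtime flag evaluation, first matching rule wins
import hashlib

def evaluate(flag: dict, ctx: dict) -> bool:
    for rule in flag["rules"]:
        if "workspace_ids" in rule and ctx["workspace_id"] in rule["workspace_ids"]:
            return rule["value"]
        if "user_segment" in rule and ctx.get("segment") == rule["user_segment"]:
            return rule["value"]
        if "plan" in rule and ctx.get("plan") == rule["plan"]:
            # Stable bucket 0-99 per (flag, workspace): a workspace stays
            # in or out of the rollout across requests.
            digest = hashlib.md5(
                f"{flag['flag_name']}:{ctx['workspace_id']}".encode()).hexdigest()
            return int(digest, 16) % 100 < rule["percentage"]
        if "default" in rule:
            return rule["default"]
    return flag["default"]

# evaluate(flag, {"workspace_id": "T012SLACK"}) -> True via the dogfood rule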

Gradual Rollout Strategy

New features follow a multi-stage rollout: internal dogfooding on Slack’s own workspace, then opt-in beta testers, then a percentage-based ramp (often segmented by plan tier), and finally 100% of users, mirroring the flag rules shown above.


Key Takeaways