Design: Stock Exchange
Introduction
A stock exchange is the backbone of global financial markets. Every day, billions of dollars change hands through electronic exchanges like the NYSE, NASDAQ, and the London Stock Exchange. At its heart, an exchange does something deceptively simple: it matches buyers with sellers. But doing this at sub-millisecond latency, for millions of orders per day, with perfect accuracy and complete auditability, is one of the most demanding system design challenges in existence.
Modern exchanges process peak throughputs of over 100,000 messages per second with median latencies under 50 microseconds. A single bit flip, a single out-of-order message, or a single lost trade can trigger regulatory investigations and cost millions. There is no room for "eventual consistency" in an exchange — every order, every match, every cancellation must be deterministic, auditable, and exactly-once.
In this post, we'll design a stock exchange from scratch, covering order types, order book mechanics, the matching engine, sequencing, market data distribution, risk management, and the FIX protocol. We'll also dive deep into the latency optimization techniques that separate real trading systems from typical distributed systems.
Requirements
Functional Requirements
- Place orders — participants can submit buy and sell orders (market, limit, stop) for listed instruments
- Cancel / modify orders — participants can cancel open orders or modify quantity/price
- Match buy and sell orders — the engine matches compatible orders using price-time priority
- Market data feed — broadcast real-time price updates, order book depth, and trade reports to all subscribers
- Trade reporting — generate confirmations for matched trades and report to clearing/settlement systems
- Risk checks — pre-trade validation of account balances, position limits, and regulatory constraints
- Support multiple instruments — the exchange lists thousands of symbols (e.g., AAPL, GOOG, TSLA)
Non-Functional Requirements
| Metric | Target |
|---|---|
| Matching latency (p99) | < 1 ms (top exchanges target < 100 μs) |
| Throughput | Millions of orders/day (100K+ messages/sec peak) |
| Availability | 99.99% during market hours (6.5 hrs/day for US equities) |
| Correctness | Zero tolerance for lost orders, duplicate executions, or reordering |
| Fairness | All participants see the same data at the same time; FIFO ordering at each price level |
| Auditability | Every event must be logged and replayable for regulatory compliance |
Order Types
Before designing the matching engine, we need to understand the different kinds of orders that participants can place. Each order type has different matching semantics and edge cases.
Market Order
A market order says: "I want to buy (or sell) N shares immediately at the best available price." Market orders prioritize execution speed over price. They are guaranteed to execute (as long as there is liquidity on the other side), but the execution price may be worse than expected if the order book is thin.
Market Buy Order:
Symbol: AAPL
Side: BUY
Quantity: 100 shares
Price: N/A (take the best available ask price)
Execution: Immediately matches against the lowest-priced
sell orders in the order book.
Risk: In illiquid markets, a large market order can "walk the book" — consuming multiple price levels and resulting in a much worse average execution price than expected. This is called slippage.
Limit Order
A limit order says: "I want to buy (or sell) N shares at price P or better." A limit buy order will only execute at price ≤ P. A limit sell order will only execute at price ≥ P. If no matching counterparty exists, the order rests in the order book until filled, cancelled, or expired.
Limit Buy Order:
Symbol: AAPL
Side: BUY
Quantity: 100 shares
Price: $150.00 (max price willing to pay)
Execution: Matches against sell orders priced ≤ $150.00.
If no match, rests in the order book at $150.00.
Limit Sell Order:
Symbol: AAPL
Side: SELL
Quantity: 50 shares
Price: $151.00 (min price willing to accept)
Execution: Matches against buy orders priced ≥ $151.00.
If no match, rests in the order book at $151.00.
Limit orders form the backbone of the order book. The vast majority of orders on any exchange are limit orders — they provide liquidity by indicating willingness to trade at specific prices.
Stop Order
A stop order is a conditional order: "When the market price reaches P, convert this into a market order." Stop orders are used for risk management — a stop-loss sell order, for example, automatically sells a position if the price drops below a threshold.
Stop-Loss Sell Order:
Symbol: AAPL
Side: SELL
Quantity: 100 shares
Stop Price: $145.00
Behavior: When AAPL trades at or below $145.00, this
becomes a market sell order and executes
at the best available bid price.
Stop-Limit Sell Order:
Symbol: AAPL
Side: SELL
Quantity: 100 shares
Stop Price: $145.00
Limit Price: $144.50
Behavior: When AAPL trades at or below $145.00, this
becomes a limit sell order at $144.50.
Time-in-Force Qualifiers
Every order also has a time-in-force (TIF) qualifier that specifies how long it remains active:
- Day — valid until end of trading session (most common)
- GTC (Good-Till-Cancelled) — remains active across sessions until filled or cancelled
- IOC (Immediate-or-Cancel) — execute as much as possible immediately, cancel the rest
- FOK (Fill-or-Kill) — execute the entire quantity immediately, or cancel the entire order
- GTD (Good-Till-Date) — remains active until a specified date
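The TIF qualifiers above change only what happens to unfilled quantity around the immediate matching pass. A minimal sketch of that branching in Rust (enum and function names are illustrative, not from any real exchange API):

```rust
// Illustrative sketch: how TIF affects unfilled quantity.
// TimeInForce / MatchOutcome are invented names for this example.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum TimeInForce { Day, Gtc, Ioc, Fok, Gtd }

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum MatchOutcome { RestRemainder, CancelRemainder, RejectWhole }

/// `fillable_qty` is how much could match immediately against the book.
/// A real engine checks FOK feasibility *before* executing any fills, so
/// a partially fillable FOK order trades nothing at all.
pub fn after_match(tif: TimeInForce, order_qty: u64, fillable_qty: u64) -> MatchOutcome {
    match tif {
        // FOK: entire quantity immediately, or nothing.
        TimeInForce::Fok if fillable_qty < order_qty => MatchOutcome::RejectWhole,
        // IOC (and a fully fillable FOK): nothing ever rests in the book.
        TimeInForce::Ioc | TimeInForce::Fok => MatchOutcome::CancelRemainder,
        // Day / GTC / GTD: the remainder rests (expiry is handled elsewhere).
        TimeInForce::Day | TimeInForce::Gtc | TimeInForce::Gtd => MatchOutcome::RestRemainder,
    }
}
```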
Additional Order Types
- Iceberg / Reserve — only a portion of the total quantity is visible in the order book; the rest is hidden and replenished as the visible portion fills
- Pegged — price automatically adjusts to track the best bid/ask (e.g., midpoint peg)
- All-or-None (AON) — must be filled entirely in a single match, or not at all
The Order Book
The order book is the central data structure of any exchange. It maintains all outstanding (resting) limit orders for a given instrument, organized by price and arrival time. There is one order book per symbol.
Structure
An order book has two sides:
- Buy side (bids) — sorted in descending order by price. The highest bid is at the top because buyers want to pay as little as possible, but the most aggressive buyer (highest price) gets priority.
- Sell side (asks/offers) — sorted in ascending order by price. The lowest ask is at the top because sellers want to receive as much as possible, but the most aggressive seller (lowest price) gets priority.
Order Book: AAPL
═══════════════════════════════════════════════
BIDS (Buy) │ ASKS (Sell)
─────────────────────────┼─────────────────────
Price Qty Orders │ Price Qty Orders
$150.10 300 [3 ord] │ $150.20 150 [2 ord]
$150.00 500 [5 ord] │ $150.30 200 [1 ord]
$149.90 200 [2 ord] │ $150.50 400 [4 ord]
$149.80 100 [1 ord] │ $150.60 100 [1 ord]
═══════════════════════════════════════════════
Best Bid: $150.10 │ Best Ask: $150.20
Spread: $0.10 (0.07%)
═══════════════════════════════════════════════
The spread is the difference between the best bid and the best ask. A tight spread indicates a liquid market with active trading. The spread is never negative on a well-functioning exchange — if the best bid exceeds the best ask, those orders would immediately match.
Price Levels
At each price level, multiple orders may exist. Within a price level, orders are arranged in FIFO (first-in, first-out) order — the order that arrived first gets filled first. This combination of price priority and time priority is called price-time priority (or strict FIFO matching).
Price Level: $150.10 (Buy side)
───────────────────────────────────────
Seq# Time Qty Trader
1042 09:30:01 100 FirmA
1087 09:30:02 150 FirmB
1103 09:30:03 50 FirmC
───────────────────────────────────────
Total: 300 shares across 3 orders
When a sell order arrives that matches at $150.10:
→ FirmA's order fills first (arrived earliest)
→ Then FirmB, then FirmC
Data Structure Implementation
The order book must support several operations efficiently:
| Operation | Frequency | Target |
|---|---|---|
| Find best bid/ask | Every incoming order | O(1) |
| Add order at price level | Every new limit order | O(1) amortized |
| Remove order by ID | Every cancel/fill | O(1) |
| Walk price levels (matching) | Every crossing order | O(levels consumed) |
| Remove empty price level | When last order at a level fills | O(log P) |
A common implementation uses:
- Sorted map (red-black tree or skip list) keyed by price → gives O(log P) insert/delete of price levels and O(1) access to best price
- Doubly-linked list at each price level → O(1) append (new order) and O(1) removal (cancel) with pointer from order ID
- Hash map from order ID → order node → O(1) lookup for cancels and modifications
struct OrderBook {
    bids: BTreeMap<Price, PriceLevel>,  // best bid = highest key (iterate in reverse)
    asks: BTreeMap<Price, PriceLevel>,  // best ask = lowest key
    orders: HashMap<OrderId, *mut OrderNode>,  // O(1) lookup for cancels
}

struct PriceLevel {
    price: Price,
    total_quantity: u64,
    head: *mut OrderNode,  // first order (FIFO front)
    tail: *mut OrderNode,  // last order (FIFO back)
}

struct OrderNode {
    order_id: OrderId,
    quantity: u64,
    timestamp: u64,
    prev: *mut OrderNode,
    next: *mut OrderNode,
}
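The raw-pointer layout above mirrors the intrusive linked lists used in C/C++ engines. For illustration, here is a safe-Rust sketch of the same access pattern using standard collections — slower in constant factors, but it shows the shape. A VecDeque stands in for the FIFO list, prices are integer ticks (cents) to avoid floating point, and all names are illustrative:

```rust
use std::collections::{BTreeMap, VecDeque};

type Price = u64;   // price in ticks (e.g. cents) — never floating point
type OrderId = u64;

#[derive(Debug)]
struct RestingOrder { id: OrderId, qty: u64 }

/// One order book side per map. BTreeMap iterates ascending, so the best
/// bid is the *last* (highest) key and the best ask the *first* (lowest).
#[derive(Default)]
struct Book {
    bids: BTreeMap<Price, VecDeque<RestingOrder>>, // FIFO queue per level
    asks: BTreeMap<Price, VecDeque<RestingOrder>>,
}

impl Book {
    /// New limit order joins the back of the FIFO queue at its price level.
    fn add(&mut self, is_buy: bool, price: Price, id: OrderId, qty: u64) {
        let side = if is_buy { &mut self.bids } else { &mut self.asks };
        side.entry(price).or_default().push_back(RestingOrder { id, qty });
    }
    fn best_bid(&self) -> Option<Price> { self.bids.keys().next_back().copied() }
    fn best_ask(&self) -> Option<Price> { self.asks.keys().next().copied() }
}
```

A real engine adds the order-ID hash map for O(1) cancels and avoids per-order heap allocation via pooling, as discussed later.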
The Matching Engine
The matching engine is the heart of the exchange. It receives orders, applies them to the order book, and generates trades (executions) when buy and sell orders are compatible. The matching engine for a given symbol is a single-threaded, deterministic state machine — this is a critical design decision.
Price-Time Priority Algorithm
The most common matching algorithm is price-time priority (also called strict FIFO). The rules are simple:
- Price priority: The best-priced order on the opposite side gets matched first. For a buy, this means the lowest-priced sell. For a sell, this means the highest-priced buy.
- Time priority: Among orders at the same price, the order that arrived first gets matched first (FIFO).
- Crossing condition: A buy order matches with a sell order when the buy price ≥ the sell price.
Matching Algorithm (Pseudocode)
function match(incoming_order, order_book):
    if incoming_order.side == BUY:
        opposite_book = order_book.asks   // ascending price
    else:
        opposite_book = order_book.bids   // descending price

    remaining_qty = incoming_order.quantity

    while remaining_qty > 0 and opposite_book is not empty:
        best_level = opposite_book.first()   // best price level

        // Check crossing condition at the best opposite price
        if incoming_order.side == BUY:
            crosses = incoming_order is MARKET
                      or incoming_order.price >= best_level.price
        else:  // SELL
            crosses = incoming_order is MARKET
                      or incoming_order.price <= best_level.price
        if not crosses:
            break   // no match possible; stop walking the book

        // Walk the FIFO queue at this price level
        while remaining_qty > 0 and best_level is not empty:
            resting_order = best_level.head              // oldest order
            fill_qty = min(remaining_qty, resting_order.quantity)
            fill_price = best_level.price                // resting order's price

            // Generate execution report
            emit Trade(buy_order_id, sell_order_id, symbol,
                       fill_qty, fill_price, timestamp)

            remaining_qty -= fill_qty
            resting_order.quantity -= fill_qty
            if resting_order.quantity == 0:
                remove resting_order from level and order map

        if best_level is empty:
            remove best_level from opposite_book

    // Handle remaining quantity
    if remaining_qty > 0:
        if incoming_order is LIMIT:
            add incoming_order (with remaining_qty) to same-side book
        else:  // MARKET order with unfilled quantity
            cancel remainder (or reject, per exchange policy)

    return trades
Execution Price Rule
When two orders match, which price is used? The standard rule is: the resting order's price. The order that was already in the book (passive) determines the trade price, not the incoming (aggressive) order. This incentivizes providing liquidity — your limit order will fill at the price you specified, never worse.
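This rule is small enough to state in code. A hedged sketch with integer tick prices (function names are illustrative):

```rust
/// Crossing condition: a buy crosses when its limit is at or above the
/// resting price; a sell crosses when its limit is at or below it.
pub fn crosses(incoming_is_buy: bool, incoming_limit: u64, resting_price: u64) -> bool {
    if incoming_is_buy {
        incoming_limit >= resting_price
    } else {
        incoming_limit <= resting_price
    }
}

/// Execution price rule: the trade prints at the resting (passive)
/// order's price, regardless of how aggressive the incoming order was.
pub fn execution_price(_incoming_limit: u64, resting_price: u64) -> u64 {
    resting_price // the passive side sets the price
}
```

An aggressive buy limit at $150.20 hitting a resting ask at $150.15 prints at $150.15 — the aggressor gets price improvement, and the passive order never fills worse than its stated price.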
Alternative Matching Algorithms
While price-time priority is the most common, some venues use alternative algorithms:
- Pro-rata — at the same price level, quantity is allocated proportionally to order size rather than FIFO. Used in some options and futures exchanges (e.g., CME for some contracts).
- Price-size-time — larger orders at the same price get priority over smaller ones. Rarely used due to fairness concerns.
- Price-time with allocation minimum — pro-rata with a minimum fill size to prevent tiny allocations.
The Sequencer
In a stock exchange, ordering is everything. If two participants submit orders at nearly the same time, who gets priority? The answer must be deterministic, reproducible, and fair. This is the sequencer's job.
Why We Need a Sequencer
Consider the problem: multiple gateway servers receive orders from different participants over the network. Each gateway has a slightly different local clock. Network jitter means packets arrive at different times. Without a central sequencing mechanism, there is no single truth about the order of events.
The sequencer solves this by assigning a monotonically increasing sequence number to every incoming message before it reaches the matching engine. This creates a single, totally ordered stream of events.
Without Sequencer:
Gateway A receives Order X at T=09:30:00.001234
Gateway B receives Order Y at T=09:30:00.001235
Which arrived first? Depends on clock sync accuracy.
With Sequencer:
Sequencer assigns Order X → Seq #10042
Sequencer assigns Order Y → Seq #10043
Order X is processed first. Period. Deterministic.
Sequencer Design
The sequencer is a simple but critical component:
- Receives messages from all gateways
- Assigns a globally unique, monotonically increasing sequence number
- Writes the sequenced message to a durable log
- Forwards the sequenced message to the matching engine
The sequencer must be single-threaded — a global sequence number requires a single point of serialization. This sounds like a bottleneck, but in practice a well-optimized sequencer can stamp millions of messages per second on modern hardware.
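The stamping step itself is tiny — a counter plus an append. A minimal single-threaded sketch, where an in-memory Vec stands in for the durable log and the struct names are illustrative:

```rust
/// A sequenced message: the global sequence number plus the payload.
#[derive(Debug, PartialEq)]
pub struct Sequenced { pub seq: u64, pub payload: String }

/// Single-threaded sequencer: stamp, persist, then the caller forwards
/// the message to the matching engine. The Vec is a stand-in for a
/// durable append-only log (mmap'd file, replicated stream, etc.).
pub struct Sequencer { next_seq: u64, pub log: Vec<Sequenced> }

impl Sequencer {
    pub fn new() -> Self { Sequencer { next_seq: 1, log: Vec::new() } }

    /// Assign the next monotonically increasing sequence number and
    /// append to the log. Returns the assigned number.
    pub fn stamp(&mut self, payload: &str) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.log.push(Sequenced { seq, payload: payload.to_string() });
        seq
    }
}
```

Because a single thread owns the counter, no atomic operations or locks are needed — which is exactly why the sequencer can stamp millions of messages per second.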
Deterministic Replay
The sequencer enables one of the most powerful properties of exchange architecture: deterministic replay. Since every message has a sequence number and the matching engine is a deterministic state machine, you can replay the entire sequence of events and reconstruct the exact state of the order book at any point in time.
Replay Guarantee:
Given the same sequence of inputs (with sequence numbers),
the matching engine will produce the exact same sequence
of outputs (trades, order book states).
This enables:
✓ Disaster recovery — replay from the log to rebuild state
✓ Regulatory audit — reconstruct the market at any timestamp
✓ Testing — replay production sequences in test environments
✓ Hot standby — a secondary matching engine replays the
same stream and can take over instantly on failure
High Availability
The sequencer is a single point of failure (by design — it must be single-threaded). To achieve high availability:
- Primary-backup replication: The primary sequencer synchronously replicates each sequenced message to a hot standby before acknowledging. On failure, the standby takes over from the last replicated sequence number.
- Raft/Paxos for leader election: A consensus protocol can manage the failover, but adds latency to every message. Some exchanges accept this trade-off; others use simpler heartbeat-based failover.
- Dual-write: Write every message to two independent logs. On failover, the new primary reads from the log and continues from the last sequence number.
High-Level Architecture
The end-to-end architecture of a stock exchange follows a strict pipeline:
                        ┌──────────────┐
Participants ──────────▶│   Gateways   │
(FIX clients)           │ (Validate &  │
                        │ Authenticate)│
                        └──────┬───────┘
                               │
                               ▼
                        ┌──────────────┐
                        │  Sequencer   │
                        │ (Assign seq# │
                        │  + persist)  │
                        └──────┬───────┘
                               │
                        ┌──────┴───────┐
                        │              │
                        ▼              ▼
                 ┌────────────┐ ┌────────────┐
                 │ Risk Check │ │  Matching  │
                 │   Engine   │ │   Engine   │
                 │(Pre-trade) │ │(Per-symbol)│
                 └────────────┘ └──────┬─────┘
                                       │
                        ┌──────────────┼──────────────┐
                        ▼              ▼              ▼
                 ┌────────────┐ ┌────────────┐ ┌────────────┐
                 │   Market   │ │   Trade    │ │ Reporting  │
                 │  Data Pub  │ │  Confirm   │ │ / Clearing │
                 └────────────┘ └────────────┘ └────────────┘
Component Breakdown
1. Gateway
The gateway is the entry point for all client connections. It handles:
- Protocol translation: Parses FIX messages (or proprietary binary protocols) from clients
- Authentication: Verifies client identity and session keys
- Basic validation: Checks message format, required fields, symbol validity
- Rate limiting: Prevents a single participant from overwhelming the system
- Connection management: Manages TCP sessions, heartbeats, and reconnection logic
Multiple gateway instances run in parallel for scalability. Each client connects to one gateway, but gateways are stateless — clients can reconnect to any gateway after a failure.
2. Sequencer
As described above, the sequencer assigns a global ordering to all messages. It sits between the gateways and the matching engine, creating a single stream of sequenced events.
3. Risk Check Engine
Before an order reaches the matching engine, it must pass pre-trade risk checks:
- Account balance: Does the buyer have sufficient funds? Does the seller have the shares?
- Position limits: Would this order exceed the firm's maximum position in this instrument?
- Order size limits: Is the order quantity within acceptable bounds?
- Price collars: Is the order price within a reasonable range of the current market price? (Prevents fat-finger errors)
- Regulatory limits: Short-selling restrictions, circuit breakers, etc.
Risk checks must be fast — they're in the critical path. Some exchanges perform risk checks before sequencing (at the gateway), others after sequencing but before matching. The trade-off is between latency and consistency.
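As one example of a check in this critical path, a price collar is a couple of integer operations. A sketch with illustrative names — real venues set collar widths per instrument and session phase:

```rust
/// Pre-trade price collar: accept only orders priced within `collar_bps`
/// basis points of the last trade (fat-finger protection).
/// Prices are in ticks (cents); 500 bps = 5%.
pub fn within_collar(order_price: u64, last_trade: u64, collar_bps: u64) -> bool {
    let band = last_trade * collar_bps / 10_000;
    order_price >= last_trade.saturating_sub(band) && order_price <= last_trade + band
}
```

With a $150.00 last trade and a 5% collar, orders between $142.50 and $157.50 pass; anything outside is rejected before it can touch the book.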
4. Matching Engine
The matching engine runs the price-time priority algorithm described earlier. A critical design decision is: one matching engine instance per symbol. This avoids the need for distributed state, locks, or consensus within the matching logic.
Symbol Partitioning:
AAPL → Matching Engine #1 (Single thread)
GOOG → Matching Engine #2 (Single thread)
TSLA → Matching Engine #3 (Single thread)
AMZN → Matching Engine #4 (Single thread)
...
Each matching engine:
✓ Single-threaded — no locks, no contention
✓ Deterministic — same inputs → same outputs
✓ In-memory — entire order book in RAM
✓ Pinned to a CPU core — avoids context switches
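Routing a symbol to its engine can be as simple as a hash, though production exchanges typically use a static symbol-to-engine assignment table balanced by expected volume. A sketch:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route a symbol to one of `num_engines` single-threaded matching
/// engines. Hash-based routing is one possible scheme; the key property
/// is that every message for a given symbol lands on the same engine.
pub fn engine_for(symbol: &str, num_engines: u64) -> u64 {
    let mut h = DefaultHasher::new();
    symbol.hash(&mut h);
    h.finish() % num_engines
}
```

Because ordering only matters *within* a symbol, this partitioning scales horizontally without any cross-engine coordination.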
5. Market Data Publisher
After every match or order book change, the market data publisher broadcasts updates to all subscribers. This component is discussed in detail below.
6. Trade Confirmation & Reporting
Executed trades generate confirmations sent back to both buyer and seller. Trade reports are also sent to:
- Clearing house: For trade settlement (T+1 or T+2)
- Regulatory systems: For market surveillance and audit trails
- Drop copy feeds: Backup trade reports for compliance systems
Market Data
Market data is the real-time feed of price and order information that the exchange broadcasts to all participants. It's how traders know what's happening in the market. There are different levels of market data, each with increasing detail.
Level 1 Data (Top of Book)
Level 1 provides the most basic information:
- Best Bid: highest buy price + quantity
- Best Ask: lowest sell price + quantity
- Last Trade: price + quantity of the most recent execution
- Volume: total shares traded today
Level 1 Market Data (AAPL):
Best Bid: $150.10 × 300
Best Ask: $150.20 × 150
Last Trade: $150.15 × 50
Volume: 12,345,678
High: $151.20
Low: $149.80
Open: $150.00
Level 2 Data (Depth of Book)
Level 2 shows the full order book depth — all price levels with their aggregate quantities:
Level 2 Market Data (AAPL):
BIDS ASKS
$150.10 300 (3 orders) $150.20 150 (2 orders)
$150.00 500 (5 orders) $150.30 200 (1 order)
$149.90 200 (2 orders) $150.50 400 (4 orders)
$149.80 100 (1 order) $150.60 100 (1 order)
$149.70 350 (4 orders) $150.70 250 (3 orders)
Level 2 data is essential for algorithmic trading and market making. By seeing the full depth, traders can estimate how much liquidity is available at each price and predict how their orders will fill.
Level 3 Data (Full Order-by-Order)
Level 3 provides every individual order in the book (not just aggregated levels). This is the most granular data:
Level 3 Feed Messages:
ADD OrderID=1042 BUY 100@$150.10 09:30:01.234
ADD OrderID=1043 BUY 150@$150.10 09:30:01.456
ADD OrderID=1044 SELL 50@$150.20 09:30:01.789
TRADE OrderID=1045 vs 1044 50@$150.20
CANCEL OrderID=1042 BUY 100@$150.10
With Level 3 data, a subscriber can reconstruct the exact order book locally, including individual order positions within each price level.
Feed Handlers & Multicast UDP
Market data must be delivered to thousands of subscribers simultaneously with minimal latency. The standard approach uses IP multicast over UDP:
- Multicast: A single packet is sent to a multicast group address, and the network hardware (switches, routers) replicates it to all subscribers. This is far more efficient than sending unicast TCP to each subscriber individually.
- UDP: No connection setup, no acknowledgments, no retransmission — minimum latency. But UDP is unreliable: packets can be lost, duplicated, or reordered.
- Sequence numbers: Every market data message includes a sequence number. Subscribers detect gaps and request retransmissions from a recovery server (typically TCP-based).
Market Data Distribution:

Matching Engine → Market Data Publisher
                            │
                      Multicast UDP
                    (239.1.2.3:50001)
                            │
                 ┌──────────┼──────────┐
                 ▼          ▼          ▼
            Subscriber  Subscriber  Subscriber
             (FirmA)     (FirmB)     (FirmC)

Gap detected? → Request retransmit via TCP recovery
Multicast Benefits:
✓ Fair — all subscribers receive at the same time
✓ Efficient — one send, many receivers
✓ Low latency — no TCP handshake or ACK overhead
Multicast Risks:
✗ Packet loss (mitigated by sequence numbers + recovery)
✗ Out-of-order (mitigated by sequence numbers)
✗ Network congestion (mitigated by rate limiting)
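Subscriber-side gap detection is straightforward given per-message sequence numbers. A sketch (names illustrative; the returned gap would be forwarded to the TCP recovery server):

```rust
/// Tracks the next expected sequence number on a UDP market data feed
/// and reports any missing sequence numbers when a gap appears.
pub struct FeedTracker { next_expected: u64 }

impl FeedTracker {
    pub fn new() -> Self { FeedTracker { next_expected: 1 } }

    /// Process the sequence number of an arriving packet. Returns the
    /// missing sequence numbers (empty when the packet is in order).
    pub fn on_packet(&mut self, seq: u64) -> Vec<u64> {
        let gap: Vec<u64> = (self.next_expected..seq).collect();
        if seq >= self.next_expected {
            self.next_expected = seq + 1;
        } // else: stale duplicate — ignore it
        gap
    }
}
```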
Risk Management
Risk management in an exchange operates at multiple layers, from pre-trade checks on individual orders to market-wide circuit breakers.
Pre-Trade Risk Checks
Every order passes through risk validation before matching:
| Check | Description | Action on Failure |
|---|---|---|
| Buying power | Buyer has sufficient funds (cash + margin) | Reject order |
| Share availability | Seller has shares to sell (or is permitted to short-sell) | Reject order |
| Position limits | Order won't exceed maximum allowed position | Reject order |
| Price collar | Order price is within X% of last trade (fat-finger protection) | Reject order |
| Order rate limit | Participant hasn't exceeded max orders/second | Throttle or reject |
| Self-trade prevention | Order won't trade against same firm's resting order | Cancel one side |
Circuit Breakers
When prices move too far too fast, circuit breakers halt trading to prevent cascading crashes:
- LULD (Limit Up-Limit Down): Individual stock trading halts if the price moves beyond a band calculated from the recent average price. Band width varies by market cap and time of day (5%, 10%, or 20%).
- Market-wide circuit breakers: Trading in all stocks halts if the S&P 500 declines by 7% (Level 1), 13% (Level 2), or 20% (Level 3) from the previous day's close.
LULD Circuit Breaker (Example):
AAPL Reference Price: $150.00
Band Width: 5%
Upper Limit: $157.50
Lower Limit: $142.50
If AAPL trades at $142.49 → 15-second trading pause
After pause, new bands calculated
If still outside bands → extended halt
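The band arithmetic above is easy to sketch. This is simplified — the real LULD reference price is a rolling five-minute average, and bands double in the final minutes of trading — with prices in cents:

```rust
/// LULD band sketch: `band_pct` is the percentage band for the stock's
/// tier (5, 10, or 20). Returns (lower_limit, upper_limit) in cents.
pub fn luld_bands(reference_cents: u64, band_pct: u64) -> (u64, u64) {
    let band = reference_cents * band_pct / 100;
    (reference_cents - band, reference_cents + band)
}

/// A trade outside the bands triggers a trading pause.
pub fn triggers_halt(trade_cents: u64, bands: (u64, u64)) -> bool {
    trade_cents < bands.0 || trade_cents > bands.1
}
```

This reproduces the example above: a $150.00 reference with a 5% band gives limits of $142.50 and $157.50, and a print at $142.49 triggers the pause.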
Kill Switch
Every participant must implement a "kill switch" — the ability to immediately cancel all open orders and halt new order submission. The exchange provides a mass cancel API for this purpose. This is a regulatory requirement (SEC Rule 15c3-5) designed to prevent runaway algorithms from destabilizing the market.
The FIX Protocol
FIX (Financial Information eXchange) is the de facto standard protocol for electronic trading communication. Developed in 1992 by Fidelity Investments and Salomon Brothers, it is now used by virtually every exchange, broker, and institutional trader worldwide.
Message Structure
FIX messages are tag-value pairs separated by a delimiter (SOH character, ASCII 0x01). Each tag is a number defined in the FIX specification:
FIX New Order (Single) Message:
8=FIX.4.4 (BeginString: protocol version)
9=176 (BodyLength)
35=D (MsgType: D = New Order Single)
49=CLIENT_A (SenderCompID)
56=EXCHANGE (TargetCompID)
34=12 (MsgSeqNum)
52=20260415-09:30:01.234 (SendingTime)
11=ORD-20260415-001 (ClOrdID: client order ID)
55=AAPL (Symbol)
54=1 (Side: 1=Buy, 2=Sell)
38=100 (OrderQty)
40=2 (OrdType: 1=Market, 2=Limit)
44=150.00 (Price)
59=0 (TimeInForce: 0=Day)
10=142 (CheckSum)
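The CheckSum field (tag 10) is defined as the sum of every byte of the message up to and including the SOH that precedes the checksum field, modulo 256, rendered as exactly three digits. A sketch:

```rust
/// FIX CheckSum (tag 10): byte sum of the message through the SOH that
/// precedes the "10=" field, modulo 256, zero-padded to three digits.
pub fn fix_checksum(msg_through_last_soh: &[u8]) -> String {
    let sum: u32 = msg_through_last_soh.iter().map(|&b| b as u32).sum();
    format!("{:03}", sum % 256)
}
```

It is an error-detection code only — FIX relies on TCP and the session-layer sequence numbers for actual reliability.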
Key Message Types
| MsgType (Tag 35) | Name | Direction | Purpose |
|---|---|---|---|
| A | Logon | Both | Authenticate and start session |
| D | New Order Single | Client → Exchange | Submit a new order |
| F | Order Cancel Request | Client → Exchange | Request to cancel an open order |
| G | Order Cancel/Replace | Client → Exchange | Modify an open order (price/qty) |
| 8 | Execution Report | Exchange → Client | Order accepted, filled, cancelled, rejected |
| 9 | Order Cancel Reject | Exchange → Client | Cancel request rejected |
| 0 | Heartbeat | Both | Session keepalive |
| 5 | Logout | Both | End session gracefully |
Session Layer
FIX has a session layer that provides reliability guarantees on top of TCP:
- Sequence numbers: Every message has a monotonically increasing sequence number (Tag 34). Both sides track expected sequence numbers. A gap indicates a missed message.
- Heartbeats: Periodic heartbeat messages (Type 0) detect connection failures. If no message is received within the heartbeat interval, a Test Request (Type 1) is sent. No response → connection is considered dead.
- Resend Request: If a gap is detected (e.g., received seq 15 but expected 13), a Resend Request (Type 2) asks for retransmission of messages 13-14.
- Sequence Reset: Used during recovery to skip over administrative messages that don't need retransmission.
FIX Performance Considerations
While FIX is the industry standard, its tag-value text format is relatively slow to parse. High-frequency trading firms often use:
- FAST (FIX Adapted for Streaming) — binary encoding of FIX messages with field presence maps and delta compression. 5-10× more compact.
- SBE (Simple Binary Encoding) — fixed-length binary format designed for ultra-low-latency. No field delimiters, no tag numbers — just raw bytes at known offsets. Parsing is essentially a pointer cast.
- Proprietary binary protocols — some exchanges (e.g., NASDAQ OUCH, CME iLink) use custom binary protocols that are faster than FIX but require exchange-specific integration.
Encoding Comparison (New Order message):
FIX 4.4 text: ~200 bytes, parse time ~1-5 μs
FAST encoded: ~40 bytes, parse time ~0.5 μs
SBE encoded: ~60 bytes, parse time ~0.1 μs
Proprietary binary: ~30 bytes, parse time <0.1 μs
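To see why SBE parsing is "essentially a pointer cast", here is a sketch using an invented 17-byte layout (not any real exchange schema): each field is read from a fixed offset, with no tag scan and no delimiters.

```rust
use std::convert::TryInto;

// Invented layout for illustration only:
//   offset 0..8  : order_qty (u64, little-endian)
//   offset 8..16 : price     (u64 ticks, little-endian)
//   offset 16    : side      (1 = buy, 2 = sell)

pub fn encode(qty: u64, price: u64, side: u8) -> [u8; 17] {
    let mut b = [0u8; 17];
    b[0..8].copy_from_slice(&qty.to_le_bytes());
    b[8..16].copy_from_slice(&price.to_le_bytes());
    b[16] = side;
    b
}

// "Parsing" is just reading integers at known offsets.
pub fn decode_qty(buf: &[u8]) -> u64 {
    u64::from_le_bytes(buf[0..8].try_into().unwrap())
}
pub fn decode_price(buf: &[u8]) -> u64 {
    u64::from_le_bytes(buf[8..16].try_into().unwrap())
}
pub fn decode_side(buf: &[u8]) -> u8 { buf[16] }
```

Contrast this with FIX text, where every field requires scanning for the tag number, the `=`, and the SOH delimiter before the value can even be converted.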
Event Sourcing & Persistence
A stock exchange is a textbook use case for event sourcing. Instead of storing the current state of the order book, we store every event (order, cancel, trade) as an immutable log entry. The current state is derived by replaying events from the beginning.
Event Log
Every message processed by the exchange is persisted to an append-only event log:
Event Log (Sequenced):
Seq# Timestamp Type Data
10001 09:30:00.000123 NEW BUY 100@$150.10 (OrderID=A1)
10002 09:30:00.000456 NEW SELL 50@$150.20 (OrderID=A2)
10003 09:30:00.001234 NEW BUY 200@$150.20 (OrderID=A3)
10004 09:30:00.001234 TRADE A3 vs A2: 50@$150.20
10005 09:30:00.001234 PARTIAL A3: 150 remaining @$150.20
10006 09:30:00.002345 CANCEL A1 (BUY 100@$150.10)
10007 09:30:00.003456 NEW SELL 100@$150.15 (OrderID=A4)
10008 09:30:00.003456 TRADE A3 vs A4: 100@$150.15
10009 09:30:00.003456 PARTIAL A3: 50 remaining @$150.20 (rests in book)
Benefits of Event Sourcing
- Complete audit trail: Every order, modification, cancellation, and trade is recorded. Regulators can reconstruct the exact state of the market at any point.
- Deterministic replay: Feed the event log into a fresh matching engine → identical order book state. Used for disaster recovery, testing, and debugging.
- Time-travel queries: "What was the order book for AAPL at 10:42:35.123?" — replay events up to that timestamp.
- Hot standby: A secondary matching engine replays the same event stream in real-time. If the primary fails, the standby has identical state and can take over instantly.
Storage Implementation
The event log has unique requirements:
- Append-only: Events are never modified or deleted (regulatory requirement)
- Sequential write: All writes are sequential appends — perfect for spinning disks and SSDs alike
- High throughput: Must sustain 100K+ writes/second with durability (fsync)
- Low latency: Write must complete before the matching engine can proceed (in the critical path)
In practice, exchanges use:
- Memory-mapped files with periodic fsync — the OS kernel manages write-back, and the application writes at memory speed
- Kernel bypass I/O (io_uring, SPDK) for direct disk access without syscall overhead
- Battery-backed DRAM — writes go to battery-backed RAM at DRAM speed, then flushed to disk asynchronously. Provides durability guarantees with RAM latency.
- Replication before persist — some exchanges replicate the event to a remote standby before persisting to local disk. This provides network-level durability faster than disk fsync.
Snapshots
Replaying millions of events on startup is slow. To speed up recovery, the exchange periodically takes snapshots of the order book state:
Recovery:
1. Load latest snapshot (e.g., order book at Seq# 500000)
2. Replay events from Seq# 500001 onward
3. State is fully recovered
Snapshot frequency: every N events or every T seconds
Snapshot content: full order book state per symbol
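The recovery procedure above can be sketched as a fold over the log, skipping events already reflected in the snapshot. State here is reduced to price → aggregate quantity, and the event names are illustrative:

```rust
use std::collections::BTreeMap;

/// Minimal event type for the sketch: quantity added to or removed from
/// a price level (a real log carries full order/trade detail).
#[derive(Clone, Debug, PartialEq)]
pub enum Event {
    Add { price: u64, qty: u64 },
    Remove { price: u64, qty: u64 },
}

/// Restore the snapshot, then replay only the events sequenced *after*
/// it. Determinism of the engine guarantees the rebuilt state matches.
pub fn recover(
    snapshot: BTreeMap<u64, u64>, // book state as of `snapshot_seq`
    snapshot_seq: u64,
    log: &[(u64, Event)],         // (seq, event), ascending by seq
) -> BTreeMap<u64, u64> {
    let mut book = snapshot;
    for (seq, ev) in log {
        if *seq <= snapshot_seq { continue; } // already in the snapshot
        match ev {
            Event::Add { price, qty } => *book.entry(*price).or_insert(0) += qty,
            Event::Remove { price, qty } => {
                if let Some(q) = book.get_mut(price) {
                    *q = q.saturating_sub(*qty);
                    if *q == 0 { book.remove(price); } // drop empty level
                }
            }
        }
    }
    book
}
```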
Reporting & Clearing
Once a trade executes, it must flow through several downstream systems:
Trade Reporting
The reporting service provides:
- Trade confirmations: Sent to buyer and seller immediately after execution
- Regulatory reporting: Trades reported to the consolidated tape (SIP) for public dissemination
- Audit trail: Complete record for compliance, with CAT (Consolidated Audit Trail) reporting in the US
- Trade history: Queryable database for participants to review their execution history
Clearing & Settlement
After a trade is executed (matched), it must be cleared and settled:
- Clearing: The clearing house (e.g., DTCC in the US) steps in as the counterparty to both sides, ensuring that if one side defaults, the other is still made whole. The clearing house nets obligations across all trades.
- Settlement: Actual exchange of cash and securities. In the US, equity trades settle on T+1 (trade date plus one business day). The DTCC transfers shares from the seller's custodian to the buyer's custodian, and cash in the opposite direction.
Trade Lifecycle:
T+0 (Trade Date):
09:30:01 — Order submitted
09:30:01 — Order matched (trade executes)
09:30:01 — Execution report sent to both parties
09:30:01 — Trade reported to SIP/tape
18:00:00 — End of day: trades sent to clearing house
T+1 (Settlement Date):
Morning — Clearing house calculates net obligations
Afternoon — DTCC settles: shares and cash transfer
Post-Settlement:
Shares in buyer's account, cash in seller's account
Trade is final and irrevocable
Latency Optimization Techniques
The difference between a mediocre exchange and a world-class one often comes down to latency engineering. Here are the key techniques used by top exchanges to achieve sub-100-microsecond matching latency.
Hardware & OS Level
| Technique | Description | Latency Saving |
|---|---|---|
| CPU pinning | Pin the matching engine thread to a dedicated CPU core. Prevents context switches and keeps the working set in L1/L2 cache. | Eliminates ~5-15 μs context switch jitter |
| NUMA awareness | Ensure the matching engine's memory is allocated on the same NUMA node as its CPU core. Cross-node memory access adds ~100ns per access. | ~100ns per memory access saved |
| Huge pages | Use 2MB or 1GB huge pages to reduce TLB misses. Standard 4KB pages cause frequent TLB evictions for large working sets. | Reduces TLB miss latency |
| Kernel bypass networking | Use DPDK or Solarflare OpenOnload to bypass the kernel network stack entirely. Packets go directly from NIC to userspace via shared memory. | Saves ~10-50 μs per packet |
| Busy-polling | Instead of sleeping and being woken by interrupts, the matching engine thread busy-polls the network ring buffer. Wastes CPU but eliminates interrupt latency. | Saves ~5-10 μs interrupt wake-up |
| Disable frequency scaling | Fix CPU frequency at maximum (disable turbo boost and power saving). Frequency transitions cause latency spikes. | Eliminates ~50-100 μs frequency ramp |
Application Level
- Lock-free data structures: Use lock-free queues (e.g., LMAX Disruptor ring buffer) for inter-thread communication. No mutex contention, no priority inversion.
- Object pooling: Pre-allocate all objects (orders, trades, messages) at startup. Zero allocations in the hot path eliminates GC pauses (for JVM-based systems) or malloc latency.
- Cache-friendly data layout: Store order book data in contiguous arrays (struct-of-arrays) rather than pointer-heavy trees (array-of-structs with heap nodes). Cache misses are often the dominant cost.
- Branch prediction optimization: Arrange code so the common path (order doesn't cross the book) is predicted correctly. Use `likely()`/`unlikely()` compiler hints.
- Avoid syscalls: Every syscall (even `clock_gettime`) costs ~50-100ns due to the user-kernel mode switch. Use the `rdtsc` instruction for high-resolution timestamps.
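The object-pooling idea can be sketched briefly. This is a conceptual illustration in Python; a real hot path would use fixed-size structs in a systems language, not dicts:

```python
# Sketch of object pooling: allocate every order object once at startup,
# then recycle them so the hot path performs zero heap allocations.
class OrderPool:
    def __init__(self, capacity):
        # All allocation happens here, before trading begins.
        self._free = [
            {"order_id": None, "px": 0.0, "qty": 0} for _ in range(capacity)
        ]

    def acquire(self, order_id, px, qty):
        order = self._free.pop()           # O(1), no new allocation
        order["order_id"], order["px"], order["qty"] = order_id, px, qty
        return order

    def release(self, order):
        order["order_id"] = None           # scrub and return to the pool
        self._free.append(order)
```

Because acquire/release only move references in and out of a pre-built free list, a matching engine built this way never triggers GC or allocator activity mid-session.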
Network Level
- Co-location: Place trading servers in the same data center as the exchange (ideally the same rack). Reduces round-trip time from milliseconds to microseconds.
- Cut-through switching: Network switches that forward frames as soon as the destination MAC is read (before the entire frame is received). Saves ~1-5 μs per switch hop.
- FPGA-based NICs: Process FIX messages and perform risk checks directly in the NIC hardware (Xilinx/Intel FPGAs). The message is matched before it even reaches the CPU.
- Equal-length cables: All participants' cables to the exchange are the same length, ensuring no one has a latency advantage from being physically closer. This is a regulatory fairness requirement.
The LMAX Disruptor Pattern
The LMAX Disruptor is a high-performance inter-thread messaging pattern, originally developed for the LMAX exchange. It replaces traditional queues with a pre-allocated ring buffer:
LMAX Disruptor Ring Buffer:
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ ← Pre-allocated slots
└───┴───┴───┴───┴───┴───┴───┴───┘
▲ ▲
│ │
Consumer Producer
(Matching Engine) (Sequencer)
Key properties:
✓ Lock-free — uses CAS (compare-and-swap) for coordination
✓ Pre-allocated — no GC pressure, no allocation latency
✓ Cache-line aligned — each slot on its own cache line
✓ Mechanical sympathy — designed for how CPUs actually work
Performance:
✓ 100M+ messages/sec single-threaded
✓ Sub-microsecond latency per message
✓ No contention, no locks, no syscalls
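A single-producer/single-consumer version of the ring buffer can be sketched as follows. This Python toy conveys the structure (pre-allocated slots, sequence counters, power-of-two masking); the real Disruptor depends on CAS operations and memory barriers that Python does not expose:

```python
# Sketch of a Disruptor-style SPSC ring buffer: pre-allocated slots,
# head/tail sequence counters instead of locks.
class RingBuffer:
    def __init__(self, size):
        assert size & (size - 1) == 0, "size must be a power of two"
        self._slots = [None] * size    # pre-allocated once, never resized
        self._mask = size - 1          # cheap modulo via bitmask
        self._head = 0                 # next sequence to publish
        self._tail = 0                 # next sequence to consume

    def publish(self, event):
        if self._head - self._tail == len(self._slots):
            return False               # buffer full: producer must wait
        self._slots[self._head & self._mask] = event
        self._head += 1                # consumer sees the slot after this
        return True

    def consume(self):
        if self._tail == self._head:
            return None                # buffer empty
        event = self._slots[self._tail & self._mask]
        self._tail += 1
        return event
```

The sequence counters grow monotonically and are masked into slot indices, so producer and consumer never contend on shared mutable slots at the same index while the buffer invariant holds.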
Failure Modes & Recovery
An exchange must handle failures gracefully without losing or duplicating orders.
Gateway Failure
If a gateway crashes, clients reconnect to another gateway. FIX session recovery (resend request) ensures no messages are lost. The sequencer has already logged all messages from the failed gateway, so no state is lost.
Sequencer Failure
The hot standby sequencer takes over. It has replicated all messages up to the last confirmed sequence number, and new messages receive the next sequence number, so the matching engine sees a continuous stream.
Sequencer Failover:
Primary: Seq# 10001 ... 10042 ... 10043 [CRASH]
Standby: Seq# 10001 ... 10042 (replicated)
Standby promotes to primary, continues from Seq# 10043
Gap risk: if 10043 was assigned but not replicated
Mitigation: wait for in-flight messages to drain before
accepting new orders (brief pause)
Matching Engine Failure
The matching engine is deterministic. Recovery is straightforward:
- Start a new matching engine instance
- Load the latest snapshot
- Replay events from the event log after the snapshot
- Order book is reconstructed to the exact pre-failure state
- Resume processing new events from the sequencer
If a hot standby was running (replaying the same event stream in real-time), failover is nearly instantaneous — the standby already has the correct state.
Network Partition
Unlike typical distributed systems, an exchange does not face the CAP theorem dilemma in its core matching logic. The matching engine is a single node — there is no distributed state to partition. Network partitions affect:
- Client connectivity: Clients lose connection to gateways. They reconnect when the partition heals. Unmatched orders remain in the book.
- Market data distribution: Subscribers miss updates. Sequence numbers detect gaps; recovery servers fill them.
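The gap detection that subscribers run on the market data feed is simple in principle. A minimal sketch, assuming packets carry consecutive sequence numbers and arrive in order on the multicast channel:

```python
# Sketch of market data gap detection: find missing sequence-number
# ranges so the subscriber can request a gap fill from a recovery server.
def detect_gaps(received_seqs, expected_start):
    gaps = []
    expected = expected_start
    for seq in received_seqs:
        if seq > expected:
            gaps.append((expected, seq - 1))  # inclusive missing range
        expected = seq + 1
    return gaps
```

Each returned range maps directly to a retransmission request against the recovery server.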
Complete Order Flow
Let's trace a single order through the entire system to solidify the architecture:
Trader at FirmA submits: BUY 100 AAPL @ $150.10 (Limit)
1. CLIENT → GATEWAY
FirmA's trading system sends a FIX New Order Single (35=D)
over a persistent TCP connection to Gateway #2.
Gateway validates: format ✓, symbol ✓, auth ✓
Time: ~5 μs (network) + ~2 μs (validation)
2. GATEWAY → SEQUENCER
Gateway forwards the validated order to the Sequencer.
Sequencer assigns Seq# 50042, persists to event log.
Time: ~1 μs (sequencing) + ~2 μs (persist)
3. SEQUENCER → RISK ENGINE
Risk engine checks: buying power ✓, position limit ✓,
price collar ✓ ($150.10 is within 5% of last trade $150.05)
Time: ~3 μs
4. SEQUENCER → MATCHING ENGINE (AAPL)
The AAPL matching engine receives the order.
Current order book:
Best Ask: $150.05 × 200 (OrderID=B7, 200 shares)
Crossing condition: Buy $150.10 >= Sell $150.05? YES!
Match: 100 shares @ $150.05 (resting order's price)
Buyer gets price improvement: saved $0.05/share = $5.00
Time: ~0.5 μs (matching logic)
5. MATCHING ENGINE → OUTPUTS
a) Trade report: BUY A1 vs SELL B7, 100@$150.05
b) Execution report to FirmA: FILLED 100@$150.05
c) Execution report to FirmB: PARTIAL FILL 100@$150.05
(B7 has 100 shares remaining)
d) Market data update: new best ask, last trade
Time: ~1 μs (message generation)
6. MARKET DATA PUBLISHER → ALL SUBSCRIBERS
Multicast UDP: Last trade AAPL 100@$150.05
Multicast UDP: New top of book
Time: ~1-5 μs (multicast)
TOTAL: ~15-20 μs gateway-to-gateway (co-located)
Well under the 1ms target.
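The crossing check in step 4 can be sketched as a few lines of matching logic. This toy models only the incoming-buy case, with asks held as a sorted list; fills execute at the resting order's price, as in the trace:

```python
# Sketch of price-time priority matching for an incoming limit buy.
# asks: list of (price, order_id, qty), sorted best (lowest) price first;
# ties at a price level are ordered by arrival time (earlier first).
def match_incoming_buy(buy_px, buy_qty, asks):
    fills = []
    while buy_qty > 0 and asks and asks[0][0] <= buy_px:   # crossing?
        px, oid, qty = asks[0]
        traded = min(buy_qty, qty)
        fills.append((oid, traded, px))      # execute at the resting price
        buy_qty -= traded
        if traded == qty:
            asks.pop(0)                      # resting order fully filled
        else:
            asks[0] = (px, oid, qty - traded)  # partial fill, order rests
    return fills, buy_qty   # leftover qty would rest on the bid side
```

Running it on the book from the trace (best ask $150.05 × 200) fills the 100-share buy at $150.05 and leaves 100 shares of B7 resting, matching steps 4 and 5 above.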
Scaling Considerations
Horizontal Scaling via Symbol Partitioning
The primary scaling mechanism is symbol partitioning. Each symbol has its own matching engine running on a dedicated CPU core. A modern server with 64 cores can run 64 independent matching engines.
Scaling Model:
Server 1 (64 cores): AAPL, GOOG, MSFT, ... (64 symbols)
Server 2 (64 cores): AMZN, TSLA, META, ... (64 symbols)
Server 3 (64 cores): JPM, BAC, WFC, ... (64 symbols)
...
For 8,000 listed symbols at ~64 symbols/server:
~125 servers for matching engines
Most symbols are low-volume. The top 50 symbols generate
~80% of order flow. High-volume symbols get dedicated servers.
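The routing layer in front of the matching engines reduces to a static symbol-to-engine table. A minimal sketch; real exchanges use curated assignments (with hot symbols on dedicated servers) rather than a plain hash:

```python
# Sketch of symbol partitioning: every order for a symbol is routed to the
# same single-threaded matching engine. Hash-based assignment is stable
# only within one process run; production systems use a static config.
def partition_symbols(symbols, num_engines):
    return {s: hash(s) % num_engines for s in symbols}

def route(symbol, table):
    return table[symbol]   # e.g. every AAPL order -> the same engine
```

The crucial property is not which engine a symbol lands on but that the mapping never changes mid-session, since moving a symbol between engines would break sequential processing for its orders.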
Gateway Scaling
Gateways scale horizontally. Add more gateways to handle more client connections. Load balancing is done at the connection level — each client session is assigned to one gateway.
Market Data Scaling
Market data is the highest-bandwidth component. During peak activity, an exchange generates millions of market data messages per second. Scaling strategies:
- Symbol-based multicast groups: Different symbols on different multicast addresses. Subscribers only join groups for symbols they care about.
- Tiered data: Level 1 on one channel, Level 2 on another, Level 3 on a third. Subscribers choose their tier.
- Conflation: For slow consumers, aggregate multiple updates into a single message (e.g., send the latest state rather than every intermediate change).
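Conflation is straightforward to express: keep only the latest state per symbol within a burst of updates. A sketch, assuming updates are (symbol, state) pairs:

```python
# Sketch of conflation for slow consumers: a burst of updates collapses
# to one message per symbol carrying only the latest state.
def conflate(updates):
    latest = {}
    for symbol, state in updates:   # later updates overwrite earlier ones
        latest[symbol] = state
    return list(latest.items())     # first-seen symbol order is preserved
```

A slow consumer thus receives the current top of book rather than a backlog of stale intermediate quotes, trading update granularity for bounded bandwidth.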
Key Design Decisions Summary
| Decision | Choice | Rationale |
|---|---|---|
| Matching engine threading | Single-threaded per symbol | Eliminates locks, maximizes cache locality, enables deterministic replay |
| Scaling strategy | Symbol partitioning | Orders for the same symbol must be processed sequentially; different symbols are independent |
| Sequencing | Central sequencer with replication | Single source of truth for event ordering; enables deterministic replay |
| Persistence | Event sourcing with snapshots | Complete audit trail + fast recovery |
| Market data transport | Multicast UDP | Low latency, fair (simultaneous) delivery, efficient for one-to-many |
| Client protocol | FIX (with binary variants for speed) | Industry standard, broad client support |
| Matching algorithm | Price-time priority | Fair, simple, widely understood; standard for equity markets |
Final Summary
Designing a stock exchange is one of the most constrained system design problems: ultra-low latency, perfect correctness, deterministic processing, and complete auditability — all at massive scale. The key architectural principles are:
- Single-threaded matching engines per symbol — avoid distributed state; a single thread with data in cache is faster than any distributed system.
- Central sequencer — create a single, globally ordered stream of events. This enables deterministic replay, hot standby failover, and regulatory audit.
- Event sourcing — log every event immutably. The order book is a derived view, reconstructable from the log at any point in time.
- Multicast UDP for market data — fair, low-latency, one-to-many distribution. Sequence numbers handle reliability.
- FIX protocol — the industry standard for order management. Binary variants (SBE, FAST) for ultra-low-latency paths.
- Hardware optimization — CPU pinning, kernel bypass networking, huge pages, NUMA awareness, busy-polling. Every microsecond counts.
- Risk management at every layer — pre-trade checks, circuit breakers, kill switches. The exchange must protect itself and the market from errant algorithms.