High Level Design Series · Real-World Designs · Post 60 of 70

Design: Stock Exchange

Introduction

A stock exchange is the backbone of global financial markets. Every day, billions of dollars change hands through electronic exchanges like the NYSE, NASDAQ, and the London Stock Exchange. At its heart, an exchange does something deceptively simple: it matches buyers with sellers. But doing this at sub-millisecond latency, for millions of orders per day, with perfect accuracy and complete auditability, is one of the most demanding system design challenges in existence.

Modern exchanges process peak throughputs of over 100,000 messages per second with median latencies under 50 microseconds. A single bit flip, a single out-of-order message, or a single lost trade can trigger regulatory investigations and cost millions. There is no room for "eventual consistency" in an exchange — every order, every match, every cancellation must be deterministic, auditable, and exactly-once.

In this post, we'll design a stock exchange from scratch, covering order types, order book mechanics, the matching engine, sequencing, market data distribution, risk management, and the FIX protocol. We'll also dive deep into the latency optimization techniques that separate real trading systems from typical distributed systems.

Scope: We're designing a securities exchange — an electronic venue where participants place orders to buy or sell financial instruments (stocks, ETFs, options). We will focus on equities (stocks) for clarity, but the architecture generalizes to other instrument types.

Requirements

Functional Requirements

  1. Place orders — participants can submit buy and sell orders (market, limit, stop) for listed instruments
  2. Cancel / modify orders — participants can cancel open orders or modify quantity/price
  3. Match buy and sell orders — the engine matches compatible orders using price-time priority
  4. Market data feed — broadcast real-time price updates, order book depth, and trade reports to all subscribers
  5. Trade reporting — generate confirmations for matched trades and report to clearing/settlement systems
  6. Risk checks — pre-trade validation of account balances, position limits, and regulatory constraints
  7. Support multiple instruments — the exchange lists thousands of symbols (e.g., AAPL, GOOG, TSLA)

Non-Functional Requirements

  Metric                  Target
  Matching latency (p99)  < 1 ms (top exchanges target < 100 μs)
  Throughput              Millions of orders/day (100K+ messages/sec peak)
  Availability            99.99% during market hours (6.5 hrs/day for US equities)
  Correctness             Zero tolerance for lost orders, duplicate executions, or reordering
  Fairness                All participants see the same data at the same time; FIFO ordering at each price level
  Auditability            Every event must be logged and replayable for regulatory compliance

Why sub-millisecond matters: In modern markets, a 1 ms advantage translates directly to profit. High-frequency trading (HFT) firms invest hundreds of millions in infrastructure to shave microseconds off round-trip times. The exchange itself must be faster than its fastest participants, or it becomes the bottleneck.

Order Types

Before designing the matching engine, we need to understand the different kinds of orders that participants can place. Each order type has different matching semantics and edge cases.

Market Order

A market order says: "I want to buy (or sell) N shares immediately at the best available price." Market orders prioritize execution speed over price. They are guaranteed to execute (as long as there is liquidity on the other side), but the execution price may be worse than expected if the order book is thin.

Market Buy Order:
  Symbol:   AAPL
  Side:     BUY
  Quantity: 100 shares
  Price:    N/A (take the best available ask price)

Execution: Immediately matches against the lowest-priced
           sell orders in the order book.

Risk: In illiquid markets, a large market order can "walk the book" — consuming multiple price levels and resulting in a much worse average execution price than expected. This is called slippage.
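Slippage is easy to quantify. The sketch below (hypothetical price levels and an illustrative `walk_the_book` helper, not exchange code) computes the volume-weighted average price of a market buy that walks several levels:

```python
def walk_the_book(ask_levels, qty):
    """Fill a market buy of `qty` shares against (price, available) ask
    levels, best price first; returns (shares filled, avg fill price)."""
    filled, cost = 0, 0.0
    for price, available in sorted(ask_levels):
        take = min(qty - filled, available)
        filled += take
        cost += take * price
        if filled == qty:
            break
    return filled, (cost / filled if filled else None)

# A thin book: a 500-share market buy consumes three price levels.
asks = [(150.20, 150), (150.30, 200), (150.50, 400)]
filled, avg = walk_the_book(asks, 500)
print(filled, round(avg, 2))  # 500 150.33, well above the $150.20 best ask
```

The buyer expected something near $150.20 but paid $150.33 on average: that difference is the slippage.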

Limit Order

A limit order says: "I want to buy (or sell) N shares at price P or better." A limit buy order will only execute at price ≤ P. A limit sell order will only execute at price ≥ P. If no matching counterparty exists, the order rests in the order book until filled, cancelled, or expired.

Limit Buy Order:
  Symbol:   AAPL
  Side:     BUY
  Quantity: 100 shares
  Price:    $150.00 (max price willing to pay)

Execution: Matches against sell orders priced ≤ $150.00.
           If no match, rests in the order book at $150.00.

Limit Sell Order:
  Symbol:   AAPL
  Side:     SELL
  Quantity: 50 shares
  Price:    $151.00 (min price willing to accept)

Execution: Matches against buy orders priced ≥ $151.00.
           If no match, rests in the order book at $151.00.

Limit orders form the backbone of the order book. The vast majority of orders on any exchange are limit orders — they provide liquidity by indicating willingness to trade at specific prices.

Stop Order

A stop order is a conditional order: "When the market price reaches P, convert this into a market order." Stop orders are used for risk management — a stop-loss sell order, for example, automatically sells a position if the price drops below a threshold.

Stop-Loss Sell Order:
  Symbol:    AAPL
  Side:      SELL
  Quantity:  100 shares
  Stop Price: $145.00
  
Behavior:  When AAPL trades at or below $145.00, this
           becomes a market sell order and executes
           at the best available bid price.

Stop-Limit Sell Order:
  Symbol:    AAPL
  Side:      SELL
  Quantity:  100 shares
  Stop Price: $145.00
  Limit Price: $144.50

Behavior:  When AAPL trades at or below $145.00, this
           becomes a limit sell order at $144.50.

Time-in-Force Qualifiers

Every order also has a time-in-force (TIF) qualifier that specifies how long it remains active:

  DAY: expires at the end of the trading day (the default on most venues)
  GTC: Good-Til-Cancelled; rests until explicitly cancelled
  IOC: Immediate-or-Cancel; fill whatever is possible now, cancel the rest
  FOK: Fill-or-Kill; fill the entire quantity immediately or cancel it all

Additional Order Types

Beyond market, limit, and stop orders, many venues list variants such as iceberg orders (only part of the quantity is displayed in the book) and pegged orders (the price automatically tracks the best bid or ask).

The Order Book

The order book is the central data structure of any exchange. It maintains all outstanding (resting) limit orders for a given instrument, organized by price and arrival time. There is one order book per symbol.

Structure

An order book has two sides:

Order Book: AAPL
═══════════════════════════════════════════════
         BIDS (Buy)      │     ASKS (Sell)
─────────────────────────┼─────────────────────
  Price    Qty    Orders  │  Price    Qty    Orders
 $150.10   300   [3 ord]  │ $150.20   150   [2 ord]
 $150.00   500   [5 ord]  │ $150.30   200   [1 ord]
 $149.90   200   [2 ord]  │ $150.50   400   [4 ord]
 $149.80   100   [1 ord]  │ $150.60   100   [1 ord]
═══════════════════════════════════════════════
  Best Bid: $150.10       │ Best Ask: $150.20
  Spread: $0.10 (0.07%)
═══════════════════════════════════════════════

The spread is the difference between the best bid and the best ask. A tight spread indicates a liquid market with active trading. The spread is never negative on a well-functioning exchange — if the best bid exceeds the best ask, those orders would immediately match.

Price Levels

At each price level, multiple orders may exist. Within a price level, orders are arranged in FIFO (first-in, first-out) order — the order that arrived first gets filled first. This combination of price priority and time priority is called price-time priority (or strict FIFO matching).

Price Level: $150.10 (Buy side)
───────────────────────────────────────
  Seq#   Time       Qty   Trader
  1042   09:30:01   100   FirmA
  1087   09:30:02   150   FirmB
  1103   09:30:03    50   FirmC
───────────────────────────────────────
  Total: 300 shares across 3 orders

When a sell order arrives that matches at $150.10:
  → FirmA's order fills first (arrived earliest)
  → Then FirmB, then FirmC
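The FIFO behavior at a single price level can be sketched in a few lines of Python (`fill_at_level` is an illustrative helper, using the FirmA/FirmB/FirmC orders from the example above):

```python
from collections import deque

def fill_at_level(level, qty):
    """Consume resting orders at one price level in FIFO order.
    `level` is a deque of (order_id, quantity); returns the fills."""
    fills = []
    while qty > 0 and level:
        oid, resting = level[0]
        take = min(qty, resting)
        fills.append((oid, take))
        qty -= take
        if take == resting:
            level.popleft()                   # fully filled, leaves the queue
        else:
            level[0] = (oid, resting - take)  # partial fill keeps its priority
    return fills

level = deque([("FirmA", 100), ("FirmB", 150), ("FirmC", 50)])
print(fill_at_level(level, 200))  # [('FirmA', 100), ('FirmB', 100)]
print(list(level))                # [('FirmB', 50), ('FirmC', 50)]
```

Note that a partially filled order keeps its place at the front of the queue; only a full fill (or a cancel) removes it.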

Data Structure Implementation

The order book requires efficient operations on three axes:

  Operation                     Frequency                         Target
  Find best bid/ask             Every incoming order              O(1)
  Add order at price level      Every new limit order             O(1) amortized
  Remove order by ID            Every cancel/fill                 O(1)
  Walk price levels (matching)  Every crossing order              O(levels consumed)
  Remove empty price level      When last order at a level fills  O(log P)

A common implementation uses:

struct OrderBook {
    bids: BTreeMap<Price, PriceLevel>,  // descending
    asks: BTreeMap<Price, PriceLevel>,  // ascending
    orders: HashMap<OrderId, *mut OrderNode>,
}

struct PriceLevel {
    price: Price,
    total_quantity: u64,
    head: *mut OrderNode,   // first order (FIFO front)
    tail: *mut OrderNode,   // last order (FIFO back)
}

struct OrderNode {
    order_id: OrderId,
    quantity: u64,
    timestamp: u64,
    prev: *mut OrderNode,
    next: *mut OrderNode,
}

Why not a heap? A min-heap/max-heap seems natural for "get best price," but heaps don't support O(1) removal of arbitrary elements (needed for cancel operations). They also don't support walking levels in order efficiently. A sorted tree provides the best balance of all operations.

The Matching Engine

The matching engine is the heart of the exchange. It receives orders, applies them to the order book, and generates trades (executions) when buy and sell orders are compatible. The matching engine for a given symbol is a single-threaded, deterministic state machine — this is a critical design decision.

Price-Time Priority Algorithm

The most common matching algorithm is price-time priority (also called strict FIFO). The rules are simple:

  1. Price priority: The best-priced order on the opposite side gets matched first. For a buy, this means the lowest-priced sell. For a sell, this means the highest-priced buy.
  2. Time priority: Among orders at the same price, the order that arrived first gets matched first (FIFO).
  3. Crossing condition: A buy order matches with a sell order when the buy price ≥ the sell price.

Matching Algorithm (Pseudocode)

function match(incoming_order, order_book):
    if incoming_order.side == BUY:
        opposite_book = order_book.asks    // ascending price
    else:
        opposite_book = order_book.bids    // descending price

    remaining_qty = incoming_order.quantity
    trades = []

    while remaining_qty > 0 and opposite_book is not empty:
        best_level = opposite_book.first()  // best price level

        // Check crossing condition
        if incoming_order.side == BUY:
            if incoming_order is MARKET or incoming_order.price >= best_level.price:
                // Prices cross — match!
            else:
                break  // No match possible
        else:  // SELL
            if incoming_order is MARKET or incoming_order.price <= best_level.price:
                // Prices cross — match!
            else:
                break

        // Walk the FIFO queue at this price level
        while remaining_qty > 0 and best_level is not empty:
            resting_order = best_level.head  // oldest order

            fill_qty = min(remaining_qty, resting_order.quantity)
            fill_price = resting_order.price  // resting order's price

            // Generate execution report
            emit Trade(
                buy_order_id,
                sell_order_id,
                symbol,
                fill_qty,
                fill_price,
                timestamp
            )

            remaining_qty -= fill_qty
            resting_order.quantity -= fill_qty

            if resting_order.quantity == 0:
                remove resting_order from level and order map

        if best_level is empty:
            remove best_level from opposite_book

    // Handle remaining quantity
    if remaining_qty > 0:
        if incoming_order is LIMIT:
            add incoming_order (with remaining_qty) to same-side book
        else:  // MARKET order with unfilled quantity
            cancel remaining (or reject per exchange policy)

    return trades

Execution Price Rule

When two orders match, which price is used? The standard rule is: the resting order's price. The order that was already in the book (passive) determines the trade price, not the incoming (aggressive) order. This incentivizes providing liquidity — your limit order will fill at the price you specified, never worse.

Example: Buy limit order at $50.10 arrives. The best ask is $50.00. The trade executes at $50.00 (the resting order's price), which is actually better than the buyer expected. This is called price improvement.
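A minimal sketch of the matching loop and the resting-price rule (a Python stand-in with a list-based book; a real engine would use the tree structure described earlier):

```python
def match_buy(asks, limit_price, qty):
    """Match an incoming limit buy against `asks`, a best-price-first list
    of [price, quantity] levels. Trades print at the RESTING order's price."""
    trades = []
    while qty > 0 and asks and asks[0][0] <= limit_price:  # crossing condition
        price, available = asks[0]
        take = min(qty, available)
        trades.append((take, price))   # resting price, never the incoming one
        qty -= take
        if take == available:
            asks.pop(0)                # level exhausted, remove it
        else:
            asks[0][1] -= take
    return trades, qty                 # leftover qty would rest in the bid book

# Incoming buy limit $50.10 meets a best ask of $50.00: price improvement.
asks = [[50.00, 100], [50.25, 200]]
print(match_buy(asks, 50.10, 100))  # filled at $50.00, not $50.10
```

The crossing check and the fill price come from different orders: the incoming order decides *whether* a trade happens, the resting order decides *at what price*.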

Alternative Matching Algorithms

While price-time priority is the most common, some venues use alternative algorithms:

▶ Order Book Matching

[Interactive demo in the original post: a market buy order for 150 shares matches against the sell side of the order book using price-time priority.]

The Sequencer

In a stock exchange, ordering is everything. If two participants submit orders at nearly the same time, who gets priority? The answer must be deterministic, reproducible, and fair. This is the sequencer's job.

Why We Need a Sequencer

Consider the problem: multiple gateway servers receive orders from different participants over the network. Each gateway has a slightly different local clock. Network jitter means packets arrive at different times. Without a central sequencing mechanism, there is no single truth about the order of events.

The sequencer solves this by assigning a monotonically increasing sequence number to every incoming message before it reaches the matching engine. This creates a single, totally ordered stream of events.

Without Sequencer:
  Gateway A receives Order X at T=09:30:00.001234
  Gateway B receives Order Y at T=09:30:00.001235
  Which arrived first? Depends on clock sync accuracy.

With Sequencer:
  Sequencer assigns Order X → Seq #10042
  Sequencer assigns Order Y → Seq #10043
  Order X is processed first. Period. Deterministic.

Sequencer Design

The sequencer is a simple but critical component:

  1. Receives messages from all gateways
  2. Assigns a globally unique, monotonically increasing sequence number
  3. Writes the sequenced message to a durable log
  4. Forwards the sequenced message to the matching engine

The sequencer must be single-threaded — a global sequence number requires a single point of serialization. This sounds like a bottleneck, but in practice a well-optimized sequencer can stamp millions of messages per second on modern hardware.
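The four steps above can be sketched as a toy sequencer (the in-memory list stands in for the durable log; a real sequencer persists before forwarding, exactly as shown):

```python
import itertools

class Sequencer:
    """Single point of serialization: stamp, persist, then forward."""
    def __init__(self):
        self._counter = itertools.count(1)
        self.log = []                    # stands in for the durable event log

    def sequence(self, msg):
        stamped = (next(self._counter), msg)
        self.log.append(stamped)         # persist BEFORE forwarding downstream
        return stamped                   # handed to the matching engine

seq = Sequencer()
for msg in ["NEW BUY 100 AAPL", "NEW SELL 50 AAPL", "CANCEL A1"]:
    print(seq.sequence(msg))
# (1, ...), (2, ...), (3, ...): gapless, strictly increasing, deterministic
```

Because there is exactly one counter behind one thread, no two messages can ever receive the same number or arrive downstream out of order.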

Deterministic Replay

The sequencer enables one of the most powerful properties of exchange architecture: deterministic replay. Since every message has a sequence number and the matching engine is a deterministic state machine, you can replay the entire sequence of events and reconstruct the exact state of the order book at any point in time.

Replay Guarantee:
  Given the same sequence of inputs (with sequence numbers),
  the matching engine will produce the exact same sequence
  of outputs (trades, order book states).

This enables:
  ✓ Disaster recovery — replay from the log to rebuild state
  ✓ Regulatory audit — reconstruct the market at any timestamp
  ✓ Testing — replay production sequences in test environments
  ✓ Hot standby — a secondary matching engine replays the
    same stream and can take over instantly on failure

High Availability

The sequencer is a single point of failure (by design — it must be single-threaded). To achieve high availability, a hot standby sequencer replicates every sequenced message in real time; on failure it promotes itself and continues from the last confirmed sequence number (see Failure Modes & Recovery below).

▶ Sequencer Flow

[Interactive demo in the original post: gateways receive orders, the sequencer assigns global sequence numbers, and the matching engine processes them in order, enabling deterministic replay.]

High-Level Architecture

The end-to-end architecture of a stock exchange follows a strict pipeline:

                          ┌───────────────┐
   Participants ─────────▶│   Gateways    │
   (FIX clients)          │  (Validate &  │
                          │ Authenticate) │
                          └───────┬───────┘
                                 │
                                 ▼
                          ┌──────────────┐
                          │  Sequencer   │
                          │ (Assign seq# │
                          │  + persist)  │
                          └──────┬───────┘
                                 │
                          ┌──────┴───────┐
                          │              │
                          ▼              ▼
                    ┌────────────┐ ┌────────────┐
                    │ Risk Check │ │  Matching  │
                    │  Engine    │ │   Engine   │
                    │(Pre-trade) │ │(Per-symbol)│
                    └────────────┘ └──────┬─────┘
                                         │
                          ┌──────────────┼──────────────┐
                          ▼              ▼              ▼
                   ┌────────────┐ ┌────────────┐ ┌────────────┐
                   │  Market    │ │  Trade     │ │ Reporting  │
                   │  Data Pub  │ │ Confirm    │ │ / Clearing │
                   └────────────┘ └────────────┘ └────────────┘

Component Breakdown

1. Gateway

The gateway is the entry point for all client connections. It handles:

  1. FIX session management (logon, heartbeats, sequence number tracking)
  2. Authentication and entitlement checks
  3. Message validation (format, known symbols, sane quantities)
  4. Rate limiting per participant connection

Multiple gateway instances run in parallel for scalability. Each client connects to one gateway, but gateways are stateless — clients can reconnect to any gateway after a failure.

2. Sequencer

As described above, the sequencer assigns a global ordering to all messages. It sits between the gateways and the matching engine, creating a single stream of sequenced events.

3. Risk Check Engine

Before an order reaches the matching engine, it must pass pre-trade risk checks:

  1. Buying power and share availability
  2. Position limits
  3. Price collars (fat-finger protection)
  4. Order rate limits

Risk checks must be fast — they're in the critical path. Some exchanges perform risk checks before sequencing (at the gateway), others after sequencing but before matching. The trade-off is between latency and consistency.

4. Matching Engine

The matching engine runs the price-time priority algorithm described earlier. A critical design decision is: one matching engine instance per symbol. This avoids the need for distributed state, locks, or consensus within the matching logic.

Symbol Partitioning:
  AAPL  → Matching Engine #1 (Single thread)
  GOOG  → Matching Engine #2 (Single thread)
  TSLA  → Matching Engine #3 (Single thread)
  AMZN  → Matching Engine #4 (Single thread)
  ...

Each matching engine:
  ✓ Single-threaded — no locks, no contention
  ✓ Deterministic — same inputs → same outputs
  ✓ In-memory — entire order book in RAM
  ✓ Pinned to a CPU core — avoids context switches

Why single-threaded? Matching is inherently sequential — Order B's outcome depends on Order A's execution. Multi-threading would require locks on the order book, causing contention that increases latency. A single-threaded loop on a dedicated CPU core with the order book in L1/L2 cache achieves the lowest possible latency.

5. Market Data Publisher

After every match or order book change, the market data publisher broadcasts updates to all subscribers. This component is discussed in detail below.

6. Trade Confirmation & Reporting

Executed trades generate confirmations sent back to both buyer and seller. Trade reports are also sent to the consolidated tape (SIP) for public dissemination and to the clearing house for settlement.

Market Data

Market data is the real-time feed of price and order information that the exchange broadcasts to all participants. It's how traders know what's happening in the market. There are different levels of market data, each with increasing detail.

Level 1 Data (Top of Book)

Level 1 provides the most basic information:

Level 1 Market Data (AAPL):
  Best Bid:   $150.10 × 300
  Best Ask:   $150.20 × 150
  Last Trade: $150.15 × 50
  Volume:     12,345,678
  High:       $151.20
  Low:        $149.80
  Open:       $150.00

Level 2 Data (Depth of Book)

Level 2 shows the full order book depth — all price levels with their aggregate quantities:

Level 2 Market Data (AAPL):
  BIDS                      ASKS
  $150.10  300  (3 orders)  $150.20  150  (2 orders)
  $150.00  500  (5 orders)  $150.30  200  (1 order)
  $149.90  200  (2 orders)  $150.50  400  (4 orders)
  $149.80  100  (1 order)   $150.60  100  (1 order)
  $149.70  350  (4 orders)  $150.70  250  (3 orders)

Level 2 data is essential for algorithmic trading and market making. By seeing the full depth, traders can estimate how much liquidity is available at each price and predict how their orders will fill.

Level 3 Data (Full Order-by-Order)

Level 3 provides every individual order in the book (not just aggregated levels). This is the most granular data:

Level 3 Feed Messages:
  ADD   OrderID=1042  BUY  100@$150.10  09:30:01.234
  ADD   OrderID=1043  BUY  150@$150.10  09:30:01.456
  ADD   OrderID=1044  SELL  50@$150.20  09:30:01.789
  TRADE OrderID=1045 vs 1044  50@$150.20
  CANCEL OrderID=1042  BUY 100@$150.10

With Level 3 data, a subscriber can reconstruct the exact order book locally, including individual order positions within each price level.
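That reconstruction can be sketched by applying the Level 3 messages above to a local book (the message tuples are simplified for illustration):

```python
def apply_msg(book, msg):
    """Apply one Level 3 message to a local book: order_id -> (side, qty, price)."""
    kind = msg[0]
    if kind == "ADD":
        _, oid, side, qty, price = msg
        book[oid] = (side, qty, price)
    elif kind == "CANCEL":
        del book[msg[1]]
    elif kind == "TRADE":                # (TRADE, resting_order_id, qty)
        _, oid, qty = msg
        side, resting, price = book[oid]
        if resting == qty:
            del book[oid]                # resting order fully filled
        else:
            book[oid] = (side, resting - qty, price)

feed = [
    ("ADD", 1042, "BUY", 100, 150.10),
    ("ADD", 1043, "BUY", 150, 150.10),
    ("ADD", 1044, "SELL", 50, 150.20),
    ("TRADE", 1044, 50),                 # incoming order hit resting 1044
    ("CANCEL", 1042),
]
book = {}
for msg in feed:
    apply_msg(book, msg)
print(book)  # only order 1043 remains: {1043: ('BUY', 150, 150.1)}
```

Every subscriber running this same logic over the same sequenced feed ends up with an identical picture of the book.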

Feed Handlers & Multicast UDP

Market data must be delivered to thousands of subscribers simultaneously with minimal latency. The standard approach uses IP multicast over UDP:

Market Data Distribution:
  
  Matching Engine → Market Data Publisher
                         │
                    Multicast UDP
                    (239.1.2.3:50001)
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          Subscriber  Subscriber  Subscriber
          (FirmA)     (FirmB)     (FirmC)
          
  Gap detected? → Request retransmit via TCP recovery

  Multicast Benefits:
    ✓ Fair — all subscribers receive at the same time
    ✓ Efficient — one send, many receivers
    ✓ Low latency — no TCP handshake or ACK overhead
    
  Multicast Risks:
    ✗ Packet loss (mitigated by sequence numbers + recovery)
    ✗ Out-of-order (mitigated by sequence numbers)
    ✗ Network congestion (mitigated by rate limiting)

Fairness requirement: Regulators require that all participants receive market data at the same time. Multicast inherently provides this — the network delivers the same packet to everyone simultaneously. Using unicast TCP would give some subscribers data before others, which is unfair and potentially illegal.
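Gap detection on the subscriber side reduces to sequence-number bookkeeping. A simplified feed-handler sketch (real handlers also handle out-of-order and duplicate packets):

```python
class FeedHandler:
    """Track multicast sequence numbers; queue gaps for TCP retransmission."""
    def __init__(self):
        self.expected = 1                # next sequence number we should see
        self.gaps = []                   # sequence numbers to re-request

    def on_packet(self, seq):
        if seq > self.expected:          # one or more packets were dropped
            self.gaps.extend(range(self.expected, seq))
        self.expected = max(self.expected, seq + 1)

fh = FeedHandler()
for seq in [1, 2, 3, 6, 7]:              # packets 4 and 5 never arrive
    fh.on_packet(seq)
print(fh.gaps)  # [4, 5]: request these over the TCP recovery channel
```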

Risk Management

Risk management in an exchange operates at multiple layers, from pre-trade checks on individual orders to market-wide circuit breakers.

Pre-Trade Risk Checks

Every order passes through risk validation before matching:

  Check                  Description                                                Action on Failure
  Buying power           Buyer has sufficient funds (cash + margin)                 Reject order
  Share availability     Seller has shares to sell (or may short-sell)              Reject order
  Position limits        Order won't exceed maximum allowed position                Reject order
  Price collar           Order price within X% of last trade (fat-finger check)     Reject order
  Order rate limit       Participant hasn't exceeded max orders/second              Throttle or reject
  Self-trade prevention  Order won't trade against same firm's resting order        Cancel one side
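As one example, the price collar check fits in a few lines (the 5% band width is an illustrative assumption; real collars vary by venue and price tier):

```python
def passes_price_collar(order_price, last_trade, band_pct=0.05):
    """Fat-finger protection: accept only orders priced within band_pct
    of the last trade."""
    return (last_trade * (1 - band_pct)
            <= order_price
            <= last_trade * (1 + band_pct))

print(passes_price_collar(150.10, 150.05))  # True: inside the collar
print(passes_price_collar(160.00, 150.05))  # False: rejected as a likely error
```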

Circuit Breakers

When prices move too far too fast, circuit breakers halt trading to prevent cascading crashes:

LULD Circuit Breaker (Example):
  AAPL Reference Price: $150.00
  Band Width: 5%
  Upper Limit: $157.50
  Lower Limit: $142.50

  If AAPL trades at $142.49 → 15-second trading pause
  After pause, new bands calculated
  If still outside bands → extended halt
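The band arithmetic above is a small calculation (a sketch; real LULD band widths vary by security tier and time of day):

```python
def luld_bands(reference, band_pct=0.05):
    """Upper and lower LULD price bands around a reference price."""
    return reference * (1 - band_pct), reference * (1 + band_pct)

def on_trade(price, reference):
    lower, upper = luld_bands(reference)
    return "PAUSE" if price < lower or price > upper else "OK"

lower, upper = luld_bands(150.00)
print(round(lower, 2), round(upper, 2))   # 142.5 157.5
print(on_trade(142.49, 150.00))           # PAUSE: triggers the trading halt
print(on_trade(150.75, 150.00))           # OK: inside the bands
```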

Kill Switch

Every participant must implement a "kill switch" — the ability to immediately cancel all open orders and halt new order submission. The exchange provides a mass cancel API for this purpose. This is a regulatory requirement (SEC Rule 15c3-5) designed to prevent runaway algorithms from destabilizing the market.

The FIX Protocol

FIX (Financial Information eXchange) is the de facto standard protocol for electronic trading communication. Developed in 1992 by Fidelity Investments and Salomon Brothers, it is now used by virtually every exchange, broker, and institutional trader worldwide.

Message Structure

FIX messages are tag-value pairs separated by a delimiter (SOH character, ASCII 0x01). Each tag is a number defined in the FIX specification:

FIX New Order (Single) Message:
  8=FIX.4.4        (BeginString: protocol version)
  9=176             (BodyLength)
  35=D              (MsgType: D = New Order Single)
  49=CLIENT_A       (SenderCompID)
  56=EXCHANGE       (TargetCompID)
  34=12             (MsgSeqNum)
  52=20260415-09:30:01.234  (SendingTime)
  11=ORD-20260415-001  (ClOrdID: client order ID)
  55=AAPL           (Symbol)
  54=1              (Side: 1=Buy, 2=Sell)
  38=100            (OrderQty)
  40=2              (OrdType: 1=Market, 2=Limit)
  44=150.00         (Price)
  59=0              (TimeInForce: 0=Day)
  10=142            (CheckSum)
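Tag 10 is computable directly: the checksum is the sum of every byte of the message preceding the checksum field, modulo 256, rendered as exactly three digits. A minimal sketch (the message fields are illustrative):

```python
SOH = "\x01"  # FIX field delimiter (ASCII 0x01)

def fix_checksum(msg_bytes):
    """FIX tag 10: sum of all preceding bytes mod 256, zero-padded to 3 digits."""
    return f"{sum(msg_bytes) % 256:03d}"

print(fix_checksum(b"A\x01"))  # 066  (65 + 1 = 66, zero-padded)

# Appending tag 10 to a (simplified) message:
body = SOH.join(["35=0", "49=CLIENT_A", "56=EXCHANGE"]) + SOH
msg = f"8=FIX.4.4{SOH}9={len(body)}{SOH}{body}".encode("ascii")
print(msg + f"10={fix_checksum(msg)}{SOH}".encode("ascii"))
```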

Key Message Types

  MsgType (Tag 35)  Name                  Direction           Purpose
  A                 Logon                 Both                Authenticate and start session
  D                 New Order Single      Client → Exchange   Submit a new order
  F                 Order Cancel Request  Client → Exchange   Request to cancel an open order
  G                 Order Cancel/Replace  Client → Exchange   Modify an open order (price/qty)
  8                 Execution Report      Exchange → Client   Order accepted, filled, cancelled, rejected
  9                 Order Cancel Reject   Exchange → Client   Cancel request rejected
  0                 Heartbeat             Both                Session keepalive
  5                 Logout                Both                End session gracefully

Session Layer

FIX has a session layer that provides reliability guarantees on top of TCP: every message carries a sequence number (tag 34), heartbeats detect dead connections, and resend requests let either side recover messages lost across a disconnect, so a session can resume exactly where it left off.

FIX Performance Considerations

While FIX is the industry standard, its tag-value text format is relatively slow to parse. High-frequency trading firms often use binary encodings instead, such as FAST (FIX Adapted for Streaming), SBE (Simple Binary Encoding), or exchange-proprietary binary protocols:

Encoding Comparison (New Order message):
  FIX 4.4 text:     ~200 bytes, parse time ~1-5 μs
  FAST encoded:      ~40 bytes, parse time ~0.5 μs
  SBE encoded:       ~60 bytes, parse time ~0.1 μs
  Proprietary binary: ~30 bytes, parse time <0.1 μs
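An SBE-style fixed-layout encoding can be sketched with `struct` (the field layout here is invented for illustration, not the real SBE schema; note that prices are integers in ticks, avoiding floating point on the wire):

```python
import struct

# Fixed binary layout: side (u8), order type (u8), quantity (u32),
# price in ticks (u64), symbol padded to 8 bytes.
LAYOUT = struct.Struct("<BBIQ8s")

def encode(side, order_type, qty, price_ticks, symbol):
    return LAYOUT.pack(side, order_type, qty, price_ticks,
                       symbol.ljust(8).encode("ascii"))

def decode(buf):
    side, order_type, qty, ticks, sym = LAYOUT.unpack(buf)
    return side, order_type, qty, ticks, sym.decode("ascii").rstrip()

wire = encode(1, 2, 100, 1_500_000, "AAPL")   # buy, limit, 100 shares
print(len(wire))       # 22 bytes, vs ~200 for the tag-value text form
print(decode(wire))    # (1, 2, 100, 1500000, 'AAPL')
```

Because every field sits at a fixed offset, "parsing" is just reading integers out of a buffer, which is where the sub-microsecond decode times in the comparison come from.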

Event Sourcing & Persistence

A stock exchange is a textbook use case for event sourcing. Instead of storing the current state of the order book, we store every event (order, cancel, trade) as an immutable log entry. The current state is derived by replaying events from the beginning.

Event Log

Every message processed by the exchange is persisted to an append-only event log:

Event Log (Sequenced):
  Seq#   Timestamp            Type     Data
  10001  09:30:00.000123      NEW      BUY 100@$150.10 (OrderID=A1)
  10002  09:30:00.000456      NEW      SELL 50@$150.20 (OrderID=A2)
  10003  09:30:00.001234      NEW      BUY 200@$150.20 (OrderID=A3)
  10004  09:30:00.001234      TRADE    A3 vs A2: 50@$150.20
  10005  09:30:00.001234      PARTIAL  A3: 150 remaining @$150.20
  10006  09:30:00.002345      CANCEL   A1 (BUY 100@$150.10)
  10007  09:30:00.003456      NEW      SELL 100@$150.15 (OrderID=A4)
  10008  09:30:00.003456      TRADE    A3 vs A4: 100@$150.20 (resting A3's price)
  10009  09:30:00.003456      PARTIAL  A3: 50 remaining @$150.20

Benefits of Event Sourcing

  ✓ Complete audit trail: every state change is recorded, as regulators require
  ✓ Deterministic recovery: replay the log to rebuild the exact state
  ✓ Time travel: reconstruct the order book as of any sequence number
  ✓ Hot standby: a replica replays the same stream in real time

Storage Implementation

The event log has unique requirements: it must be strictly append-only, totally ordered by sequence number, durable before results are released downstream, and fast enough to sustain peak message rates without adding latency.

In practice, exchanges use append-only log files on local NVMe storage, often memory-mapped for write speed, replicated synchronously to a standby node before results are acknowledged.

Snapshots

Replaying millions of events on startup is slow. To speed up recovery, the exchange periodically takes snapshots of the order book state:

Recovery:
  1. Load latest snapshot (e.g., order book at Seq# 500000)
  2. Replay events from Seq# 500001 onward
  3. State is fully recovered

  Snapshot frequency: every N events or every T seconds
  Snapshot content: full order book state per symbol
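The snapshot-plus-tail-replay equivalence relies only on determinism, and can be demonstrated with a toy state (cumulative traded volume per symbol, an illustrative stand-in for full order book state):

```python
def apply_event(state, event):
    """Toy state machine: cumulative traded volume per symbol."""
    kind, symbol, qty = event
    if kind == "TRADE":
        state[symbol] = state.get(symbol, 0) + qty

events = [
    ("TRADE", "AAPL", 100), ("TRADE", "GOOG", 50), ("TRADE", "AAPL", 25),
    ("TRADE", "TSLA", 10), ("TRADE", "AAPL", 75),
]

full = {}                                # full replay from the first event
for e in events:
    apply_event(full, e)

snapshot = {}                            # snapshot taken after event 3...
for e in events[:3]:
    apply_event(snapshot, e)
recovered = dict(snapshot)               # ...then replay only the tail
for e in events[3:]:
    apply_event(recovered, e)

print(full == recovered)  # True: snapshot + tail replay is exact
```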

Reporting & Clearing

Once a trade executes, it must flow through several downstream systems:

Trade Reporting

The reporting service provides execution confirmations to both counterparties, real-time trade prints to the consolidated tape, end-of-day trade files for the clearing house, and a complete audit trail for regulators.

Clearing & Settlement

After a trade is executed (matched), it must be cleared and settled:

  1. Clearing: The clearing house (e.g., DTCC in the US) steps in as the counterparty to both sides, ensuring that if one side defaults, the other is still made whole. The clearing house nets obligations across all trades.
  2. Settlement: Actual exchange of cash and securities. In the US, equity trades settle on T+1 (trade date plus one business day). The DTCC transfers shares from the seller's custodian to the buyer's custodian, and cash in the opposite direction.

Trade Lifecycle:
  T+0 (Trade Date):
    09:30:01 — Order submitted
    09:30:01 — Order matched (trade executes)
    09:30:01 — Execution report sent to both parties
    09:30:01 — Trade reported to SIP/tape
    18:00:00 — End of day: trades sent to clearing house
    
  T+1 (Settlement Date):
    Morning  — Clearing house calculates net obligations
    Afternoon — DTCC settles: shares and cash transfer
    
  Post-Settlement:
    Shares in buyer's account, cash in seller's account
    Trade is final and irrevocable

Latency Optimization Techniques

The difference between a mediocre exchange and a world-class one often comes down to latency engineering. Here are the key techniques used by top exchanges to achieve sub-100-microsecond matching latency.

Hardware & OS Level

  CPU pinning: Pin the matching engine thread to a dedicated CPU core; prevents context switches and keeps the working set in L1/L2 cache. Saves ~5-15 μs of context-switch jitter.
  NUMA awareness: Allocate the matching engine's memory on the same NUMA node as its CPU core; cross-node access adds ~100 ns per memory reference. Saves ~100 ns per access.
  Huge pages: Use 2 MB or 1 GB huge pages to reduce TLB misses; standard 4 KB pages cause frequent TLB evictions for large working sets. Reduces TLB-miss latency.
  Kernel bypass networking: Use DPDK or Solarflare OpenOnload to bypass the kernel network stack entirely; packets go directly from NIC to userspace via shared memory. Saves ~10-50 μs per packet.
  Busy-polling: Instead of sleeping and being woken by interrupts, busy-poll the network ring buffer; wastes CPU but eliminates interrupt latency. Saves ~5-10 μs of wake-up time.
  Disable frequency scaling: Fix the CPU frequency at maximum (no turbo boost or power saving); frequency transitions cause latency spikes. Eliminates ~50-100 μs ramp-up spikes.

Application Level

  ✓ No allocation on the hot path: pre-allocate all objects at startup
  ✓ Cache-friendly layouts: flat arrays over pointer-chasing structures
  ✓ Integer arithmetic for prices (ticks), never floating point
  ✓ Single-writer, lock-free designs instead of mutex-protected shared state

Network Level

  ✓ Colocation: participants rent rack space inside the exchange data center
  ✓ Equalized cable lengths so no colocated participant has a physical edge
  ✓ Minimal hops and cut-through switching inside the exchange network

The LMAX Disruptor Pattern

The LMAX Disruptor is a high-performance inter-thread messaging pattern, originally developed for the LMAX exchange. It replaces traditional queues with a pre-allocated ring buffer:

LMAX Disruptor Ring Buffer:
  ┌───┬───┬───┬───┬───┬───┬───┬───┐
  │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ ← Pre-allocated slots
  └───┴───┴───┴───┴───┴───┴───┴───┘
    ▲                       ▲
    │                       │
  Consumer                Producer
  (Matching Engine)       (Sequencer)

Key properties:
  ✓ Lock-free — uses CAS (compare-and-swap) for coordination
  ✓ Pre-allocated — no GC pressure, no allocation latency
  ✓ Cache-line aligned — each slot on its own cache line
  ✓ Mechanical sympathy — designed for how CPUs actually work
  
Performance:
  ✓ 100M+ messages/sec single-threaded
  ✓ Sub-microsecond latency per message
  ✓ No contention, no locks, no syscalls

LMAX claimed their exchange could process 6 million orders per second on a single thread, with average latency under 1 microsecond. The key insight: a single well-optimized thread is faster than a multi-threaded system with locks, because lock contention and cache coherency traffic dominate at these scales.
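The core of the pattern, minus the cross-thread memory barriers, fits in a short sketch (a single-threaded demo; the real Disruptor coordinates producer and consumer across threads with CAS and sequence barriers):

```python
class RingBuffer:
    """Single-producer/single-consumer ring in the Disruptor style: slots
    are pre-allocated and cursors only ever move forward."""
    def __init__(self, size):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.slots = [None] * size       # pre-allocated, never resized
        self.mask = size - 1             # cheap modulo via bitwise AND
        self.produced = 0                # next sequence to write
        self.consumed = 0                # next sequence to read

    def publish(self, item):
        if self.produced - self.consumed == len(self.slots):
            return False                 # ring full; producer must back off
        self.slots[self.produced & self.mask] = item
        self.produced += 1
        return True

    def poll(self):
        if self.consumed == self.produced:
            return None                  # ring empty
        item = self.slots[self.consumed & self.mask]
        self.consumed += 1
        return item

rb = RingBuffer(8)
for i in range(10):                      # ten publishes wrap an 8-slot ring
    if not rb.publish(i):
        rb.poll()                        # consumer drains one slot
        rb.publish(i)
drained = []
while (item := rb.poll()) is not None:
    drained.append(item)
print(drained)  # [2, 3, 4, 5, 6, 7, 8, 9] (0 and 1 were drained mid-loop)
```

The power-of-two size lets the slot index be computed with a mask instead of a modulo, one of the "mechanical sympathy" details the pattern is known for.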

Failure Modes & Recovery

An exchange must handle failures gracefully without losing or duplicating orders.

Gateway Failure

If a gateway crashes, clients reconnect to another gateway. FIX session recovery (resend request) ensures no messages are lost. The sequencer has already logged all messages from the failed gateway, so no state is lost.

Sequencer Failure

The hot standby sequencer takes over. It has replicated all messages up to the last confirmed sequence number. New messages receive the next sequence number. Matching engine sees a continuous stream.

Sequencer Failover:
  Primary:  Seq# 10001 ... 10042 ... 10043 [CRASH]
  Standby:  Seq# 10001 ... 10042 (replicated)
  Standby promotes to primary, continues from Seq# 10043
  
  Gap risk: if 10043 was assigned but not replicated
  Mitigation: wait for in-flight messages to drain before
              accepting new orders (brief pause)

Matching Engine Failure

The matching engine is deterministic. Recovery is straightforward:

  1. Start a new matching engine instance
  2. Load the latest snapshot
  3. Replay events from the event log after the snapshot
  4. Order book is reconstructed to the exact pre-failure state
  5. Resume processing new events from the sequencer

If a hot standby was running (replaying the same event stream in real-time), failover is nearly instantaneous — the standby already has the correct state.
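The snapshot-plus-replay recovery above is easy to sketch because the state transition is deterministic. In this hedged example the "order book" is just a dict of open order quantities and `apply` is the deterministic transition function; event shapes are illustrative.

```python
# Deterministic recovery: load snapshot, replay events after it.
def apply(book, event):
    """Deterministic state transition for a toy order book (id -> open qty)."""
    kind = event["type"]
    if kind == "new":
        book[event["id"]] = event["qty"]
    elif kind == "cancel":
        book.pop(event["id"], None)
    elif kind == "fill":
        book[event["id"]] -= event["qty"]
        if book[event["id"]] == 0:
            del book[event["id"]]       # fully filled orders leave the book
    return book

def recover(snapshot, snapshot_seq, event_log):
    book = dict(snapshot)               # step 2: load the latest snapshot
    for seq, event in event_log:
        if seq > snapshot_seq:          # step 3: replay only the tail
            apply(book, event)
    return book                         # step 4: exact pre-failure state
```

Because replay is deterministic, a hot standby running the same `apply` over the same sequenced stream arrives at byte-identical state.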

Network Partition

Unlike typical distributed systems, an exchange does not face the CAP theorem dilemma in its core matching logic. The matching engine is a single node — there is no distributed state to partition. Network partitions affect only the surrounding infrastructure: client connectivity to gateways, market data distribution, and downstream clearing and reporting. A partitioned client simply cannot trade for the duration; the order book itself remains consistent.

Complete Order Flow

Let's trace a single order through the entire system to solidify the architecture:

Trader at FirmA submits: BUY 100 AAPL @ $150.10 (Limit)

1. CLIENT → GATEWAY
   FirmA's trading system sends a FIX New Order Single (35=D)
   over a persistent TCP connection to Gateway #2.
   Gateway validates: format ✓, symbol ✓, auth ✓
   Time: ~5 μs (network) + ~2 μs (validation)

2. GATEWAY → SEQUENCER
   Gateway forwards the validated order to the Sequencer.
   Sequencer assigns Seq# 50042, persists to event log.
   Time: ~1 μs (sequencing) + ~2 μs (persist)

3. SEQUENCER → RISK ENGINE
   Risk engine checks: buying power ✓, position limit ✓,
   price collar ✓ ($150.10 is within 5% of last trade $150.05)
   Time: ~3 μs

4. SEQUENCER → MATCHING ENGINE (AAPL)
   The AAPL matching engine receives the order.
   Current order book:
     Best Ask: $150.05 × 200 (OrderID=B7, 200 shares)
   
   Crossing condition: Buy $150.10 >= Sell $150.05? YES!
   Match: 100 shares @ $150.05 (resting order's price)
   Buyer gets price improvement: saved $0.05/share = $5.00
   Time: ~0.5 μs (matching logic)

5. MATCHING ENGINE → OUTPUTS
   a) Trade report: BUY A1 vs SELL B7, 100@$150.05
   b) Execution report to FirmA: FILLED 100@$150.05
   c) Execution report to FirmB: PARTIAL FILL 100@$150.05
      (B7 has 100 shares remaining)
   d) Market data update: new best ask, last trade
   Time: ~1 μs (message generation)

6. MARKET DATA PUBLISHER → ALL SUBSCRIBERS
   Multicast UDP: Last trade AAPL 100@$150.05
   Multicast UDP: New top of book
   Time: ~1-5 μs (multicast)

TOTAL: ~15-20 μs gateway-to-gateway (co-located)
       Well under the 1ms target.
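The crossing logic in step 4 can be sketched as price-time priority matching: an incoming limit buy fills against resting asks, best price first, at the resting order's price — which is exactly where the $0.05/share price improvement in the walkthrough comes from. This is a simplified sketch (one symbol, no stop orders, no self-trade prevention).

```python
# Minimal price-time priority match of an incoming limit buy against
# resting asks sorted by (price, arrival time), best price first.
def match_buy(qty, limit_price, asks):
    """asks: list of [price, qty, order_id]; mutated in place."""
    trades = []
    while qty > 0 and asks and asks[0][0] <= limit_price:   # crossing check
        price, avail, oid = asks[0]
        fill = min(qty, avail)
        trades.append((oid, fill, price))   # execute at the RESTING price
        qty -= fill
        if fill == avail:
            asks.pop(0)                     # resting order fully filled
        else:
            asks[0][1] -= fill              # partial fill, order keeps priority
    return trades, qty                      # qty > 0 -> remainder rests on the book
```

Running the walkthrough's numbers: a buy of 100 @ $150.10 against a best ask of 200 @ $150.05 produces one trade of 100 @ $150.05 and leaves 100 shares of B7 resting.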

Scaling Considerations

Horizontal Scaling via Symbol Partitioning

The primary scaling mechanism is symbol partitioning. Each symbol has its own matching engine running on a dedicated CPU core. A modern server with 64 cores can run 64 independent matching engines.

Scaling Model:
  Server 1 (64 cores): AAPL, GOOG, MSFT, ... (64 symbols)
  Server 2 (64 cores): AMZN, TSLA, META, ... (64 symbols)
  Server 3 (64 cores): JPM, BAC, WFC, ...    (64 symbols)
  ...

  For 8,000 listed symbols at ~64 symbols/server:
  ~125 servers for matching engines

  Most symbols are low-volume. The top 50 symbols generate
  ~80% of order flow. High-volume symbols get dedicated servers.
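A routing layer maps each symbol to its engine; every gateway must agree on the mapping so all orders for a symbol reach the same single-threaded engine. The sketch below uses an illustrative scheme — hot symbols pinned to dedicated engines, the long tail spread by a stable hash; the symbol lists and counts are assumptions, not real exchange configuration.

```python
# Illustrative static symbol -> matching-engine routing.
HOT_SYMBOLS = {"AAPL": "engine-aapl", "TSLA": "engine-tsla"}  # dedicated cores
SHARED_ENGINES = 62  # remaining cores for the long tail (illustrative)

def route(symbol: str) -> str:
    if symbol in HOT_SYMBOLS:
        return HOT_SYMBOLS[symbol]          # high-volume names get a dedicated engine
    # Stable, deterministic hash: every gateway computes the same engine
    # for a given symbol, preserving per-symbol ordering.
    return f"engine-{sum(symbol.encode()) % SHARED_ENGINES}"
```

Python's built-in `hash()` is deliberately avoided here because it is salted per process; routing must be identical across every gateway.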

Gateway Scaling

Gateways scale horizontally. Add more gateways to handle more client connections. Load balancing is done at the connection level — each client session is assigned to one gateway.

Market Data Scaling

Market data is the highest-bandwidth component. During peak activity, an exchange generates millions of market data messages per second. Common scaling strategies include sharding the feed across multicast channels by symbol range, offering tiered products (top-of-book vs. full depth), and conflating updates for bandwidth-constrained subscribers.
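One widely used strategy is conflation for slower consumers: a lagging subscriber receives at most the latest pending update per symbol instead of every tick, so it can never fall arbitrarily behind. A hedged sketch (real feeds conflate per channel and signal sequence gaps to the client):

```python
# Illustrative conflating queue: newer updates for a symbol overwrite
# older pending ones; drain order follows first arrival per symbol.
class ConflatingQueue:
    def __init__(self):
        self.latest = {}    # symbol -> most recent pending update
        self.order = []     # symbols with a pending update, FIFO

    def publish(self, symbol, update):
        if symbol not in self.latest:
            self.order.append(symbol)
        self.latest[symbol] = update    # overwrite any older pending tick

    def drain(self):
        out = [(s, self.latest.pop(s)) for s in self.order]
        self.order.clear()
        return out
```

The trade-off: conflated subscribers see the correct current price but not every intermediate trade, which is acceptable for display feeds but not for the full-depth feeds consumed by algorithmic traders.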

Key Design Decisions Summary

| Decision | Choice | Rationale |
|---|---|---|
| Matching engine threading | Single-threaded per symbol | Eliminates locks, maximizes cache locality, enables deterministic replay |
| Scaling strategy | Symbol partitioning | Orders for the same symbol must be processed sequentially; different symbols are independent |
| Sequencing | Central sequencer with replication | Single source of truth for event ordering; enables deterministic replay |
| Persistence | Event sourcing with snapshots | Complete audit trail + fast recovery |
| Market data transport | Multicast UDP | Low latency, fair (simultaneous) delivery, efficient for one-to-many |
| Client protocol | FIX (with binary variants for speed) | Industry standard, broad client support |
| Matching algorithm | Price-time priority | Fair, simple, widely understood; standard for equity markets |

Final Summary

Designing a stock exchange is one of the most constrained system design problems: ultra-low latency, perfect correctness, deterministic processing, and complete auditability — all at massive scale. The key architectural principles are a single-threaded, deterministic matching engine per symbol; a central sequencer imposing a total order on all events; event sourcing with snapshots for auditability and recovery; symbol partitioning for horizontal scale; and multicast UDP for fair market data delivery.

The fundamental insight: An exchange is not a typical distributed system. The core matching logic is intentionally centralized — a single-threaded, deterministic state machine processing a totally ordered event stream. Distributed systems techniques (consensus, eventual consistency, sharding) are used for surrounding infrastructure (gateways, market data, reporting), but the matching engine itself is as simple and fast as possible.