Design: Stock Exchange
Introduction
A stock exchange is the backbone of global financial markets. Every day, billions of dollars change hands through electronic exchanges like the NYSE, NASDAQ, and the London Stock Exchange. At its heart, an exchange does something deceptively simple: it matches buyers with sellers. But doing this at sub-millisecond latency, for millions of orders per day, with perfect accuracy and complete auditability, is one of the most demanding system design challenges in existence.
Modern exchanges process peak throughputs of over 100,000 messages per second with median latencies under 50 microseconds. A single bit flip, a single out-of-order message, or a single lost trade can trigger regulatory investigations and cost millions. There is no room for "eventual consistency" in an exchange — every order, every match, every cancellation must be deterministic, auditable, and exactly-once.
In this post, we'll design a stock exchange from scratch, covering order types, order book mechanics, the matching engine, sequencing, market data distribution, risk management, and the FIX protocol. We'll also dive deep into the latency optimization techniques that separate real trading systems from typical distributed systems.
Requirements
Functional Requirements
- Place orders — participants can submit buy and sell orders (market, limit, stop) for listed instruments
- Cancel / modify orders — participants can cancel open orders or modify quantity/price
- Match buy and sell orders — the engine matches compatible orders using price-time priority
- Market data feed — broadcast real-time price updates, order book depth, and trade reports to all subscribers
- Trade reporting — generate confirmations for matched trades and report to clearing/settlement systems
- Risk checks — pre-trade validation of account balances, position limits, and regulatory constraints
- Support multiple instruments — the exchange lists thousands of symbols (e.g., AAPL, GOOG, TSLA)
Non-Functional Requirements
| Metric | Target |
|---|---|
| Matching latency (p99) | < 1 ms (top exchanges target < 100 μs) |
| Throughput | Millions of orders/day (100K+ messages/sec peak) |
| Availability | 99.99% during market hours (6.5 hrs/day for US equities) |
| Correctness | Zero tolerance for lost orders, duplicate executions, or reordering |
| Fairness | All participants see the same data at the same time; FIFO ordering at each price level |
| Auditability | Every event must be logged and replayable for regulatory compliance |
Order Types
Before designing the matching engine, we need to understand the different kinds of orders that participants can place. Each order type has different matching semantics and edge cases.
Market Order
A market order says: "I want to buy (or sell) N shares immediately at the best available price." Market orders prioritize execution speed over price. They are guaranteed to execute (as long as there is liquidity on the other side), but the execution price may be worse than expected if the order book is thin.
Market Buy Order:
Symbol: AAPL
Side: BUY
Quantity: 100 shares
Price: N/A (take the best available ask price)
Execution: Immediately matches against the lowest-priced
sell orders in the order book.
Risk: In illiquid markets, a large market order can "walk the book" — consuming multiple price levels and resulting in a much worse average execution price than expected. This is called slippage.
Limit Order
A limit order says: "I want to buy (or sell) N shares at price P or better." A limit buy order will only execute at price ≤ P. A limit sell order will only execute at price ≥ P. If no matching counterparty exists, the order rests in the order book until filled, cancelled, or expired.
Limit Buy Order:
Symbol: AAPL
Side: BUY
Quantity: 100 shares
Price: $150.00 (max price willing to pay)
Execution: Matches against sell orders priced ≤ $150.00.
If no match, rests in the order book at $150.00.
Limit Sell Order:
Symbol: AAPL
Side: SELL
Quantity: 50 shares
Price: $151.00 (min price willing to accept)
Execution: Matches against buy orders priced ≥ $151.00.
If no match, rests in the order book at $151.00.
Limit orders form the backbone of the order book. The vast majority of orders on any exchange are limit orders — they provide liquidity by indicating willingness to trade at specific prices.
Stop Order
A stop order is a conditional order: "When the market price reaches P, convert this into a market order." Stop orders are used for risk management — a stop-loss sell order, for example, automatically sells a position if the price drops below a threshold.
Stop-Loss Sell Order:
Symbol: AAPL
Side: SELL
Quantity: 100 shares
Stop Price: $145.00
Behavior: When AAPL trades at or below $145.00, this
becomes a market sell order and executes
at the best available bid price.
Stop-Limit Sell Order:
Symbol: AAPL
Side: SELL
Quantity: 100 shares
Stop Price: $145.00
Limit Price: $144.50
Behavior: When AAPL trades at or below $145.00, this
becomes a limit sell order at $144.50.
Time-in-Force Qualifiers
Every order also has a time-in-force (TIF) qualifier that specifies how long it remains active:
- Day — valid until end of trading session (most common)
- GTC (Good-Till-Cancelled) — remains active across sessions until filled or cancelled
- IOC (Immediate-or-Cancel) — execute as much as possible immediately, cancel the rest
- FOK (Fill-or-Kill) — execute the entire quantity immediately, or cancel the entire order
- GTD (Good-Till-Date) — remains active until a specified date
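The TIF qualifiers above change only what happens to unfilled quantity around the immediate matching pass. A minimal sketch of that branching in Rust (enum and function names are illustrative, not from any real exchange API):

```rust
// Illustrative sketch: how TIF affects unfilled quantity.
// TimeInForce / MatchOutcome are invented names for this example.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum TimeInForce { Day, Gtc, Ioc, Fok, Gtd }

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum MatchOutcome { RestRemainder, CancelRemainder, RejectWhole }

/// `fillable_qty` is how much could match immediately against the book.
/// A real engine checks FOK feasibility *before* executing any fills, so
/// a partially fillable FOK order trades nothing at all.
pub fn after_match(tif: TimeInForce, order_qty: u64, fillable_qty: u64) -> MatchOutcome {
    match tif {
        // FOK: entire quantity immediately, or nothing.
        TimeInForce::Fok if fillable_qty < order_qty => MatchOutcome::RejectWhole,
        // IOC (and a fully fillable FOK): nothing ever rests in the book.
        TimeInForce::Ioc | TimeInForce::Fok => MatchOutcome::CancelRemainder,
        // Day / GTC / GTD: the remainder rests (expiry is handled elsewhere).
        TimeInForce::Day | TimeInForce::Gtc | TimeInForce::Gtd => MatchOutcome::RestRemainder,
    }
}
```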
Additional Order Types
- Iceberg / Reserve — only a portion of the total quantity is visible in the order book; the rest is hidden and replenished as the visible portion fills
- Pegged — price automatically adjusts to track the best bid/ask (e.g., midpoint peg)
- All-or-None (AON) — must be filled entirely in a single match, or not at all
The Order Book
The order book is the central data structure of any exchange. It maintains all outstanding (resting) limit orders for a given instrument, organized by price and arrival time. There is one order book per symbol.
Structure
An order book has two sides:
- Buy side (bids) — sorted in descending order by price. The highest bid is at the top because buyers want to pay as little as possible, but the most aggressive buyer (highest price) gets priority.
- Sell side (asks/offers) — sorted in ascending order by price. The lowest ask is at the top because sellers want to receive as much as possible, but the most aggressive seller (lowest price) gets priority.
Order Book: AAPL
═══════════════════════════════════════════════
BIDS (Buy) │ ASKS (Sell)
─────────────────────────┼─────────────────────
Price Qty Orders │ Price Qty Orders
$150.10 300 [3 ord] │ $150.20 150 [2 ord]
$150.00 500 [5 ord] │ $150.30 200 [1 ord]
$149.90 200 [2 ord] │ $150.50 400 [4 ord]
$149.80 100 [1 ord] │ $150.60 100 [1 ord]
═══════════════════════════════════════════════
Best Bid: $150.10 │ Best Ask: $150.20
Spread: $0.10 (0.07%)
═══════════════════════════════════════════════
The spread is the difference between the best bid and the best ask. A tight spread indicates a liquid market with active trading. The spread is never negative on a well-functioning exchange — if the best bid exceeds the best ask, those orders would immediately match.
Price Levels
At each price level, multiple orders may exist. Within a price level, orders are arranged in FIFO (first-in, first-out) order — the order that arrived first gets filled first. This combination of price priority and time priority is called price-time priority (or strict FIFO matching).
Price Level: $150.10 (Buy side)
───────────────────────────────────────
Seq# Time Qty Trader
1042 09:30:01 100 FirmA
1087 09:30:02 150 FirmB
1103 09:30:03 50 FirmC
───────────────────────────────────────
Total: 300 shares across 3 orders
When a sell order arrives that matches at $150.10:
→ FirmA's order fills first (arrived earliest)
→ Then FirmB, then FirmC
Data Structure Implementation
The order book must support several operations efficiently:
| Operation | Frequency | Target |
|---|---|---|
| Find best bid/ask | Every incoming order | O(1) |
| Add order at price level | Every new limit order | O(1) amortized |
| Remove order by ID | Every cancel/fill | O(1) |
| Walk price levels (matching) | Every crossing order | O(levels consumed) |
| Remove empty price level | When last order at a level fills | O(log P) |
A common implementation uses:
- Sorted map (red-black tree or skip list) keyed by price → gives O(log P) insert/delete of price levels and O(1) access to best price
- Doubly-linked list at each price level → O(1) append (new order) and O(1) removal (cancel) with pointer from order ID
- Hash map from order ID → order node → O(1) lookup for cancels and modifications
struct OrderBook {
    bids: BTreeMap<Price, PriceLevel>,  // best bid = highest key (iterate in reverse)
    asks: BTreeMap<Price, PriceLevel>,  // best ask = lowest key
    orders: HashMap<OrderId, *mut OrderNode>,  // O(1) lookup for cancels
}

struct PriceLevel {
    price: Price,
    total_quantity: u64,
    head: *mut OrderNode,  // first order (FIFO front)
    tail: *mut OrderNode,  // last order (FIFO back)
}

struct OrderNode {
    order_id: OrderId,
    quantity: u64,
    timestamp: u64,
    prev: *mut OrderNode,
    next: *mut OrderNode,
}
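The raw-pointer layout above mirrors the intrusive linked lists used in C/C++ engines. For illustration, here is a safe-Rust sketch of the same access pattern using standard collections — slower in constant factors, but it shows the shape. A VecDeque stands in for the FIFO list, prices are integer ticks (cents) to avoid floating point, and all names are illustrative:

```rust
use std::collections::{BTreeMap, VecDeque};

type Price = u64;   // price in ticks (e.g. cents) — never floating point
type OrderId = u64;

#[derive(Debug)]
struct RestingOrder { id: OrderId, qty: u64 }

/// One order book side per map. BTreeMap iterates ascending, so the best
/// bid is the *last* (highest) key and the best ask the *first* (lowest).
#[derive(Default)]
struct Book {
    bids: BTreeMap<Price, VecDeque<RestingOrder>>, // FIFO queue per level
    asks: BTreeMap<Price, VecDeque<RestingOrder>>,
}

impl Book {
    /// New limit order joins the back of the FIFO queue at its price level.
    fn add(&mut self, is_buy: bool, price: Price, id: OrderId, qty: u64) {
        let side = if is_buy { &mut self.bids } else { &mut self.asks };
        side.entry(price).or_default().push_back(RestingOrder { id, qty });
    }
    fn best_bid(&self) -> Option<Price> { self.bids.keys().next_back().copied() }
    fn best_ask(&self) -> Option<Price> { self.asks.keys().next().copied() }
}
```

A real engine adds the order-ID hash map for O(1) cancels and avoids per-order heap allocation via pooling, as discussed later.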
The Matching Engine
The matching engine is the heart of the exchange. It receives orders, applies them to the order book, and generates trades (executions) when buy and sell orders are compatible. The matching engine for a given symbol is a single-threaded, deterministic state machine — this is a critical design decision.
Price-Time Priority Algorithm
The most common matching algorithm is price-time priority (also called strict FIFO). The rules are simple:
- Price priority: The best-priced order on the opposite side gets matched first. For a buy, this means the lowest-priced sell. For a sell, this means the highest-priced buy.
- Time priority: Among orders at the same price, the order that arrived first gets matched first (FIFO).
- Crossing condition: A buy order matches with a sell order when the buy price ≥ the sell price.
Matching Algorithm (Pseudocode)
function match(incoming_order, order_book):
    if incoming_order.side == BUY:
        opposite_book = order_book.asks   // ascending price
    else:
        opposite_book = order_book.bids   // descending price

    remaining_qty = incoming_order.quantity

    while remaining_qty > 0 and opposite_book is not empty:
        best_level = opposite_book.first()   // best price level

        // Check crossing condition at the best opposite price
        if incoming_order.side == BUY:
            crosses = incoming_order is MARKET
                      or incoming_order.price >= best_level.price
        else:  // SELL
            crosses = incoming_order is MARKET
                      or incoming_order.price <= best_level.price
        if not crosses:
            break   // no match possible; stop walking the book

        // Walk the FIFO queue at this price level
        while remaining_qty > 0 and best_level is not empty:
            resting_order = best_level.head              // oldest order
            fill_qty = min(remaining_qty, resting_order.quantity)
            fill_price = best_level.price                // resting order's price

            // Generate execution report
            emit Trade(buy_order_id, sell_order_id, symbol,
                       fill_qty, fill_price, timestamp)

            remaining_qty -= fill_qty
            resting_order.quantity -= fill_qty
            if resting_order.quantity == 0:
                remove resting_order from level and order map

        if best_level is empty:
            remove best_level from opposite_book

    // Handle remaining quantity
    if remaining_qty > 0:
        if incoming_order is LIMIT:
            add incoming_order (with remaining_qty) to same-side book
        else:  // MARKET order with unfilled quantity
            cancel remainder (or reject, per exchange policy)

    return trades
Execution Price Rule
When two orders match, which price is used? The standard rule is: the resting order's price. The order that was already in the book (passive) determines the trade price, not the incoming (aggressive) order. This incentivizes providing liquidity — your limit order will fill at the price you specified, never worse.
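This rule is small enough to state in code. A hedged sketch with integer tick prices (function names are illustrative):

```rust
/// Crossing condition: a buy crosses when its limit is at or above the
/// resting price; a sell crosses when its limit is at or below it.
pub fn crosses(incoming_is_buy: bool, incoming_limit: u64, resting_price: u64) -> bool {
    if incoming_is_buy {
        incoming_limit >= resting_price
    } else {
        incoming_limit <= resting_price
    }
}

/// Execution price rule: the trade prints at the resting (passive)
/// order's price, regardless of how aggressive the incoming order was.
pub fn execution_price(_incoming_limit: u64, resting_price: u64) -> u64 {
    resting_price // the passive side sets the price
}
```

An aggressive buy limit at $150.20 hitting a resting ask at $150.15 prints at $150.15 — the aggressor gets price improvement, and the passive order never fills worse than its stated price.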
Alternative Matching Algorithms
While price-time priority is the most common, some venues use alternative algorithms:
- Pro-rata — at the same price level, quantity is allocated proportionally to order size rather than FIFO. Used in some options and futures exchanges (e.g., CME for some contracts).
- Price-size-time — larger orders at the same price get priority over smaller ones. Rarely used due to fairness concerns.
- Price-time with allocation minimum — pro-rata with a minimum fill size to prevent tiny allocations.
The Sequencer
In a stock exchange, ordering is everything. If two participants submit orders at nearly the same time, who gets priority? The answer must be deterministic, reproducible, and fair. This is the sequencer's job.
Why We Need a Sequencer
Consider the problem: multiple gateway servers receive orders from different participants over the network. Each gateway has a slightly different local clock. Network jitter means packets arrive at different times. Without a central sequencing mechanism, there is no single truth about the order of events.
The sequencer solves this by assigning a monotonically increasing sequence number to every incoming message before it reaches the matching engine. This creates a single, totally ordered stream of events.
Without Sequencer:
Gateway A receives Order X at T=09:30:00.001234
Gateway B receives Order Y at T=09:30:00.001235
Which arrived first? Depends on clock sync accuracy.
With Sequencer:
Sequencer assigns Order X → Seq #10042
Sequencer assigns Order Y → Seq #10043
Order X is processed first. Period. Deterministic.
Sequencer Design
The sequencer is a simple but critical component:
- Receives messages from all gateways
- Assigns a globally unique, monotonically increasing sequence number
- Writes the sequenced message to a durable log
- Forwards the sequenced message to the matching engine
The sequencer must be single-threaded — a global sequence number requires a single point of serialization. This sounds like a bottleneck, but in practice a well-optimized sequencer can stamp millions of messages per second on modern hardware.
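The stamping step itself is tiny — a counter plus an append. A minimal single-threaded sketch, where an in-memory Vec stands in for the durable log and the struct names are illustrative:

```rust
/// A sequenced message: the global sequence number plus the payload.
#[derive(Debug, PartialEq)]
pub struct Sequenced { pub seq: u64, pub payload: String }

/// Single-threaded sequencer: stamp, persist, then the caller forwards
/// the message to the matching engine. The Vec is a stand-in for a
/// durable append-only log (mmap'd file, replicated stream, etc.).
pub struct Sequencer { next_seq: u64, pub log: Vec<Sequenced> }

impl Sequencer {
    pub fn new() -> Self { Sequencer { next_seq: 1, log: Vec::new() } }

    /// Assign the next monotonically increasing sequence number and
    /// append to the log. Returns the assigned number.
    pub fn stamp(&mut self, payload: &str) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.log.push(Sequenced { seq, payload: payload.to_string() });
        seq
    }
}
```

Because a single thread owns the counter, no atomic operations or locks are needed — which is exactly why the sequencer can stamp millions of messages per second.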
Deterministic Replay
The sequencer enables one of the most powerful properties of exchange architecture: deterministic replay. Since every message has a sequence number and the matching engine is a deterministic state machine, you can replay the entire sequence of events and reconstruct the exact state of the order book at any point in time.
Replay Guarantee:
Given the same sequence of inputs (with sequence numbers),
the matching engine will produce the exact same sequence
of outputs (trades, order book states).
This enables:
✓ Disaster recovery — replay from the log to rebuild state
✓ Regulatory audit — reconstruct the market at any timestamp
✓ Testing — replay production sequences in test environments
✓ Hot standby — a secondary matching engine replays the
same stream and can take over instantly on failure
High Availability
The sequencer is a single point of failure (by design — it must be single-threaded). To achieve high availability:
- Primary-backup replication: The primary sequencer synchronously replicates each sequenced message to a hot standby before acknowledging. On failure, the standby takes over from the last replicated sequence number.
- Raft/Paxos for leader election: A consensus protocol can manage the failover, but adds latency to every message. Some exchanges accept this trade-off; others use simpler heartbeat-based failover.
- Dual-write: Write every message to two independent logs. On failover, the new primary reads from the log and continues from the last sequence number.
High-Level Architecture
The end-to-end architecture of a stock exchange follows a strict pipeline:
                        ┌──────────────┐
Participants ──────────▶│   Gateways   │
(FIX clients)           │ (Validate &  │
                        │ Authenticate)│
                        └──────┬───────┘
                               │
                               ▼
                        ┌──────────────┐
                        │  Sequencer   │
                        │ (Assign seq# │
                        │  + persist)  │
                        └──────┬───────┘
                               │
                        ┌──────┴───────┐
                        │              │
                        ▼              ▼
                 ┌────────────┐ ┌────────────┐
                 │ Risk Check │ │  Matching  │
                 │   Engine   │ │   Engine   │
                 │(Pre-trade) │ │(Per-symbol)│
                 └────────────┘ └──────┬─────┘
                                       │
                        ┌──────────────┼──────────────┐
                        ▼              ▼              ▼
                 ┌────────────┐ ┌────────────┐ ┌────────────┐
                 │   Market   │ │   Trade    │ │ Reporting  │
                 │  Data Pub  │ │  Confirm   │ │ / Clearing │
                 └────────────┘ └────────────┘ └────────────┘
Component Breakdown
1. Gateway
The gateway is the entry point for all client connections. It handles:
- Protocol translation: Parses FIX messages (or proprietary binary protocols) from clients
- Authentication: Verifies client identity and session keys
- Basic validation: Checks message format, required fields, symbol validity
- Rate limiting: Prevents a single participant from overwhelming the system
- Connection management: Manages TCP sessions, heartbeats, and reconnection logic
Multiple gateway instances run in parallel for scalability. Each client connects to one gateway, but gateways are stateless — clients can reconnect to any gateway after a failure.
2. Sequencer
As described above, the sequencer assigns a global ordering to all messages. It sits between the gateways and the matching engine, creating a single stream of sequenced events.
3. Risk Check Engine
Before an order reaches the matching engine, it must pass pre-trade risk checks:
- Account balance: Does the buyer have sufficient funds? Does the seller have the shares?
- Position limits: Would this order exceed the firm's maximum position in this instrument?
- Order size limits: Is the order quantity within acceptable bounds?
- Price collars: Is the order price within a reasonable range of the current market price? (Prevents fat-finger errors)
- Regulatory limits: Short-selling restrictions, circuit breakers, etc.
Risk checks must be fast — they're in the critical path. Some exchanges perform risk checks before sequencing (at the gateway), others after sequencing but before matching. The trade-off is between latency and consistency.
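As one example of a check in this critical path, a price collar is a couple of integer operations. A sketch with illustrative names — real venues set collar widths per instrument and session phase:

```rust
/// Pre-trade price collar: accept only orders priced within `collar_bps`
/// basis points of the last trade (fat-finger protection).
/// Prices are in ticks (cents); 500 bps = 5%.
pub fn within_collar(order_price: u64, last_trade: u64, collar_bps: u64) -> bool {
    let band = last_trade * collar_bps / 10_000;
    order_price >= last_trade.saturating_sub(band) && order_price <= last_trade + band
}
```

With a $150.00 last trade and a 5% collar, orders between $142.50 and $157.50 pass; anything outside is rejected before it can touch the book.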
4. Matching Engine
The matching engine runs the price-time priority algorithm described earlier. A critical design decision is: one matching engine instance per symbol. This avoids the need for distributed state, locks, or consensus within the matching logic.
Symbol Partitioning:
AAPL → Matching Engine #1 (Single thread)
GOOG → Matching Engine #2 (Single thread)
TSLA → Matching Engine #3 (Single thread)
AMZN → Matching Engine #4 (Single thread)
...
Each matching engine:
✓ Single-threaded — no locks, no contention
✓ Deterministic — same inputs → same outputs
✓ In-memory — entire order book in RAM
✓ Pinned to a CPU core — avoids context switches
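Routing a symbol to its engine can be as simple as a hash, though production exchanges typically use a static symbol-to-engine assignment table balanced by expected volume. A sketch:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route a symbol to one of `num_engines` single-threaded matching
/// engines. Hash-based routing is one possible scheme; the key property
/// is that every message for a given symbol lands on the same engine.
pub fn engine_for(symbol: &str, num_engines: u64) -> u64 {
    let mut h = DefaultHasher::new();
    symbol.hash(&mut h);
    h.finish() % num_engines
}
```

Because ordering only matters *within* a symbol, this partitioning scales horizontally without any cross-engine coordination.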
5. Market Data Publisher
After every match or order book change, the market data publisher broadcasts updates to all subscribers. This component is discussed in detail below.
6. Trade Confirmation & Reporting
Executed trades generate confirmations sent back to both buyer and seller. Trade reports are also sent to:
- Clearing house: For trade settlement (T+1 or T+2)
- Regulatory systems: For market surveillance and audit trails
- Drop copy feeds: Backup trade reports for compliance systems
Market Data
Market data is the real-time feed of price and order information that the exchange broadcasts to all participants. It's how traders know what's happening in the market. There are different levels of market data, each with increasing detail.
Level 1 Data (Top of Book)
Level 1 provides the most basic information:
- Best Bid: highest buy price + quantity
- Best Ask: lowest sell price + quantity
- Last Trade: price + quantity of the most recent execution
- Volume: total shares traded today
Level 1 Market Data (AAPL):
Best Bid: $150.10 × 300
Best Ask: $150.20 × 150
Last Trade: $150.15 × 50
Volume: 12,345,678
High: $151.20
Low: $149.80
Open: $150.00
Level 2 Data (Depth of Book)
Level 2 shows the full order book depth — all price levels with their aggregate quantities:
Level 2 Market Data (AAPL):
BIDS ASKS
$150.10 300 (3 orders) $150.20 150 (2 orders)
$150.00 500 (5 orders) $150.30 200 (1 order)
$149.90 200 (2 orders) $150.50 400 (4 orders)
$149.80 100 (1 order) $150.60 100 (1 order)
$149.70 350 (4 orders) $150.70 250 (3 orders)
Level 2 data is essential for algorithmic trading and market making. By seeing the full depth, traders can estimate how much liquidity is available at each price and predict how their orders will fill.
Level 3 Data (Full Order-by-Order)
Level 3 provides every individual order in the book (not just aggregated levels). This is the most granular data:
Level 3 Feed Messages:
ADD OrderID=1042 BUY 100@$150.10 09:30:01.234
ADD OrderID=1043 BUY 150@$150.10 09:30:01.456
ADD OrderID=1044 SELL 50@$150.20 09:30:01.789
TRADE OrderID=1045 vs 1044 50@$150.20
CANCEL OrderID=1042 BUY 100@$150.10
With Level 3 data, a subscriber can reconstruct the exact order book locally, including individual order positions within each price level.
Feed Handlers & Multicast UDP
Market data must be delivered to thousands of subscribers simultaneously with minimal latency. The standard approach uses IP multicast over UDP:
- Multicast: A single packet is sent to a multicast group address, and the network hardware (switches, routers) replicates it to all subscribers. This is far more efficient than sending unicast TCP to each subscriber individually.
- UDP: No connection setup, no acknowledgments, no retransmission — minimum latency. But UDP is unreliable: packets can be lost, duplicated, or reordered.
- Sequence numbers: Every market data message includes a sequence number. Subscribers detect gaps and request retransmissions from a recovery server (typically TCP-based).
Market Data Distribution:

Matching Engine → Market Data Publisher
                            │
                      Multicast UDP
                    (239.1.2.3:50001)
                            │
                 ┌──────────┼──────────┐
                 ▼          ▼          ▼
            Subscriber  Subscriber  Subscriber
             (FirmA)     (FirmB)     (FirmC)

Gap detected? → Request retransmit via TCP recovery
Multicast Benefits:
✓ Fair — all subscribers receive at the same time
✓ Efficient — one send, many receivers
✓ Low latency — no TCP handshake or ACK overhead
Multicast Risks:
✗ Packet loss (mitigated by sequence numbers + recovery)
✗ Out-of-order (mitigated by sequence numbers)
✗ Network congestion (mitigated by rate limiting)
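Subscriber-side gap detection is straightforward given per-message sequence numbers. A sketch (names illustrative; the returned gap would be forwarded to the TCP recovery server):

```rust
/// Tracks the next expected sequence number on a UDP market data feed
/// and reports any missing sequence numbers when a gap appears.
pub struct FeedTracker { next_expected: u64 }

impl FeedTracker {
    pub fn new() -> Self { FeedTracker { next_expected: 1 } }

    /// Process the sequence number of an arriving packet. Returns the
    /// missing sequence numbers (empty when the packet is in order).
    pub fn on_packet(&mut self, seq: u64) -> Vec<u64> {
        let gap: Vec<u64> = (self.next_expected..seq).collect();
        if seq >= self.next_expected {
            self.next_expected = seq + 1;
        } // else: stale duplicate — ignore it
        gap
    }
}
```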
Risk Management
Risk management in an exchange operates at multiple layers, from pre-trade checks on individual orders to market-wide circuit breakers.
Pre-Trade Risk Checks
Every order passes through risk validation before matching:
| Check | Description | Action on Failure |
|---|---|---|
| Buying power | Buyer has sufficient funds (cash + margin) | Reject order |
| Share availability | Seller has shares to sell (or is permitted to short-sell) | Reject order |
| Position limits | Order won't exceed maximum allowed position | Reject order |
| Price collar | Order price is within X% of last trade (fat-finger protection) | Reject order |
| Order rate limit | Participant hasn't exceeded max orders/second | Throttle or reject |
| Self-trade prevention | Order won't trade against same firm's resting order | Cancel one side |
Circuit Breakers
When prices move too far too fast, circuit breakers halt trading to prevent cascading crashes:
- LULD (Limit Up-Limit Down): Individual stock trading halts if the price moves beyond a band calculated from the recent average price. Band width varies by market cap and time of day (5%, 10%, or 20%).
- Market-wide circuit breakers: Trading in all stocks halts if the S&P 500 declines by 7% (Level 1), 13% (Level 2), or 20% (Level 3) from the previous day's close.
LULD Circuit Breaker (Example):
AAPL Reference Price: $150.00
Band Width: 5%
Upper Limit: $157.50
Lower Limit: $142.50
If AAPL trades at $142.49 → 15-second trading pause
After pause, new bands calculated
If still outside bands → extended halt
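The band arithmetic above is easy to sketch. This is simplified — the real LULD reference price is a rolling five-minute average, and bands double in the final minutes of trading — with prices in cents:

```rust
/// LULD band sketch: `band_pct` is the percentage band for the stock's
/// tier (5, 10, or 20). Returns (lower_limit, upper_limit) in cents.
pub fn luld_bands(reference_cents: u64, band_pct: u64) -> (u64, u64) {
    let band = reference_cents * band_pct / 100;
    (reference_cents - band, reference_cents + band)
}

/// A trade outside the bands triggers a trading pause.
pub fn triggers_halt(trade_cents: u64, bands: (u64, u64)) -> bool {
    trade_cents < bands.0 || trade_cents > bands.1
}
```

This reproduces the example above: a $150.00 reference with a 5% band gives limits of $142.50 and $157.50, and a print at $142.49 triggers the pause.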
Kill Switch
Every participant must implement a "kill switch" — the ability to immediately cancel all open orders and halt new order submission. The exchange provides a mass cancel API for this purpose. This is a regulatory requirement (SEC Rule 15c3-5) designed to prevent runaway algorithms from destabilizing the market.
The FIX Protocol
FIX (Financial Information eXchange) is the de facto standard protocol for electronic trading communication. Developed in 1992 by Fidelity Investments and Salomon Brothers, it is now used by virtually every exchange, broker, and institutional trader worldwide.
Message Structure
FIX messages are tag-value pairs separated by a delimiter (SOH character, ASCII 0x01). Each tag is a number defined in the FIX specification:
FIX New Order (Single) Message:
8=FIX.4.4 (BeginString: protocol version)
9=176 (BodyLength)
35=D (MsgType: D = New Order Single)
49=CLIENT_A (SenderCompID)
56=EXCHANGE (TargetCompID)
34=12 (MsgSeqNum)
52=20260415-09:30:01.234 (SendingTime)
11=ORD-20260415-001 (ClOrdID: client order ID)
55=AAPL (Symbol)
54=1 (Side: 1=Buy, 2=Sell)
38=100 (OrderQty)
40=2 (OrdType: 1=Market, 2=Limit)
44=150.00 (Price)
59=0 (TimeInForce: 0=Day)
10=142 (CheckSum)
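The CheckSum field (tag 10) is defined as the sum of every byte of the message up to and including the SOH that precedes the checksum field, modulo 256, rendered as exactly three digits. A sketch:

```rust
/// FIX CheckSum (tag 10): byte sum of the message through the SOH that
/// precedes the "10=" field, modulo 256, zero-padded to three digits.
pub fn fix_checksum(msg_through_last_soh: &[u8]) -> String {
    let sum: u32 = msg_through_last_soh.iter().map(|&b| b as u32).sum();
    format!("{:03}", sum % 256)
}
```

It is an error-detection code only — FIX relies on TCP and the session-layer sequence numbers for actual reliability.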
Key Message Types
| MsgType (Tag 35) | Name | Direction | Purpose |
|---|---|---|---|
| A | Logon | Both | Authenticate and start session |
| D | New Order Single | Client → Exchange | Submit a new order |
| F | Order Cancel Request | Client → Exchange | Request to cancel an open order |
| G | Order Cancel/Replace | Client → Exchange | Modify an open order (price/qty) |
| 8 | Execution Report | Exchange → Client | Order accepted, filled, cancelled, rejected |
| 9 | Order Cancel Reject | Exchange → Client | Cancel request rejected |
| 0 | Heartbeat | Both | Session keepalive |
| 5 | Logout | Both | End session gracefully |
Session Layer
FIX has a session layer that provides reliability guarantees on top of TCP:
- Sequence numbers: Every message has a monotonically increasing sequence number (Tag 34). Both sides track expected sequence numbers. A gap indicates a missed message.
- Heartbeats: Periodic heartbeat messages (Type 0) detect connection failures. If no message is received within the heartbeat interval, a Test Request (Type 1) is sent. No response → connection is considered dead.
- Resend Request: If a gap is detected (e.g., received seq 15 but expected 13), a Resend Request (Type 2) asks for retransmission of messages 13-14.
- Sequence Reset: Used during recovery to skip over administrative messages that don't need retransmission.
FIX Performance Considerations
While FIX is the industry standard, its tag-value text format is relatively slow to parse. High-frequency trading firms often use:
- FAST (FIX Adapted for Streaming) — binary encoding of FIX messages with field presence maps and delta compression. 5-10× more compact.
- SBE (Simple Binary Encoding) — fixed-length binary format designed for ultra-low-latency. No field delimiters, no tag numbers — just raw bytes at known offsets. Parsing is essentially a pointer cast.
- Proprietary binary protocols — some exchanges (e.g., NASDAQ OUCH, CME iLink) use custom binary protocols that are faster than FIX but require exchange-specific integration.
Encoding Comparison (New Order message):
FIX 4.4 text: ~200 bytes, parse time ~1-5 μs
FAST encoded: ~40 bytes, parse time ~0.5 μs
SBE encoded: ~60 bytes, parse time ~0.1 μs
Proprietary binary: ~30 bytes, parse time <0.1 μs
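To see why SBE parsing is "essentially a pointer cast", here is a sketch using an invented 17-byte layout (not any real exchange schema): each field is read from a fixed offset, with no tag scan and no delimiters.

```rust
use std::convert::TryInto;

// Invented layout for illustration only:
//   offset 0..8  : order_qty (u64, little-endian)
//   offset 8..16 : price     (u64 ticks, little-endian)
//   offset 16    : side      (1 = buy, 2 = sell)

pub fn encode(qty: u64, price: u64, side: u8) -> [u8; 17] {
    let mut b = [0u8; 17];
    b[0..8].copy_from_slice(&qty.to_le_bytes());
    b[8..16].copy_from_slice(&price.to_le_bytes());
    b[16] = side;
    b
}

// "Parsing" is just reading integers at known offsets.
pub fn decode_qty(buf: &[u8]) -> u64 {
    u64::from_le_bytes(buf[0..8].try_into().unwrap())
}
pub fn decode_price(buf: &[u8]) -> u64 {
    u64::from_le_bytes(buf[8..16].try_into().unwrap())
}
pub fn decode_side(buf: &[u8]) -> u8 { buf[16] }
```

Contrast this with FIX text, where every field requires scanning for the tag number, the `=`, and the SOH delimiter before the value can even be converted.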
Event Sourcing & Persistence
A stock exchange is a textbook use case for event sourcing. Instead of storing the current state of the order book, we store every event (order, cancel, trade) as an immutable log entry. The current state is derived by replaying events from the beginning.
Event Log
Every message processed by the exchange is persisted to an append-only event log:
Event Log (Sequenced):
Seq# Timestamp Type Data
10001 09:30:00.000123 NEW BUY 100@$150.10 (OrderID=A1)
10002 09:30:00.000456 NEW SELL 50@$150.20 (OrderID=A2)
10003 09:30:00.001234 NEW BUY 200@$150.20 (OrderID=A3)
10004 09:30:00.001234 TRADE A3 vs A2: 50@$150.20
10005 09:30:00.001234 PARTIAL A3: 150 remaining @$150.20
10006 09:30:00.002345 CANCEL A1 (BUY 100@$150.10)
10007 09:30:00.003456 NEW SELL 100@$150.15 (OrderID=A4)
10008 09:30:00.003456 TRADE A3 vs A4: 100@$150.15
10009 09:30:00.003456 PARTIAL A3: 50 remaining @$150.20 (rests in book)
Benefits of Event Sourcing
- Complete audit trail: Every order, modification, cancellation, and trade is recorded. Regulators can reconstruct the exact state of the market at any point.
- Deterministic replay: Feed the event log into a fresh matching engine → identical order book state. Used for disaster recovery, testing, and debugging.
- Time-travel queries: "What was the order book for AAPL at 10:42:35.123?" — replay events up to that timestamp.
- Hot standby: A secondary matching engine replays the same event stream in real-time. If the primary fails, the standby has identical state and can take over instantly.
Storage Implementation
The event log has unique requirements:
- Append-only: Events are never modified or deleted (regulatory requirement)
- Sequential write: All writes are sequential appends — perfect for spinning disks and SSDs alike
- High throughput: Must sustain 100K+ writes/second with durability (fsync)
- Low latency: Write must complete before the matching engine can proceed (in the critical path)
In practice, exchanges use:
- Memory-mapped files with periodic fsync — the OS kernel manages write-back, and the application writes at memory speed
- Kernel bypass I/O (io_uring, SPDK) for direct disk access without syscall overhead
- Battery-backed DRAM — writes go to battery-backed RAM at DRAM speed, then flushed to disk asynchronously. Provides durability guarantees with RAM latency.
- Replication before persist — some exchanges replicate the event to a remote standby before persisting to local disk. This provides network-level durability faster than disk fsync.
Snapshots
Replaying millions of events on startup is slow. To speed up recovery, the exchange periodically takes snapshots of the order book state:
Recovery:
1. Load latest snapshot (e.g., order book at Seq# 500000)
2. Replay events from Seq# 500001 onward
3. State is fully recovered
Snapshot frequency: every N events or every T seconds
Snapshot content: full order book state per symbol
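The recovery procedure above can be sketched as a fold over the log, skipping events already reflected in the snapshot. State here is reduced to price → aggregate quantity, and the event names are illustrative:

```rust
use std::collections::BTreeMap;

/// Minimal event type for the sketch: quantity added to or removed from
/// a price level (a real log carries full order/trade detail).
#[derive(Clone, Debug, PartialEq)]
pub enum Event {
    Add { price: u64, qty: u64 },
    Remove { price: u64, qty: u64 },
}

/// Restore the snapshot, then replay only the events sequenced *after*
/// it. Determinism of the engine guarantees the rebuilt state matches.
pub fn recover(
    snapshot: BTreeMap<u64, u64>, // book state as of `snapshot_seq`
    snapshot_seq: u64,
    log: &[(u64, Event)],         // (seq, event), ascending by seq
) -> BTreeMap<u64, u64> {
    let mut book = snapshot;
    for (seq, ev) in log {
        if *seq <= snapshot_seq { continue; } // already in the snapshot
        match ev {
            Event::Add { price, qty } => *book.entry(*price).or_insert(0) += qty,
            Event::Remove { price, qty } => {
                if let Some(q) = book.get_mut(price) {
                    *q = q.saturating_sub(*qty);
                    if *q == 0 { book.remove(price); } // drop empty level
                }
            }
        }
    }
    book
}
```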
Reporting & Clearing
Once a trade executes, it must flow through several downstream systems:
Trade Reporting
The reporting service provides:
- Trade confirmations: Sent to buyer and seller immediately after execution
- Regulatory reporting: Trades reported to the consolidated tape (SIP) for public dissemination
- Audit trail: Complete record for compliance, with CAT (Consolidated Audit Trail) reporting in the US
- Trade history: Queryable database for participants to review their execution history
Clearing & Settlement
After a trade is executed (matched), it must be cleared and settled:
- Clearing: The clearing house (e.g., DTCC in the US) steps in as the counterparty to both sides, ensuring that if one side defaults, the other is still made whole. The clearing house nets obligations across all trades.
- Settlement: Actual exchange of cash and securities. In the US, equity trades settle on T+1 (trade date plus one business day). The DTCC transfers shares from the seller's custodian to the buyer's custodian, and cash in the opposite direction.
Trade Lifecycle:
T+0 (Trade Date):
09:30:01 — Order submitted
09:30:01 — Order matched (trade executes)
09:30:01 — Execution report sent to both parties
09:30:01 — Trade reported to SIP/tape
18:00:00 — End of day: trades sent to clearing house
T+1 (Settlement Date):
Morning — Clearing house calculates net obligations
Afternoon — DTCC settles: shares and cash transfer
Post-Settlement:
Shares in buyer's account, cash in seller's account
Trade is final and irrevocable
Latency Optimization Techniques
The difference between a mediocre exchange and a world-class one often comes down to latency engineering. Here are the key techniques used by top exchanges to achieve sub-100-microsecond matching latency.
Hardware & OS Level
| Technique | Description | Latency Saving |
|---|---|---|
| CPU pinning | Pin the matching engine thread to a dedicated CPU core. Prevents context switches and keeps the working set in L1/L2 cache. | Eliminates ~5-15 μs context switch jitter |
| NUMA awareness | Ensure the matching engine's memory is allocated on the same NUMA node as its CPU core. Cross-node memory access adds ~100ns per access. | ~100ns per memory access saved |
| Huge pages | Use 2MB or 1GB huge pages to reduce TLB misses. Standard 4KB pages cause frequent TLB evictions for large working sets. | Reduces TLB miss latency |
| Kernel bypass networking | Use DPDK or Solarflare OpenOnload to bypass the kernel network stack entirely. Packets go directly from NIC to userspace via shared memory. | Saves ~10-50 μs per packet |
| Busy-polling | Instead of sleeping and being woken by interrupts, the matching engine thread busy-polls the network ring buffer. Wastes CPU but eliminates interrupt latency. | Saves ~5-10 μs interrupt wake-up |
| Disable frequency scaling | Fix CPU frequency at maximum (disable turbo boost and power saving). Frequency transitions cause latency spikes. | Eliminates ~50-100 μs frequency ramp |
Application Level
- Lock-free data structures: Use lock-free queues (e.g., LMAX Disruptor ring buffer) for inter-thread communication. No mutex contention, no priority inversion.
- Object pooling: Pre-allocate all objects (orders, trades, messages) at startup. Zero allocations in the hot path eliminates GC pauses (for JVM-based systems) or malloc latency.
- Cache-friendly data layout: Store order book data in contiguous arrays (struct-of-arrays) rather than pointer-heavy trees (array-of-structs with heap nodes). Cache misses are often the dominant cost.
- Branch prediction optimization: Arrange code so the common path (order doesn't cross the book) is predicted correctly. Use `likely()`/`unlikely()` compiler hints.
- Avoid syscalls: Every syscall (even `clock_gettime`) costs ~50-100ns due to the user-kernel mode switch. Use the `rdtsc` instruction for high-resolution timestamps.
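The object-pooling idea can be sketched briefly. This is a conceptual illustration in Python; a real hot path would use fixed-size structs in a systems language, not dicts:

```python
# Sketch of object pooling: allocate every order object once at startup,
# then recycle them so the hot path performs zero heap allocations.
class OrderPool:
    def __init__(self, capacity):
        # All allocation happens here, before trading begins.
        self._free = [
            {"order_id": None, "px": 0.0, "qty": 0} for _ in range(capacity)
        ]

    def acquire(self, order_id, px, qty):
        order = self._free.pop()           # O(1), no new allocation
        order["order_id"], order["px"], order["qty"] = order_id, px, qty
        return order

    def release(self, order):
        order["order_id"] = None           # scrub and return to the pool
        self._free.append(order)
```

Because acquire/release only move references in and out of a pre-built free list, a matching engine built this way never triggers GC or allocator activity mid-session.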
Network Level
- Co-location: Place trading servers in the same data center as the exchange (ideally the same rack). Reduces round-trip time from milliseconds to microseconds.
- Cut-through switching: Network switches that forward frames as soon as the destination MAC is read (before the entire frame is received). Saves ~1-5 μs per switch hop.
- FPGA-based NICs: Process FIX messages and perform risk checks directly in the NIC hardware (Xilinx/Intel FPGAs). The message is matched before it even reaches the CPU.
- Equal-length cables: All participants' cables to the exchange are the same length, ensuring no one has a latency advantage from being physically closer. This is a regulatory fairness requirement.
The LMAX Disruptor Pattern
The LMAX Disruptor is a high-performance inter-thread messaging pattern, originally developed for the LMAX exchange. It replaces traditional queues with a pre-allocated ring buffer:
LMAX Disruptor Ring Buffer:
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ ← Pre-allocated slots
└───┴───┴───┴───┴───┴───┴───┴───┘
▲ ▲
│ │
Consumer Producer
(Matching Engine) (Sequencer)
Key properties:
✓ Lock-free — uses CAS (compare-and-swap) for coordination
✓ Pre-allocated — no GC pressure, no allocation latency
✓ Cache-line aligned — each slot on its own cache line
✓ Mechanical sympathy — designed for how CPUs actually work
Performance:
✓ 100M+ messages/sec single-threaded
✓ Sub-microsecond latency per message
✓ No contention, no locks, no syscalls
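A single-producer/single-consumer version of the ring buffer can be sketched as follows. This Python toy conveys the structure (pre-allocated slots, sequence counters, power-of-two masking); the real Disruptor depends on CAS operations and memory barriers that Python does not expose:

```python
# Sketch of a Disruptor-style SPSC ring buffer: pre-allocated slots,
# head/tail sequence counters instead of locks.
class RingBuffer:
    def __init__(self, size):
        assert size & (size - 1) == 0, "size must be a power of two"
        self._slots = [None] * size    # pre-allocated once, never resized
        self._mask = size - 1          # cheap modulo via bitmask
        self._head = 0                 # next sequence to publish
        self._tail = 0                 # next sequence to consume

    def publish(self, event):
        if self._head - self._tail == len(self._slots):
            return False               # buffer full: producer must wait
        self._slots[self._head & self._mask] = event
        self._head += 1                # consumer sees the slot after this
        return True

    def consume(self):
        if self._tail == self._head:
            return None                # buffer empty
        event = self._slots[self._tail & self._mask]
        self._tail += 1
        return event
```

The sequence counters grow monotonically and are masked into slot indices, so producer and consumer never contend on shared mutable slots at the same index while the buffer invariant holds.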
Failure Modes & Recovery
An exchange must handle failures gracefully without losing or duplicating orders.
Gateway Failure
If a gateway crashes, clients reconnect to another gateway. FIX session recovery (resend request) ensures no messages are lost. The sequencer has already logged all messages from the failed gateway, so no state is lost.
Sequencer Failure
The hot standby sequencer takes over. It has replicated all messages up to the last confirmed sequence number, and new messages receive the next sequence number, so the matching engine sees a continuous stream.
Sequencer Failover:
Primary: Seq# 10001 ... 10042 ... 10043 [CRASH]
Standby: Seq# 10001 ... 10042 (replicated)
Standby promotes to primary, continues from Seq# 10043
Gap risk: if 10043 was assigned but not replicated
Mitigation: wait for in-flight messages to drain before
accepting new orders (brief pause)
Matching Engine Failure
The matching engine is deterministic. Recovery is straightforward:
- Start a new matching engine instance
- Load the latest snapshot
- Replay events from the event log after the snapshot
- Order book is reconstructed to the exact pre-failure state
- Resume processing new events from the sequencer
If a hot standby was running (replaying the same event stream in real-time), failover is nearly instantaneous — the standby already has the correct state.
Network Partition
Unlike typical distributed systems, an exchange does not face the CAP theorem dilemma in its core matching logic. The matching engine is a single node — there is no distributed state to partition. Network partitions affect:
- Client connectivity: Clients lose connection to gateways. They reconnect when the partition heals. Unmatched orders remain in the book.
- Market data distribution: Subscribers miss updates. Sequence numbers detect gaps; recovery servers fill them.
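The gap detection that subscribers run on the market data feed is simple in principle. A minimal sketch, assuming packets carry consecutive sequence numbers and arrive in order on the multicast channel:

```python
# Sketch of market data gap detection: find missing sequence-number
# ranges so the subscriber can request a gap fill from a recovery server.
def detect_gaps(received_seqs, expected_start):
    gaps = []
    expected = expected_start
    for seq in received_seqs:
        if seq > expected:
            gaps.append((expected, seq - 1))  # inclusive missing range
        expected = seq + 1
    return gaps
```

Each returned range maps directly to a retransmission request against the recovery server.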
Complete Order Flow
Let's trace a single order through the entire system to solidify the architecture:
Trader at FirmA submits: BUY 100 AAPL @ $150.10 (Limit)
1. CLIENT → GATEWAY
FirmA's trading system sends a FIX New Order Single (35=D)
over a persistent TCP connection to Gateway #2.
Gateway validates: format ✓, symbol ✓, auth ✓
Time: ~5 μs (network) + ~2 μs (validation)
2. GATEWAY → SEQUENCER
Gateway forwards the validated order to the Sequencer.
Sequencer assigns Seq# 50042, persists to event log.
Time: ~1 μs (sequencing) + ~2 μs (persist)
3. SEQUENCER → RISK ENGINE
Risk engine checks: buying power ✓, position limit ✓,
price collar ✓ ($150.10 is within 5% of last trade $150.05)
Time: ~3 μs
4. SEQUENCER → MATCHING ENGINE (AAPL)
The AAPL matching engine receives the order.
Current order book:
Best Ask: $150.05 × 200 (OrderID=B7, 200 shares)
Crossing condition: Buy $150.10 >= Sell $150.05? YES!
Match: 100 shares @ $150.05 (resting order's price)
Buyer gets price improvement: saved $0.05/share = $5.00
Time: ~0.5 μs (matching logic)
5. MATCHING ENGINE → OUTPUTS
a) Trade report: BUY A1 vs SELL B7, 100@$150.05
b) Execution report to FirmA: FILLED 100@$150.05
c) Execution report to FirmB: PARTIAL FILL 100@$150.05
(B7 has 100 shares remaining)
d) Market data update: new best ask, last trade
Time: ~1 μs (message generation)
6. MARKET DATA PUBLISHER → ALL SUBSCRIBERS
Multicast UDP: Last trade AAPL 100@$150.05
Multicast UDP: New top of book
Time: ~1-5 μs (multicast)
TOTAL: ~15-20 μs gateway-to-gateway (co-located)
Well under the 1ms target.
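The crossing check in step 4 can be sketched as a few lines of matching logic. This toy models only the incoming-buy case, with asks held as a sorted list; fills execute at the resting order's price, as in the trace:

```python
# Sketch of price-time priority matching for an incoming limit buy.
# asks: list of (price, order_id, qty), sorted best (lowest) price first;
# ties at a price level are ordered by arrival time (earlier first).
def match_incoming_buy(buy_px, buy_qty, asks):
    fills = []
    while buy_qty > 0 and asks and asks[0][0] <= buy_px:   # crossing?
        px, oid, qty = asks[0]
        traded = min(buy_qty, qty)
        fills.append((oid, traded, px))      # execute at the resting price
        buy_qty -= traded
        if traded == qty:
            asks.pop(0)                      # resting order fully filled
        else:
            asks[0] = (px, oid, qty - traded)  # partial fill, order rests
    return fills, buy_qty   # leftover qty would rest on the bid side
```

Running it on the book from the trace (best ask $150.05 × 200) fills the 100-share buy at $150.05 and leaves 100 shares of B7 resting, matching steps 4 and 5 above.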
Scaling Considerations
Horizontal Scaling via Symbol Partitioning
The primary scaling mechanism is symbol partitioning. Each symbol has its own matching engine running on a dedicated CPU core. A modern server with 64 cores can run 64 independent matching engines.
Scaling Model:
Server 1 (64 cores): AAPL, GOOG, MSFT, ... (64 symbols)
Server 2 (64 cores): AMZN, TSLA, META, ... (64 symbols)
Server 3 (64 cores): JPM, BAC, WFC, ... (64 symbols)
...
For 8,000 listed symbols at ~64 symbols/server:
~125 servers for matching engines
Most symbols are low-volume. The top 50 symbols generate
~80% of order flow. High-volume symbols get dedicated servers.
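The routing layer in front of the matching engines reduces to a static symbol-to-engine table. A minimal sketch; real exchanges use curated assignments (with hot symbols on dedicated servers) rather than a plain hash:

```python
# Sketch of symbol partitioning: every order for a symbol is routed to the
# same single-threaded matching engine. Hash-based assignment is stable
# only within one process run; production systems use a static config.
def partition_symbols(symbols, num_engines):
    return {s: hash(s) % num_engines for s in symbols}

def route(symbol, table):
    return table[symbol]   # e.g. every AAPL order -> the same engine
```

The crucial property is not which engine a symbol lands on but that the mapping never changes mid-session, since moving a symbol between engines would break sequential processing for its orders.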
Gateway Scaling
Gateways scale horizontally. Add more gateways to handle more client connections. Load balancing is done at the connection level — each client session is assigned to one gateway.
Market Data Scaling
Market data is the highest-bandwidth component. During peak activity, an exchange generates millions of market data messages per second. Scaling strategies:
- Symbol-based multicast groups: Different symbols on different multicast addresses. Subscribers only join groups for symbols they care about.
- Tiered data: Level 1 on one channel, Level 2 on another, Level 3 on a third. Subscribers choose their tier.
- Conflation: For slow consumers, aggregate multiple updates into a single message (e.g., send the latest state rather than every intermediate change).
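Conflation is straightforward to express: keep only the latest state per symbol within a burst of updates. A sketch, assuming updates are (symbol, state) pairs:

```python
# Sketch of conflation for slow consumers: a burst of updates collapses
# to one message per symbol carrying only the latest state.
def conflate(updates):
    latest = {}
    for symbol, state in updates:   # later updates overwrite earlier ones
        latest[symbol] = state
    return list(latest.items())     # first-seen symbol order is preserved
```

A slow consumer thus receives the current top of book rather than a backlog of stale intermediate quotes, trading update granularity for bounded bandwidth.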
Key Design Decisions Summary
| Decision | Choice | Rationale |
|---|---|---|
| Matching engine threading | Single-threaded per symbol | Eliminates locks, maximizes cache locality, enables deterministic replay |
| Scaling strategy | Symbol partitioning | Orders for the same symbol must be processed sequentially; different symbols are independent |
| Sequencing | Central sequencer with replication | Single source of truth for event ordering; enables deterministic replay |
| Persistence | Event sourcing with snapshots | Complete audit trail + fast recovery |
| Market data transport | Multicast UDP | Low latency, fair (simultaneous) delivery, efficient for one-to-many |
| Client protocol | FIX (with binary variants for speed) | Industry standard, broad client support |
| Matching algorithm | Price-time priority | Fair, simple, widely understood; standard for equity markets |
Final Summary
Designing a stock exchange is one of the most constrained system design problems: ultra-low latency, perfect correctness, deterministic processing, and complete auditability — all at massive scale. The key architectural principles are:
- Single-threaded matching engines per symbol — avoid distributed state; a single thread with data in cache is faster than any distributed system.
- Central sequencer — create a single, globally ordered stream of events. This enables deterministic replay, hot standby failover, and regulatory audit.
- Event sourcing — log every event immutably. The order book is a derived view, reconstructable from the log at any point in time.
- Multicast UDP for market data — fair, low-latency, one-to-many distribution. Sequence numbers handle reliability.
- FIX protocol — the industry standard for order management. Binary variants (SBE, FAST) for ultra-low-latency paths.
- Hardware optimization — CPU pinning, kernel bypass networking, huge pages, NUMA awareness, busy-polling. Every microsecond counts.
- Risk management at every layer — pre-trade checks, circuit breakers, kill switches. The exchange must protect itself and the market from errant algorithms.