High Level Design Series· Data Storage · Part 5 of 70

Graph Databases (Neo4j)

April 2026 · 16 min read

Why Graph Databases Exist

Relational databases model the world as tables and foreign keys. When the relationships between entities are just as important as the entities themselves—and those relationships are deep, many-to-many, and recursive—SQL starts to struggle. Consider a social network: "Find friends of friends of Alice who live in Berlin and work at startups." In SQL, this requires multiple self-joins on a user table, each join multiplying query cost. In a graph database, it’s a single traversal.

A graph database stores data as nodes (entities) and edges (relationships). It is purpose-built for queries that follow connections: social networks, fraud rings, recommendation engines, knowledge graphs, network topologies, and dependency trees. The fundamental insight is that relationships are first-class citizens, stored physically alongside the data they connect—not computed at query time via expensive joins.

The JOIN cost problem: In a relational database, a 3-hop friend-of-friend query on 10 million users requires 3 self-joins. Each join scans an index, computes a hash or sort-merge, and produces intermediate result sets. On a properly indexed table, this query can take 30–60 seconds. In Neo4j with index-free adjacency, the same query completes in 2–5 milliseconds—because it only visits the nodes actually connected to Alice, not the entire table.

10×

Faster for 3+ hop queries

O(1)

Edge traversal cost

34B+

Nodes in Neo4j (max per db)

2ms

Typical 3-hop traversal

The Property Graph Model

The property graph model is the data model used by Neo4j, Amazon Neptune (in its openCypher mode), TigerGraph, and many other modern graph databases. It consists of three primitives:

Nodes (Vertices)

Nodes represent entities. Every node can have:

Labels — Zero or more type tags that categorize the node. A node can be :Person, :Employee, and :Investor simultaneously. Labels act like interfaces in programming: a node can implement multiple.
Properties — Key-value pairs stored directly on the node. Keys are strings; values are primitives (string, integer, float, boolean, date) or arrays of primitives. For example: {name: "Alice", age: 32, city: "Berlin"}.
Internal ID — A system-generated 64-bit identifier used for physical storage. You should never rely on these in application code; they can be recycled after node deletion.

Edges (Relationships)

Edges connect exactly two nodes (a start node and an end node) and are always directed. Key characteristics:

Type — Every edge has exactly one type (e.g., :FRIENDS_WITH, :WORKS_AT, :PURCHASED). This is mandatory—unlike node labels, relationships must have a type.
Properties — Edges can carry properties too. A :FRIENDS_WITH edge might store {since: 2019, closeness: 0.87}. A :PURCHASED edge might store {quantity: 3, total: 59.97, date: "2025-12-01"}.
Direction — Edges are stored with a direction, but they can always be traversed in either direction at query time. (Alice)-[:FRIENDS_WITH]->(Bob) can also be queried as (Bob)<-[:FRIENDS_WITH]-(Alice) or even direction-agnostic: (Alice)-[:FRIENDS_WITH]-(Bob).

Properties

Properties are the attribute layer. They attach to both nodes and edges, supporting these data types in Neo4j:

Type	Example	Notes
`String`	`"Alice"`	UTF-8, up to 2 GB
`Integer`	`42`	64-bit signed
`Float`	`3.14`	64-bit IEEE 754
`Boolean`	`true`	—
`Date`	`date("2025-12-01")`	ISO 8601
`DateTime`	`datetime("2025-12-01T09:30:00Z")`	Timezone-aware
`Duration`	`duration("P30D")`	ISO 8601 durations
`Point`	`point({latitude: 52.52, longitude: 13.40})`	2D/3D spatial
`List`	`["java", "python", "go"]`	Homogeneous arrays of any primitive

// Example: Creating a property graph in Cypher
CREATE (alice:Person:Developer {name: "Alice", age: 32, city: "Berlin"})
CREATE (bob:Person:Designer {name: "Bob", age: 28, city: "London"})
CREATE (carol:Person:Developer {name: "Carol", age: 35, city: "Berlin"})
CREATE (dave:Person:Manager {name: "Dave", age: 41, city: "NYC"})
CREATE (techCorp:Company {name: "TechCorp", industry: "SaaS", founded: 2015})
CREATE (designLab:Company {name: "DesignLab", industry: "Agency", founded: 2018})

CREATE (alice)-[:FRIENDS_WITH {since: 2019, closeness: 0.9}]->(bob)
CREATE (alice)-[:FRIENDS_WITH {since: 2020, closeness: 0.7}]->(carol)
CREATE (bob)-[:FRIENDS_WITH {since: 2021, closeness: 0.6}]->(dave)
CREATE (carol)-[:FRIENDS_WITH {since: 2018, closeness: 0.8}]->(dave)
CREATE (alice)-[:WORKS_AT {role: "Senior Engineer", since: 2021}]->(techCorp)
CREATE (bob)-[:WORKS_AT {role: "Lead Designer", since: 2020}]->(designLab)
CREATE (carol)-[:WORKS_AT {role: "Staff Engineer", since: 2019}]->(techCorp)
CREATE (dave)-[:WORKS_AT {role: "VP Engineering", since: 2017}]->(techCorp)

Index-Free Adjacency

Index-free adjacency is the core architectural innovation that makes graph databases fast for relationship traversals. In a relational database, to find all friends of Alice, you must:

Look up Alice’s user_id in the users table (index scan).
Scan the friendships table’s index for rows where user_id_1 = alice_id or user_id_2 = alice_id.
For each matching row, look up the friend’s record in the users table (another index scan).

Each step involves an index lookup—typically a B-tree traversal costing O(log n) where n is the total number of rows in the table. For 10 million users, that’s about 23 levels of B-tree per lookup.

In a graph database with index-free adjacency, every node contains a direct physical pointer (a file offset or memory address) to its adjacent nodes. Finding Alice’s friends means:

Go to Alice’s node record (one lookup).
Follow the linked list of her relationship records. Each relationship record contains a pointer to the neighbor node. No index is consulted—it’s a direct memory dereference.

The cost of traversing one edge is O(1)—it doesn’t matter whether the database has 1,000 nodes or 10 billion. This is why graph queries get relatively faster as the dataset grows: the SQL approach’s B-tree gets deeper (more I/O per join), while the graph’s pointer chase remains constant-time.

Neo4j’s Physical Storage Layout

Neo4j stores data in fixed-size record files on disk. Each record type has a predictable byte offset, enabling O(1) lookups by internal ID:

Store File	Record Size	Contents
`neostore.nodestore.db`	15 bytes	In-use flag, first relationship ID, first property ID, label pointer
`neostore.relationshipstore.db`	34 bytes	Start node ID, end node ID, relationship type, next/prev pointers for both nodes’ chains, first property ID
`neostore.propertystore.db`	41 bytes	Property key ID, value (inline for small values), next property pointer
`neostore.relationshiptypestore.db`	variable	Type name tokens (FRIENDS_WITH, WORKS_AT, etc.)

Because node records are exactly 15 bytes, finding node #42 is a simple calculation: offset = 42 × 15. The relationship chain is a doubly-linked list, so adding or removing relationships is O(1) at the head. This architecture also means that a traversal step is at most two disk seeks (one for the relationship record, one for the target node)—and in practice, hot data lives in the page cache, making traversals sub-microsecond.

Page cache tuning: Neo4j’s performance depends critically on keeping the graph in memory. The dbms.memory.pagecache.size setting should ideally be set to the total size of your store files. For a 50 GB graph, allocate at least 50 GB of page cache. Cold cache traversals are 100× slower because every pointer chase becomes a disk seek.

Cypher Query Language

Cypher is Neo4j’s declarative query language. Its syntax uses ASCII art to represent graph patterns: () for nodes, --> for relationships, and [] for relationship details. If SQL is designed for tabular data, Cypher is designed for connected data.

Basic Pattern Matching

// Find all of Alice's friends
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)
RETURN friend.name, friend.city

// Find friends of friends (2 hops)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*2]->(fof)
WHERE fof <> a
RETURN DISTINCT fof.name

// Variable-length paths (1 to 5 hops)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..5]->(distant)
WHERE distant <> a
RETURN DISTINCT distant.name, length(shortestPath(
  (a)-[:FRIENDS_WITH*]-(distant)
)) AS distance

CRUD Operations

// CREATE — add a node with label and properties
CREATE (eve:Person:Analyst {name: "Eve", age: 29, city: "Paris"})
RETURN eve

// CREATE — add a relationship
MATCH (a:Person {name: "Alice"}), (e:Person {name: "Eve"})
CREATE (a)-[:FRIENDS_WITH {since: 2025, closeness: 0.5}]->(e)

// MERGE — create if not exists (idempotent upsert)
MERGE (p:Person {name: "Frank"})
ON CREATE SET p.age = 27, p.city = "Tokyo", p.createdAt = datetime()
ON MATCH SET p.lastSeen = datetime()
RETURN p

// SET — update properties
MATCH (a:Person {name: "Alice"})
SET a.age = 33, a.skills = ["go", "rust", "python"]
RETURN a

// DELETE — remove a node and all its relationships
MATCH (e:Person {name: "Eve"})
DETACH DELETE e

Aggregation & Projection

// Count friends per person
MATCH (p:Person)-[:FRIENDS_WITH]-(friend)
RETURN p.name, count(friend) AS friendCount
ORDER BY friendCount DESC

// Collect friends into a list
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(friend)
RETURN a.name, collect(friend.name) AS friends

// Pattern comprehension (inline sub-query)
MATCH (p:Person)
RETURN p.name,
  [(p)-[:WORKS_AT]->(c) | c.name] AS companies,
  size([(p)-[:FRIENDS_WITH]-() | 1]) AS friendCount

Advanced: Shortest Path & Graph Algorithms

// Shortest path between two people
MATCH path = shortestPath(
  (a:Person {name: "Alice"})-[:FRIENDS_WITH*]-(d:Person {name: "Dave"})
)
RETURN [n IN nodes(path) | n.name] AS route,
       length(path) AS hops

// All shortest paths
MATCH path = allShortestPaths(
  (a:Person {name: "Alice"})-[:FRIENDS_WITH*]-(d:Person {name: "Dave"})
)
RETURN [n IN nodes(path) | n.name] AS route

// Weighted shortest path (Dijkstra via APOC)
CALL apoc.algo.dijkstra(
  startNode, endNode, 'FRIENDS_WITH', 'closeness'
) YIELD path, weight
RETURN path, weight

Indexing in Cypher

// Create a B-tree index on Person.name for fast lookups
CREATE INDEX person_name_idx FOR (p:Person) ON (p.name)

// Composite index
CREATE INDEX person_city_age_idx FOR (p:Person) ON (p.city, p.age)

// Full-text search index (backed by Lucene)
CREATE FULLTEXT INDEX person_search FOR (p:Person) ON EACH [p.name, p.city]

// Use full-text search
CALL db.index.fulltext.queryNodes("person_search", "Berlin~")
YIELD node, score
RETURN node.name, score
ORDER BY score DESC
LIMIT 10

// Show all indexes
SHOW INDEXES

Graph Traversal Algorithms

Graph databases aren’t just about storing and querying data—they also provide powerful built-in algorithms for analyzing the structure of your graph. Neo4j’s Graph Data Science (GDS) library offers 60+ algorithms across several categories:

Pathfinding Algorithms

Algorithm	Purpose	Time Complexity	Use Case
BFS (Breadth-First Search)	Find shortest unweighted path	O(V + E)	Social network degree of separation
Dijkstra	Shortest weighted path	O((V + E) log V)	Route optimization, lowest-cost path
A*	Heuristic shortest path	O(E) average	Geospatial routing with lat/lng heuristic
Yen’s K-Shortest	Top-K alternative routes	O(KV(V + E) log V)	Multi-path routing, backup routes
Random Walk	Probabilistic traversal	O(walk length)	Graph embedding, node2vec input

Centrality Algorithms

Centrality measures identify the most important nodes in a graph:

// PageRank — identify influential nodes
CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
LIMIT 10

// Betweenness Centrality — find bridge nodes
CALL gds.betweennessCentrality.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC

// Degree Centrality — most connected nodes
MATCH (p:Person)
RETURN p.name, size([(p)-[:FRIENDS_WITH]-() | 1]) AS degree
ORDER BY degree DESC

Community Detection

// Louvain — find communities/clusters
CALL gds.louvain.stream('myGraph')
YIELD nodeId, communityId
RETURN communityId,
       collect(gds.util.asNode(nodeId).name) AS members,
       count(*) AS size
ORDER BY size DESC

// Label Propagation — fast community detection
CALL gds.labelPropagation.stream('myGraph')
YIELD nodeId, communityId
RETURN communityId, count(*) AS size
ORDER BY size DESC

// Triangle Count — clustering coefficient
CALL gds.triangleCount.stream('myGraph')
YIELD nodeId, triangleCount
RETURN gds.util.asNode(nodeId).name, triangleCount

▶ Graph Traversal: Friends of Friends of Alice

Cypher: MATCH (a:Person {name:"Alice"})-[:FRIENDS_WITH*2]->(fof) RETURN fof
Step through the BFS traversal. Visited nodes turn green, active edges highlight.

Neo4j Architecture

Neo4j is the most mature and widely deployed graph database (first released in 2007, now at version 5.x). Its architecture reflects two decades of optimization for graph workloads.

Storage Engine

Neo4j uses a native graph storage engine—unlike some "graph databases" that layer a graph API on top of a relational or document store (which lose the index-free adjacency benefit). The storage layer consists of:

Record files — Fixed-size records for nodes (15B), relationships (34B), properties (41B), and string/array overflow stores. Fixed sizes enable O(1) lookup by ID: offset = id × recordSize.
Transaction log — Write-ahead log (WAL) for ACID compliance. All mutations are first written to the transaction log, then applied to the store files. This ensures crash recovery.
Page cache — Memory-mapped file cache that keeps hot graph data in RAM. The most critical performance setting is dbms.memory.pagecache.size.
ID allocation — Freed IDs from deleted nodes/relationships are recycled to prevent store file growth.

Query Engine

When a Cypher query arrives, it goes through:

Parsing — Tokenize and parse the Cypher string into an AST.
Semantic analysis — Validate labels, types, and property references.
Planning — The cost-based optimizer generates multiple execution plans and picks the cheapest one based on statistics (node counts, relationship counts, index selectivity).
Execution — The plan is executed as a pipeline of operators: NodeByLabelScan, Expand(All), Filter, ProduceResults, etc.

// View the execution plan of a query
EXPLAIN MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..3]->(fof)
RETURN DISTINCT fof.name

// View actual execution statistics
PROFILE MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..3]->(fof)
RETURN DISTINCT fof.name

Clustering & High Availability

Neo4j 5.x uses a Raft-based clustering architecture:

Component	Role	Details
Primary	Read + Write	Accepts all write transactions. Replicates via Raft consensus to secondaries. Automatic leader election on failure.
Secondary	Read-only	Receives replicated transactions asynchronously. Scales read throughput linearly. Eventually consistent (lag typically < 100ms).
Raft Consensus	Consistency	Majority-based commitment. A 3-node cluster tolerates 1 failure. Write acknowledged when majority confirms.

Neo4j Configuration for Production

# neo4j.conf — key production settings

# Memory
server.memory.heap.initial_size=8g
server.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g
dbms.memory.transaction.total.max=4g

# Network
server.default_listen_address=0.0.0.0
server.bolt.listen_address=:7687
server.http.listen_address=:7474

# Clustering (3-node Raft cluster)
dbms.cluster.discovery.endpoints=core1:5000,core2:5000,core3:5000
server.cluster.raft.leader_election_timeout=7s
server.cluster.catchup.upstream_strategy=leader-only

# Security
dbms.security.auth_enabled=true
dbms.security.procedures.unrestricted=gds.*,apoc.*

# Performance tuning
db.tx_log.rotation.retention_policy=2 days
dbms.logs.query.enabled=INFO
dbms.logs.query.threshold=1000ms
dbms.checkpoint.interval.time=15m

# GDS (Graph Data Science)
gds.enterprise.license_file=/etc/neo4j/gds.license

Memory rule of thumb: For a production Neo4j deployment: Heap = 8–16 GB (for query processing and GC). Page cache = size of your store files (for disk I/O avoidance). OS memory = remaining RAM for filesystem caches and the OS. On a 64 GB machine with a 30 GB graph: heap=12g, pagecache=32g, leaving 20 GB for the OS.

Amazon Neptune

Amazon Neptune is AWS’s managed graph database service. Unlike Neo4j, Neptune supports two query models: property graphs (via openCypher and Gremlin) and RDF graphs (via SPARQL). This makes it versatile but also means it doesn’t use native graph storage internally—it’s built on a distributed, replicated storage engine similar to Aurora.

Feature	Neo4j	Amazon Neptune
Storage model	Native graph (index-free adjacency)	Distributed log-structured storage
Query languages	Cypher	openCypher, Gremlin, SPARQL
Hosting	Self-hosted, Aura (cloud)	Fully managed AWS service
Max graph size	34B nodes (single instance)	~64 TB storage (auto-scaling)
Replication	Raft-based clustering	6-way replication across 3 AZs
Backups	Manual or scheduled	Continuous to S3, point-in-time restore
Graph algorithms	GDS library (60+ algorithms)	Neptune Analytics (newer, limited)
ACID transactions	Full ACID	Full ACID (serializable isolation)
Pricing model	License-based or Aura subscription	Pay-per-instance-hour + I/O + storage
Best for	Deep graph analytics, algorithm-heavy workloads	AWS-native apps, multi-model (RDF + property graph)

// Neptune openCypher example (same syntax as Neo4j for basic queries)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)-[:WORKS_AT]->(company)
RETURN friend.name, company.name

// Neptune Gremlin example (for comparison)
g.V().has('Person', 'name', 'Alice')
  .out('FRIENDS_WITH')
  .out('WORKS_AT')
  .valueMap('name')

Real-World Use Cases

1. Social Networks

Social graphs are the canonical use case. Facebook’s social graph has over 2 billion nodes (users) and hundreds of billions of edges (friendships, likes, comments, shares). Queries like "mutual friends," "people you may know," and "content from friends of friends" are all graph traversals.

// Mutual friends between Alice and Dave
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(mutual)-[:FRIENDS_WITH]-(d:Person {name: "Dave"})
RETURN mutual.name

// People You May Know (friends of friends, excluding existing friends)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(friend)-[:FRIENDS_WITH]-(suggestion)
WHERE suggestion <> a
  AND NOT (a)-[:FRIENDS_WITH]-(suggestion)
RETURN suggestion.name, count(friend) AS mutualFriends
ORDER BY mutualFriends DESC
LIMIT 10

2. Fraud Detection

Financial institutions use graph databases to detect fraud rings: groups of accounts connected by shared addresses, phone numbers, devices, or IP addresses that would be invisible in tabular data. A fraudster might use unique names and emails but share a physical address or device fingerprint with other synthetic identities.

// Detect potential fraud ring: accounts sharing phone + address
MATCH (a1:Account)-[:HAS_PHONE]->(phone)<-[:HAS_PHONE]-(a2:Account),
      (a1)-[:HAS_ADDRESS]->(addr)<-[:HAS_ADDRESS]-(a2)
WHERE a1 <> a2
WITH a1, a2, phone, addr
MATCH ring = (a1)-[:TRANSFERRED_TO*1..4]-(a2)
RETURN a1.accountId, a2.accountId, phone.number, addr.street,
       length(ring) AS ringSize
ORDER BY ringSize DESC

// Real-time fraud scoring: check transaction against known patterns
MATCH (sender:Account {id: $txSenderId})-[:TRANSFERRED_TO]->(receiver:Account)
WITH sender, count(receiver) AS txCountLast24h
WHERE txCountLast24h > 50
RETURN sender.id, txCountLast24h, "HIGH_VELOCITY" AS flag

Real-world impact: HSBC reduced fraud investigation time from days to seconds by migrating from SQL-based fraud detection to a Neo4j-powered graph. Traditional SQL queries scanning billions of rows for ring patterns took hours; Neo4j traversals completed in milliseconds because they only visit connected nodes.

3. Knowledge Graphs

Knowledge graphs represent structured information about the world: entities (people, places, concepts) and their relationships. Google’s Knowledge Graph powers the information panels you see in search results. Wikipedia’s Wikidata is an open knowledge graph with 100+ million items.

// Build a knowledge graph
CREATE (py:Technology {name: "Python", type: "Language", year: 1991})
CREATE (django:Technology {name: "Django", type: "Framework", year: 2005})
CREATE (flask:Technology {name: "Flask", type: "Framework", year: 2010})
CREATE (ml:Concept {name: "Machine Learning"})
CREATE (web:Concept {name: "Web Development"})

CREATE (django)-[:BUILT_WITH]->(py)
CREATE (flask)-[:BUILT_WITH]->(py)
CREATE (py)-[:USED_FOR]->(ml)
CREATE (py)-[:USED_FOR]->(web)
CREATE (django)-[:USED_FOR]->(web)

// Query: What can I build with Python?
MATCH (py:Technology {name: "Python"})-[:USED_FOR]->(concept)
OPTIONAL MATCH (framework)-[:BUILT_WITH]->(py)
RETURN concept.name AS useCase, collect(DISTINCT framework.name) AS frameworks

4. Recommendation Engines

Collaborative filtering via graph traversal: "Users who bought X also bought Y" becomes a 2-hop traversal through (:User)-[:PURCHASED]->(:Product)<-[:PURCHASED]-(:User)-[:PURCHASED]->(:Product). Graph-based recommendations capture transitive similarity that matrix factorization approaches miss.

// Collaborative filtering: recommend products
MATCH (user:User {id: $userId})-[:PURCHASED]->(product)<-[:PURCHASED]-(other)
      -[:PURCHASED]->(recommendation)
WHERE NOT (user)-[:PURCHASED]->(recommendation)
RETURN recommendation.name,
       count(DISTINCT other) AS score,
       collect(DISTINCT product.name)[..3] AS becauseYouBought
ORDER BY score DESC
LIMIT 10

// Content-based + collaborative hybrid
MATCH (user:User {id: $userId})-[:PURCHASED]->(p:Product)-[:IN_CATEGORY]->(cat)
WITH user, cat, count(p) AS affinity
ORDER BY affinity DESC
LIMIT 3
MATCH (cat)<-[:IN_CATEGORY]-(rec:Product)
WHERE NOT (user)-[:PURCHASED]->(rec)
  AND rec.rating >= 4.0
RETURN rec.name, rec.rating, cat.name AS category
ORDER BY rec.rating DESC
LIMIT 10

When Graph > Relational

Graph databases are not a universal replacement for relational databases. They excel at specific workloads and add overhead for others. Here’s a precise decision framework:

Choose Graph When…	Stick With Relational When…
Queries involve 3+ JOIN levels	Queries are primarily single-table or 1-2 JOINs
Relationships are many-to-many and recursive	Relationships are simple foreign keys (1:N)
Schema evolves frequently (schema-optional)	Schema is stable and well-defined upfront
You need real-time traversal (< 10ms)	You need complex aggregations (SUM, GROUP BY)
Graph algorithms (PageRank, community detection) are needed	You need strong transactional guarantees across many tables
The data is inherently a network/graph	The data is inherently tabular (orders, invoices, inventory)

Performance Comparison: Social Network Query

Let’s benchmark a specific query: “Find friends of friends of Alice who live in Berlin.”

SQL (PostgreSQL) — 3 Self-JOINs

SELECT DISTINCT u3.name
FROM users u1
JOIN friendships f1
  ON u1.id = f1.user_id_1
JOIN users u2
  ON f1.user_id_2 = u2.id
JOIN friendships f2
  ON u2.id = f2.user_id_1
JOIN users u3
  ON f2.user_id_2 = u3.id
WHERE u1.name = 'Alice'
  AND u3.city = 'Berlin'
  AND u3.id != u1.id;

10M users, 50M friendships:
Execution time: ~32 seconds
Rows scanned: ~4.2 million
Temp disk usage: ~180 MB
Plan: 3× Hash Join + Seq Scan

Cypher (Neo4j) — Single Traversal

MATCH (a:Person {name: "Alice"})
      -[:FRIENDS_WITH*2]->(fof)
WHERE fof.city = "Berlin"
  AND fof <> a
RETURN DISTINCT fof.name

10M users, 50M friendships:
Execution time: ~3 ms
Nodes visited: ~1,200 (only Alice’s neighborhood)
Memory: ~8 KB
Plan: NodeIndexSeek → Expand → Filter

The SQL query must scan the friendships table index twice (once per hop), joining against the users table each time. Even with optimal indexes, the hash join intermediate results grow exponentially with hop count. The graph query starts at Alice’s node and follows exactly her edges—the total work scales with Alice’s neighborhood size, not the total graph size.

The scaling cliff: At 2 hops, SQL is 10× slower. At 3 hops, it’s 100× slower. At 4 hops, most SQL databases hit query timeouts (>30 sec). At 5 hops, it becomes computationally infeasible without pre-materialization. Graph databases maintain near-constant performance through 5–6 hops because each hop only visits the immediate neighbors—there’s no global scan.

▶ SQL JOINs vs Graph Traversal

Left: SQL needs 3 self-joins scanning millions of rows. Right: Graph traverses only the connected neighborhood. Watch the difference.

Practical Tips & Anti-Patterns

Data Modeling Best Practices

Favor relationships over properties. If you find yourself storing an array of IDs in a node property, that should be a set of relationships instead. Relationships enable traversal; property arrays don’t.
Use specific relationship types. :PURCHASED, :REVIEWED, :WISHLISTED are better than a generic :INTERACTED_WITH with a type property. Specific types allow the query planner to prune irrelevant edges early.
Model time with relationship properties. Instead of creating :FRIENDS_2019, :FRIENDS_2020, etc., use a single :FRIENDS_WITH {since: 2019}. Cypher can filter on relationship properties efficiently.
Avoid super nodes. A node with millions of relationships (like a “Global” category node) becomes a bottleneck. Shard it: :Category_US, :Category_EU, etc., or use a fanout pattern with intermediate nodes.
Index your lookup properties. If you always start queries with MATCH (p:Person {name: $name}), create an index on Person.name. Without it, Neo4j does a label scan (O(n) over all Person nodes).

Common Anti-Patterns

Anti-Pattern	Problem	Fix
Unbounded variable-length paths `*`	Traverses entire graph, causes OOM	Always bound: `*1..5`
Loading CSVs without `USING PERIODIC COMMIT`	Huge transaction fills heap	Use `CALL { ... } IN TRANSACTIONS OF 10000 ROWS`
Cartesian products in MATCH	Two disconnected patterns create N×M combinations	Connect patterns or use `WITH` to pipeline
Treating graph DB as document store	Huge JSON blobs in node properties defeat purpose	Decompose into nodes + relationships
No indexes on starting nodes	Every query does a full label scan	Create indexes on frequently queried properties

System Design Interview Guide

In a system design interview, graph databases come up in specific scenarios. Here’s when to reach for one and how to discuss it:

When to Propose a Graph Database

“Design a social network” — The friend graph, feed generation based on connections, mutual friends, and “people you may know” are all graph-native problems.
“Design a fraud detection system” — Detecting ring patterns (shared addresses, devices, phone numbers) across accounts is a graph pattern matching problem.
“Design a recommendation engine” — Collaborative filtering via user–product–user traversal. Also: content-based via knowledge graph.
“Design a knowledge base/search” — Entity relationships, ontology, concept hierarchies.
“Design a network topology monitor” — Routers, switches, links, failure propagation.

Key Talking Points

// Your cheat sheet for the interview:

1. Data model: Property graph (nodes + edges + properties)
2. Key advantage: Index-free adjacency → O(1) edge traversal
3. Query language: Cypher (pattern matching with ASCII art syntax)
4. Storage: Neo4j uses fixed-size records (15B node, 34B relationship)
5. Scaling: Vertical + read replicas (graph sharding is HARD)
6. When NOT to use: Heavy aggregations, simple CRUD, tabular reports
7. Alternatives: Neptune (AWS managed), TigerGraph (distributed), 
                 JanusGraph (open source + pluggable backend)
8. Graph sharding challenge: Cutting a graph into partitions 
   inevitably cuts edges, requiring cross-partition traversals

The Sharding Challenge

Sharding a graph database is fundamentally harder than sharding a key-value store or relational database. The reason: graph queries cross partition boundaries. If Alice is on Shard 1 and her friend Bob is on Shard 2, a friend-of-friend query must make a network hop between shards. This is why most graph databases (Neo4j included) favor vertical scaling + read replicas over horizontal sharding.

Approaches to graph partitioning:

Hash partitioning — Distributes nodes evenly but randomly, maximizing cross-partition edges. Poor for traversals.
Community-based partitioning — Uses community detection to place tightly-connected subgraphs on the same partition. Better but requires re-partitioning as the graph evolves.
TigerGraph’s approach — Stores every edge on both partitions (the start node’s and the end node’s). Doubles storage but enables local traversals. TigerGraph claims linear horizontal scalability with this approach.

Key Takeaways

Property graph model — Nodes with labels and properties, edges with types and properties. Simple but powerful enough to model any domain.
Index-free adjacency — The killer feature. Each node physically points to its neighbors. Edge traversal is O(1) regardless of graph size.
Cypher — Declarative, pattern-based query language. ASCII art syntax makes graph patterns intuitive: (a)-[:REL]->(b).
Performance — Graph databases are 10–1000× faster than SQL for multi-hop traversals. The advantage grows with hop count.
Neo4j architecture — Native graph storage, WAL for ACID, Raft clustering for HA, and the GDS library for 60+ graph algorithms.
Amazon Neptune — Managed alternative on AWS. Supports openCypher, Gremlin, and SPARQL. Better for operational workloads on AWS.
Use when — Social networks, fraud detection, knowledge graphs, recommendations, network topology, dependency analysis.
Don’t use when — Simple CRUD, heavy aggregations, tabular reporting, or when relationships are simple 1:N foreign keys.

Quick Reference: Cypher Cheat Sheet

Operation	Cypher
Create node	`CREATE (n:Label {prop: val})`
Create relationship	`CREATE (a)-[:TYPE {prop: val}]->(b)`
Match pattern	`MATCH (a)-[:TYPE]->(b) RETURN b`
Variable hops	`MATCH (a)-[:TYPE*2..5]->(b)`
Shortest path	`shortestPath((a)-[*]-(b))`
Update	`MATCH (n) SET n.prop = val`
Upsert	`MERGE (n:Label {key: val})`
Delete	`MATCH (n) DETACH DELETE n`
Aggregation	`RETURN count(n), collect(n.name)`
Create index	`CREATE INDEX FOR (n:Label) ON (n.prop)`
Execution plan	`EXPLAIN / PROFILE`