← All Posts
High Level Design Series· Data Storage · Part 5 of 70

Graph Databases (Neo4j)

Why Graph Databases Exist

Relational databases model the world as tables and foreign keys. When the relationships between entities are just as important as the entities themselves—and those relationships are deep, many-to-many, and recursive—SQL starts to struggle. Consider a social network: "Find friends of friends of Alice who live in Berlin and work at startups." In SQL, this requires multiple self-joins on a user table, each join multiplying query cost. In a graph database, it’s a single traversal.

A graph database stores data as nodes (entities) and edges (relationships). It is purpose-built for queries that follow connections: social networks, fraud rings, recommendation engines, knowledge graphs, network topologies, and dependency trees. The fundamental insight is that relationships are first-class citizens, stored physically alongside the data they connect—not computed at query time via expensive joins.

The JOIN cost problem: In a relational database, a 3-hop friend-of-friend query on 10 million users requires 3 self-joins. Each join scans an index, computes a hash or sort-merge, and produces intermediate result sets. On a properly indexed table, this query can take 30–60 seconds. In Neo4j with index-free adjacency, the same query completes in 2–5 milliseconds—because it only visits the nodes actually connected to Alice, not the entire table.
10×
Faster for 3+ hop queries
O(1)
Edge traversal cost
34B+
Nodes in Neo4j (max per db)
2ms
Typical 3-hop traversal

The Property Graph Model

The property graph model is the data model used by Neo4j, Amazon Neptune (in its openCypher mode), TigerGraph, and many other modern graph databases. It consists of three primitives:

Nodes (Vertices)

Nodes represent entities. Every node can have:

Edges (Relationships)

Edges connect exactly two nodes (a start node and an end node) and are always directed. Key characteristics:

Properties

Properties are the attribute layer. They attach to both nodes and edges, supporting these data types in Neo4j:

TypeExampleNotes
String"Alice"UTF-8, up to 2 GB
Integer4264-bit signed
Float3.1464-bit IEEE 754
Booleantrue
Datedate("2025-12-01")ISO 8601
DateTimedatetime("2025-12-01T09:30:00Z")Timezone-aware
Durationduration("P30D")ISO 8601 durations
Pointpoint({latitude: 52.52, longitude: 13.40})2D/3D spatial
List["java", "python", "go"]Homogeneous arrays of any primitive
// Example: Creating a property graph in Cypher
CREATE (alice:Person:Developer {name: "Alice", age: 32, city: "Berlin"})
CREATE (bob:Person:Designer {name: "Bob", age: 28, city: "London"})
CREATE (carol:Person:Developer {name: "Carol", age: 35, city: "Berlin"})
CREATE (dave:Person:Manager {name: "Dave", age: 41, city: "NYC"})
CREATE (techCorp:Company {name: "TechCorp", industry: "SaaS", founded: 2015})
CREATE (designLab:Company {name: "DesignLab", industry: "Agency", founded: 2018})

CREATE (alice)-[:FRIENDS_WITH {since: 2019, closeness: 0.9}]->(bob)
CREATE (alice)-[:FRIENDS_WITH {since: 2020, closeness: 0.7}]->(carol)
CREATE (bob)-[:FRIENDS_WITH {since: 2021, closeness: 0.6}]->(dave)
CREATE (carol)-[:FRIENDS_WITH {since: 2018, closeness: 0.8}]->(dave)
CREATE (alice)-[:WORKS_AT {role: "Senior Engineer", since: 2021}]->(techCorp)
CREATE (bob)-[:WORKS_AT {role: "Lead Designer", since: 2020}]->(designLab)
CREATE (carol)-[:WORKS_AT {role: "Staff Engineer", since: 2019}]->(techCorp)
CREATE (dave)-[:WORKS_AT {role: "VP Engineering", since: 2017}]->(techCorp)

Index-Free Adjacency

Index-free adjacency is the core architectural innovation that makes graph databases fast for relationship traversals. In a relational database, to find all friends of Alice, you must:

  1. Look up Alice’s user_id in the users table (index scan).
  2. Scan the friendships table’s index for rows where user_id_1 = alice_id or user_id_2 = alice_id.
  3. For each matching row, look up the friend’s record in the users table (another index scan).

Each step involves an index lookup—typically a B-tree traversal costing O(log n) where n is the total number of rows in the table. For 10 million users, that’s about 23 levels of B-tree per lookup.

In a graph database with index-free adjacency, every node contains a direct physical pointer (a file offset or memory address) to its adjacent nodes. Finding Alice’s friends means:

  1. Go to Alice’s node record (one lookup).
  2. Follow the linked list of her relationship records. Each relationship record contains a pointer to the neighbor node. No index is consulted—it’s a direct memory dereference.

The cost of traversing one edge is O(1)—it doesn’t matter whether the database has 1,000 nodes or 10 billion. This is why graph queries get relatively faster as the dataset grows: the SQL approach’s B-tree gets deeper (more I/O per join), while the graph’s pointer chase remains constant-time.

Neo4j’s Physical Storage Layout

Neo4j stores data in fixed-size record files on disk. Each record type has a predictable byte offset, enabling O(1) lookups by internal ID:

Store FileRecord SizeContents
neostore.nodestore.db15 bytesIn-use flag, first relationship ID, first property ID, label pointer
neostore.relationshipstore.db34 bytesStart node ID, end node ID, relationship type, next/prev pointers for both nodes’ chains, first property ID
neostore.propertystore.db41 bytesProperty key ID, value (inline for small values), next property pointer
neostore.relationshiptypestore.dbvariableType name tokens (FRIENDS_WITH, WORKS_AT, etc.)

Because node records are exactly 15 bytes, finding node #42 is a simple calculation: offset = 42 × 15. The relationship chain is a doubly-linked list, so adding or removing relationships is O(1) at the head. This architecture also means that a traversal step is at most two disk seeks (one for the relationship record, one for the target node)—and in practice, hot data lives in the page cache, making traversals sub-microsecond.

Page cache tuning: Neo4j’s performance depends critically on keeping the graph in memory. The dbms.memory.pagecache.size setting should ideally be set to the total size of your store files. For a 50 GB graph, allocate at least 50 GB of page cache. Cold cache traversals are 100× slower because every pointer chase becomes a disk seek.

Cypher Query Language

Cypher is Neo4j’s declarative query language. Its syntax uses ASCII art to represent graph patterns: () for nodes, --> for relationships, and [] for relationship details. If SQL is designed for tabular data, Cypher is designed for connected data.

Basic Pattern Matching

// Find all of Alice's friends
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)
RETURN friend.name, friend.city

// Find friends of friends (2 hops)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*2]->(fof)
WHERE fof <> a
RETURN DISTINCT fof.name

// Variable-length paths (1 to 5 hops)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..5]->(distant)
WHERE distant <> a
RETURN DISTINCT distant.name, length(shortestPath(
  (a)-[:FRIENDS_WITH*]-(distant)
)) AS distance

CRUD Operations

// CREATE — add a node with label and properties
CREATE (eve:Person:Analyst {name: "Eve", age: 29, city: "Paris"})
RETURN eve

// CREATE — add a relationship
MATCH (a:Person {name: "Alice"}), (e:Person {name: "Eve"})
CREATE (a)-[:FRIENDS_WITH {since: 2025, closeness: 0.5}]->(e)

// MERGE — create if not exists (idempotent upsert)
MERGE (p:Person {name: "Frank"})
ON CREATE SET p.age = 27, p.city = "Tokyo", p.createdAt = datetime()
ON MATCH SET p.lastSeen = datetime()
RETURN p

// SET — update properties
MATCH (a:Person {name: "Alice"})
SET a.age = 33, a.skills = ["go", "rust", "python"]
RETURN a

// DELETE — remove a node and all its relationships
MATCH (e:Person {name: "Eve"})
DETACH DELETE e

Aggregation & Projection

// Count friends per person
MATCH (p:Person)-[:FRIENDS_WITH]-(friend)
RETURN p.name, count(friend) AS friendCount
ORDER BY friendCount DESC

// Collect friends into a list
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(friend)
RETURN a.name, collect(friend.name) AS friends

// Pattern comprehension (inline sub-query)
MATCH (p:Person)
RETURN p.name,
  [(p)-[:WORKS_AT]->(c) | c.name] AS companies,
  size([(p)-[:FRIENDS_WITH]-() | 1]) AS friendCount

Advanced: Shortest Path & Graph Algorithms

// Shortest path between two people
MATCH path = shortestPath(
  (a:Person {name: "Alice"})-[:FRIENDS_WITH*]-(d:Person {name: "Dave"})
)
RETURN [n IN nodes(path) | n.name] AS route,
       length(path) AS hops

// All shortest paths
MATCH path = allShortestPaths(
  (a:Person {name: "Alice"})-[:FRIENDS_WITH*]-(d:Person {name: "Dave"})
)
RETURN [n IN nodes(path) | n.name] AS route

// Weighted shortest path (Dijkstra via APOC)
CALL apoc.algo.dijkstra(
  startNode, endNode, 'FRIENDS_WITH', 'closeness'
) YIELD path, weight
RETURN path, weight

Indexing in Cypher

// Create a B-tree index on Person.name for fast lookups
CREATE INDEX person_name_idx FOR (p:Person) ON (p.name)

// Composite index
CREATE INDEX person_city_age_idx FOR (p:Person) ON (p.city, p.age)

// Full-text search index (backed by Lucene)
CREATE FULLTEXT INDEX person_search FOR (p:Person) ON EACH [p.name, p.city]

// Use full-text search
CALL db.index.fulltext.queryNodes("person_search", "Berlin~")
YIELD node, score
RETURN node.name, score
ORDER BY score DESC
LIMIT 10

// Show all indexes
SHOW INDEXES

Graph Traversal Algorithms

Graph databases aren’t just about storing and querying data—they also provide powerful built-in algorithms for analyzing the structure of your graph. Neo4j’s Graph Data Science (GDS) library offers 60+ algorithms across several categories:

Pathfinding Algorithms

AlgorithmPurposeTime ComplexityUse Case
BFS (Breadth-First Search)Find shortest unweighted pathO(V + E)Social network degree of separation
DijkstraShortest weighted pathO((V + E) log V)Route optimization, lowest-cost path
A*Heuristic shortest pathO(E) averageGeospatial routing with lat/lng heuristic
Yen’s K-ShortestTop-K alternative routesO(KV(V + E) log V)Multi-path routing, backup routes
Random WalkProbabilistic traversalO(walk length)Graph embedding, node2vec input

Centrality Algorithms

Centrality measures identify the most important nodes in a graph:

// PageRank — identify influential nodes
CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
LIMIT 10

// Betweenness Centrality — find bridge nodes
CALL gds.betweennessCentrality.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC

// Degree Centrality — most connected nodes
MATCH (p:Person)
RETURN p.name, size([(p)-[:FRIENDS_WITH]-() | 1]) AS degree
ORDER BY degree DESC

Community Detection

// Louvain — find communities/clusters
CALL gds.louvain.stream('myGraph')
YIELD nodeId, communityId
RETURN communityId,
       collect(gds.util.asNode(nodeId).name) AS members,
       count(*) AS size
ORDER BY size DESC

// Label Propagation — fast community detection
CALL gds.labelPropagation.stream('myGraph')
YIELD nodeId, communityId
RETURN communityId, count(*) AS size
ORDER BY size DESC

// Triangle Count — clustering coefficient
CALL gds.triangleCount.stream('myGraph')
YIELD nodeId, triangleCount
RETURN gds.util.asNode(nodeId).name, triangleCount

▶ Graph Traversal: Friends of Friends of Alice

Cypher: MATCH (a:Person {name:"Alice"})-[:FRIENDS_WITH*2]->(fof) RETURN fof
Step through the BFS traversal. Visited nodes turn green, active edges highlight.

Neo4j Architecture

Neo4j is the most mature and widely deployed graph database (first released in 2007, now at version 5.x). Its architecture reflects two decades of optimization for graph workloads.

Storage Engine

Neo4j uses a native graph storage engine—unlike some "graph databases" that layer a graph API on top of a relational or document store (which lose the index-free adjacency benefit). The storage layer consists of:

Query Engine

When a Cypher query arrives, it goes through:

  1. Parsing — Tokenize and parse the Cypher string into an AST.
  2. Semantic analysis — Validate labels, types, and property references.
  3. Planning — The cost-based optimizer generates multiple execution plans and picks the cheapest one based on statistics (node counts, relationship counts, index selectivity).
  4. Execution — The plan is executed as a pipeline of operators: NodeByLabelScan, Expand(All), Filter, ProduceResults, etc.
// View the execution plan of a query
EXPLAIN MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..3]->(fof)
RETURN DISTINCT fof.name

// View actual execution statistics
PROFILE MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..3]->(fof)
RETURN DISTINCT fof.name

Clustering & High Availability

Neo4j 5.x uses a Raft-based clustering architecture:

ComponentRoleDetails
PrimaryRead + WriteAccepts all write transactions. Replicates via Raft consensus to secondaries. Automatic leader election on failure.
SecondaryRead-onlyReceives replicated transactions asynchronously. Scales read throughput linearly. Eventually consistent (lag typically < 100ms).
Raft ConsensusConsistencyMajority-based commitment. A 3-node cluster tolerates 1 failure. Write acknowledged when majority confirms.

Neo4j Configuration for Production

# neo4j.conf — key production settings

# Memory
server.memory.heap.initial_size=8g
server.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g
dbms.memory.transaction.total.max=4g

# Network
server.default_listen_address=0.0.0.0
server.bolt.listen_address=:7687
server.http.listen_address=:7474

# Clustering (3-node Raft cluster)
dbms.cluster.discovery.endpoints=core1:5000,core2:5000,core3:5000
server.cluster.raft.leader_election_timeout=7s
server.cluster.catchup.upstream_strategy=leader-only

# Security
dbms.security.auth_enabled=true
dbms.security.procedures.unrestricted=gds.*,apoc.*

# Performance tuning
db.tx_log.rotation.retention_policy=2 days
dbms.logs.query.enabled=INFO
dbms.logs.query.threshold=1000ms
dbms.checkpoint.interval.time=15m

# GDS (Graph Data Science)
gds.enterprise.license_file=/etc/neo4j/gds.license
Memory rule of thumb: For a production Neo4j deployment: Heap = 8–16 GB (for query processing and GC). Page cache = size of your store files (for disk I/O avoidance). OS memory = remaining RAM for filesystem caches and the OS. On a 64 GB machine with a 30 GB graph: heap=12g, pagecache=32g, leaving 20 GB for the OS.

Amazon Neptune

Amazon Neptune is AWS’s managed graph database service. Unlike Neo4j, Neptune supports two query models: property graphs (via openCypher and Gremlin) and RDF graphs (via SPARQL). This makes it versatile but also means it doesn’t use native graph storage internally—it’s built on a distributed, replicated storage engine similar to Aurora.

FeatureNeo4jAmazon Neptune
Storage modelNative graph (index-free adjacency)Distributed log-structured storage
Query languagesCypheropenCypher, Gremlin, SPARQL
HostingSelf-hosted, Aura (cloud)Fully managed AWS service
Max graph size34B nodes (single instance)~64 TB storage (auto-scaling)
ReplicationRaft-based clustering6-way replication across 3 AZs
BackupsManual or scheduledContinuous to S3, point-in-time restore
Graph algorithmsGDS library (60+ algorithms)Neptune Analytics (newer, limited)
ACID transactionsFull ACIDFull ACID (serializable isolation)
Pricing modelLicense-based or Aura subscriptionPay-per-instance-hour + I/O + storage
Best forDeep graph analytics, algorithm-heavy workloadsAWS-native apps, multi-model (RDF + property graph)
// Neptune openCypher example (same syntax as Neo4j for basic queries)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)-[:WORKS_AT]->(company)
RETURN friend.name, company.name

// Neptune Gremlin example (for comparison)
g.V().has('Person', 'name', 'Alice')
  .out('FRIENDS_WITH')
  .out('WORKS_AT')
  .valueMap('name')

Real-World Use Cases

1. Social Networks

Social graphs are the canonical use case. Facebook’s social graph has over 2 billion nodes (users) and hundreds of billions of edges (friendships, likes, comments, shares). Queries like "mutual friends," "people you may know," and "content from friends of friends" are all graph traversals.

// Mutual friends between Alice and Dave
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(mutual)-[:FRIENDS_WITH]-(d:Person {name: "Dave"})
RETURN mutual.name

// People You May Know (friends of friends, excluding existing friends)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(friend)-[:FRIENDS_WITH]-(suggestion)
WHERE suggestion <> a
  AND NOT (a)-[:FRIENDS_WITH]-(suggestion)
RETURN suggestion.name, count(friend) AS mutualFriends
ORDER BY mutualFriends DESC
LIMIT 10

2. Fraud Detection

Financial institutions use graph databases to detect fraud rings: groups of accounts connected by shared addresses, phone numbers, devices, or IP addresses that would be invisible in tabular data. A fraudster might use unique names and emails but share a physical address or device fingerprint with other synthetic identities.

// Detect potential fraud ring: accounts sharing phone + address
MATCH (a1:Account)-[:HAS_PHONE]->(phone)<-[:HAS_PHONE]-(a2:Account),
      (a1)-[:HAS_ADDRESS]->(addr)<-[:HAS_ADDRESS]-(a2)
WHERE a1 <> a2
WITH a1, a2, phone, addr
MATCH ring = (a1)-[:TRANSFERRED_TO*1..4]-(a2)
RETURN a1.accountId, a2.accountId, phone.number, addr.street,
       length(ring) AS ringSize
ORDER BY ringSize DESC

// Real-time fraud scoring: check transaction against known patterns
MATCH (sender:Account {id: $txSenderId})-[:TRANSFERRED_TO]->(receiver:Account)
WITH sender, count(receiver) AS txCountLast24h
WHERE txCountLast24h > 50
RETURN sender.id, txCountLast24h, "HIGH_VELOCITY" AS flag
Real-world impact: HSBC reduced fraud investigation time from days to seconds by migrating from SQL-based fraud detection to a Neo4j-powered graph. Traditional SQL queries scanning billions of rows for ring patterns took hours; Neo4j traversals completed in milliseconds because they only visit connected nodes.

3. Knowledge Graphs

Knowledge graphs represent structured information about the world: entities (people, places, concepts) and their relationships. Google’s Knowledge Graph powers the information panels you see in search results. Wikipedia’s Wikidata is an open knowledge graph with 100+ million items.

// Build a knowledge graph
CREATE (py:Technology {name: "Python", type: "Language", year: 1991})
CREATE (django:Technology {name: "Django", type: "Framework", year: 2005})
CREATE (flask:Technology {name: "Flask", type: "Framework", year: 2010})
CREATE (ml:Concept {name: "Machine Learning"})
CREATE (web:Concept {name: "Web Development"})

CREATE (django)-[:BUILT_WITH]->(py)
CREATE (flask)-[:BUILT_WITH]->(py)
CREATE (py)-[:USED_FOR]->(ml)
CREATE (py)-[:USED_FOR]->(web)
CREATE (django)-[:USED_FOR]->(web)

// Query: What can I build with Python?
MATCH (py:Technology {name: "Python"})-[:USED_FOR]->(concept)
OPTIONAL MATCH (framework)-[:BUILT_WITH]->(py)
RETURN concept.name AS useCase, collect(DISTINCT framework.name) AS frameworks

4. Recommendation Engines

Collaborative filtering via graph traversal: "Users who bought X also bought Y" becomes a 2-hop traversal through (:User)-[:PURCHASED]->(:Product)<-[:PURCHASED]-(:User)-[:PURCHASED]->(:Product). Graph-based recommendations capture transitive similarity that matrix factorization approaches miss.

// Collaborative filtering: recommend products
MATCH (user:User {id: $userId})-[:PURCHASED]->(product)<-[:PURCHASED]-(other)
      -[:PURCHASED]->(recommendation)
WHERE NOT (user)-[:PURCHASED]->(recommendation)
RETURN recommendation.name,
       count(DISTINCT other) AS score,
       collect(DISTINCT product.name)[..3] AS becauseYouBought
ORDER BY score DESC
LIMIT 10

// Content-based + collaborative hybrid
MATCH (user:User {id: $userId})-[:PURCHASED]->(p:Product)-[:IN_CATEGORY]->(cat)
WITH user, cat, count(p) AS affinity
ORDER BY affinity DESC
LIMIT 3
MATCH (cat)<-[:IN_CATEGORY]-(rec:Product)
WHERE NOT (user)-[:PURCHASED]->(rec)
  AND rec.rating >= 4.0
RETURN rec.name, rec.rating, cat.name AS category
ORDER BY rec.rating DESC
LIMIT 10

When Graph > Relational

Graph databases are not a universal replacement for relational databases. They excel at specific workloads and add overhead for others. Here’s a precise decision framework:

Choose Graph When…Stick With Relational When…
Queries involve 3+ JOIN levelsQueries are primarily single-table or 1-2 JOINs
Relationships are many-to-many and recursiveRelationships are simple foreign keys (1:N)
Schema evolves frequently (schema-optional)Schema is stable and well-defined upfront
You need real-time traversal (< 10ms)You need complex aggregations (SUM, GROUP BY)
Graph algorithms (PageRank, community detection) are neededYou need strong transactional guarantees across many tables
The data is inherently a network/graphThe data is inherently tabular (orders, invoices, inventory)

Performance Comparison: Social Network Query

Let’s benchmark a specific query: “Find friends of friends of Alice who live in Berlin.”

SQL (PostgreSQL) — 3 Self-JOINs
SELECT DISTINCT u3.name
FROM users u1
JOIN friendships f1
  ON u1.id = f1.user_id_1
JOIN users u2
  ON f1.user_id_2 = u2.id
JOIN friendships f2
  ON u2.id = f2.user_id_1
JOIN users u3
  ON f2.user_id_2 = u3.id
WHERE u1.name = 'Alice'
  AND u3.city = 'Berlin'
  AND u3.id != u1.id;

10M users, 50M friendships:
Execution time: ~32 seconds
Rows scanned: ~4.2 million
Temp disk usage: ~180 MB
Plan: 3× Hash Join + Seq Scan

Cypher (Neo4j) — Single Traversal
MATCH (a:Person {name: "Alice"})
      -[:FRIENDS_WITH*2]->(fof)
WHERE fof.city = "Berlin"
  AND fof <> a
RETURN DISTINCT fof.name

10M users, 50M friendships:
Execution time: ~3 ms
Nodes visited: ~1,200 (only Alice’s neighborhood)
Memory: ~8 KB
Plan: NodeIndexSeek → Expand → Filter

The SQL query must scan the friendships table index twice (once per hop), joining against the users table each time. Even with optimal indexes, the hash join intermediate results grow exponentially with hop count. The graph query starts at Alice’s node and follows exactly her edges—the total work scales with Alice’s neighborhood size, not the total graph size.

The scaling cliff: At 2 hops, SQL is 10× slower. At 3 hops, it’s 100× slower. At 4 hops, most SQL databases hit query timeouts (>30 sec). At 5 hops, it becomes computationally infeasible without pre-materialization. Graph databases maintain near-constant performance through 5–6 hops because each hop only visits the immediate neighbors—there’s no global scan.

▶ SQL JOINs vs Graph Traversal

Left: SQL needs 3 self-joins scanning millions of rows. Right: Graph traverses only the connected neighborhood. Watch the difference.

Practical Tips & Anti-Patterns

Data Modeling Best Practices

Common Anti-Patterns

Anti-PatternProblemFix
Unbounded variable-length paths *Traverses entire graph, causes OOMAlways bound: *1..5
Loading CSVs without USING PERIODIC COMMITHuge transaction fills heapUse CALL { ... } IN TRANSACTIONS OF 10000 ROWS
Cartesian products in MATCHTwo disconnected patterns create N×M combinationsConnect patterns or use WITH to pipeline
Treating graph DB as document storeHuge JSON blobs in node properties defeat purposeDecompose into nodes + relationships
No indexes on starting nodesEvery query does a full label scanCreate indexes on frequently queried properties

System Design Interview Guide

In a system design interview, graph databases come up in specific scenarios. Here’s when to reach for one and how to discuss it:

When to Propose a Graph Database

Key Talking Points

// Your cheat sheet for the interview:

1. Data model: Property graph (nodes + edges + properties)
2. Key advantage: Index-free adjacency → O(1) edge traversal
3. Query language: Cypher (pattern matching with ASCII art syntax)
4. Storage: Neo4j uses fixed-size records (15B node, 34B relationship)
5. Scaling: Vertical + read replicas (graph sharding is HARD)
6. When NOT to use: Heavy aggregations, simple CRUD, tabular reports
7. Alternatives: Neptune (AWS managed), TigerGraph (distributed), 
                 JanusGraph (open source + pluggable backend)
8. Graph sharding challenge: Cutting a graph into partitions 
   inevitably cuts edges, requiring cross-partition traversals

The Sharding Challenge

Sharding a graph database is fundamentally harder than sharding a key-value store or relational database. The reason: graph queries cross partition boundaries. If Alice is on Shard 1 and her friend Bob is on Shard 2, a friend-of-friend query must make a network hop between shards. This is why most graph databases (Neo4j included) favor vertical scaling + read replicas over horizontal sharding.

Approaches to graph partitioning:

Key Takeaways

  1. Property graph model — Nodes with labels and properties, edges with types and properties. Simple but powerful enough to model any domain.
  2. Index-free adjacency — The killer feature. Each node physically points to its neighbors. Edge traversal is O(1) regardless of graph size.
  3. Cypher — Declarative, pattern-based query language. ASCII art syntax makes graph patterns intuitive: (a)-[:REL]->(b).
  4. Performance — Graph databases are 10–1000× faster than SQL for multi-hop traversals. The advantage grows with hop count.
  5. Neo4j architecture — Native graph storage, WAL for ACID, Raft clustering for HA, and the GDS library for 60+ graph algorithms.
  6. Amazon Neptune — Managed alternative on AWS. Supports openCypher, Gremlin, and SPARQL. Better for operational workloads on AWS.
  7. Use when — Social networks, fraud detection, knowledge graphs, recommendations, network topology, dependency analysis.
  8. Don’t use when — Simple CRUD, heavy aggregations, tabular reporting, or when relationships are simple 1:N foreign keys.
Quick Reference: Cypher Cheat Sheet
OperationCypher
Create nodeCREATE (n:Label {prop: val})
Create relationshipCREATE (a)-[:TYPE {prop: val}]->(b)
Match patternMATCH (a)-[:TYPE]->(b) RETURN b
Variable hopsMATCH (a)-[:TYPE*2..5]->(b)
Shortest pathshortestPath((a)-[*]-(b))
UpdateMATCH (n) SET n.prop = val
UpsertMERGE (n:Label {key: val})
DeleteMATCH (n) DETACH DELETE n
AggregationRETURN count(n), collect(n.name)
Create indexCREATE INDEX FOR (n:Label) ON (n.prop)
Execution planEXPLAIN / PROFILE