Graph Databases (Neo4j)
Why Graph Databases Exist
Relational databases model the world as tables and foreign keys. When the relationships between entities are just as important as the entities themselves—and those relationships are deep, many-to-many, and recursive—SQL starts to struggle. Consider a social network: "Find friends of friends of Alice who live in Berlin and work at startups." In SQL, this requires multiple self-joins on a user table, each join multiplying query cost. In a graph database, it’s a single traversal.
A graph database stores data as nodes (entities) and edges (relationships). It is purpose-built for queries that follow connections: social networks, fraud rings, recommendation engines, knowledge graphs, network topologies, and dependency trees. The fundamental insight is that relationships are first-class citizens, stored physically alongside the data they connect—not computed at query time via expensive joins.
The Property Graph Model
The property graph model is the data model used by Neo4j, Amazon Neptune (in its openCypher mode), TigerGraph, and many other modern graph databases. It consists of three primitives:
Nodes (Vertices)
Nodes represent entities. Every node can have:
- Labels — Zero or more type tags that categorize the node. A node can be
:Person,:Employee, and:Investorsimultaneously. Labels act like interfaces in programming: a node can implement multiple. - Properties — Key-value pairs stored directly on the node. Keys are strings; values are primitives (string, integer, float, boolean, date) or arrays of primitives. For example:
{name: "Alice", age: 32, city: "Berlin"}. - Internal ID — A system-generated 64-bit identifier used for physical storage. You should never rely on these in application code; they can be recycled after node deletion.
Edges (Relationships)
Edges connect exactly two nodes (a start node and an end node) and are always directed. Key characteristics:
- Type — Every edge has exactly one type (e.g.,
:FRIENDS_WITH,:WORKS_AT,:PURCHASED). This is mandatory—unlike node labels, relationships must have a type. - Properties — Edges can carry properties too. A
:FRIENDS_WITHedge might store{since: 2019, closeness: 0.87}. A:PURCHASEDedge might store{quantity: 3, total: 59.97, date: "2025-12-01"}. - Direction — Edges are stored with a direction, but they can always be traversed in either direction at query time.
(Alice)-[:FRIENDS_WITH]->(Bob)can also be queried as(Bob)<-[:FRIENDS_WITH]-(Alice)or even direction-agnostic:(Alice)-[:FRIENDS_WITH]-(Bob).
Properties
Properties are the attribute layer. They attach to both nodes and edges, supporting these data types in Neo4j:
| Type | Example | Notes |
|---|---|---|
String | "Alice" | UTF-8, up to 2 GB |
Integer | 42 | 64-bit signed |
Float | 3.14 | 64-bit IEEE 754 |
Boolean | true | — |
Date | date("2025-12-01") | ISO 8601 |
DateTime | datetime("2025-12-01T09:30:00Z") | Timezone-aware |
Duration | duration("P30D") | ISO 8601 durations |
Point | point({latitude: 52.52, longitude: 13.40}) | 2D/3D spatial |
List | ["java", "python", "go"] | Homogeneous arrays of any primitive |
// Example: Creating a property graph in Cypher
CREATE (alice:Person:Developer {name: "Alice", age: 32, city: "Berlin"})
CREATE (bob:Person:Designer {name: "Bob", age: 28, city: "London"})
CREATE (carol:Person:Developer {name: "Carol", age: 35, city: "Berlin"})
CREATE (dave:Person:Manager {name: "Dave", age: 41, city: "NYC"})
CREATE (techCorp:Company {name: "TechCorp", industry: "SaaS", founded: 2015})
CREATE (designLab:Company {name: "DesignLab", industry: "Agency", founded: 2018})
CREATE (alice)-[:FRIENDS_WITH {since: 2019, closeness: 0.9}]->(bob)
CREATE (alice)-[:FRIENDS_WITH {since: 2020, closeness: 0.7}]->(carol)
CREATE (bob)-[:FRIENDS_WITH {since: 2021, closeness: 0.6}]->(dave)
CREATE (carol)-[:FRIENDS_WITH {since: 2018, closeness: 0.8}]->(dave)
CREATE (alice)-[:WORKS_AT {role: "Senior Engineer", since: 2021}]->(techCorp)
CREATE (bob)-[:WORKS_AT {role: "Lead Designer", since: 2020}]->(designLab)
CREATE (carol)-[:WORKS_AT {role: "Staff Engineer", since: 2019}]->(techCorp)
CREATE (dave)-[:WORKS_AT {role: "VP Engineering", since: 2017}]->(techCorp)
Index-Free Adjacency
Index-free adjacency is the core architectural innovation that makes graph databases fast for relationship traversals. In a relational database, to find all friends of Alice, you must:
- Look up Alice’s
user_idin the users table (index scan). - Scan the
friendshipstable’s index for rows whereuser_id_1 = alice_idoruser_id_2 = alice_id. - For each matching row, look up the friend’s record in the users table (another index scan).
Each step involves an index lookup—typically a B-tree traversal costing O(log n) where n is the total number of rows in the table. For 10 million users, that’s about 23 levels of B-tree per lookup.
In a graph database with index-free adjacency, every node contains a direct physical pointer (a file offset or memory address) to its adjacent nodes. Finding Alice’s friends means:
- Go to Alice’s node record (one lookup).
- Follow the linked list of her relationship records. Each relationship record contains a pointer to the neighbor node. No index is consulted—it’s a direct memory dereference.
The cost of traversing one edge is O(1)—it doesn’t matter whether the database has 1,000 nodes or 10 billion. This is why graph queries get relatively faster as the dataset grows: the SQL approach’s B-tree gets deeper (more I/O per join), while the graph’s pointer chase remains constant-time.
Neo4j’s Physical Storage Layout
Neo4j stores data in fixed-size record files on disk. Each record type has a predictable byte offset, enabling O(1) lookups by internal ID:
| Store File | Record Size | Contents |
|---|---|---|
neostore.nodestore.db | 15 bytes | In-use flag, first relationship ID, first property ID, label pointer |
neostore.relationshipstore.db | 34 bytes | Start node ID, end node ID, relationship type, next/prev pointers for both nodes’ chains, first property ID |
neostore.propertystore.db | 41 bytes | Property key ID, value (inline for small values), next property pointer |
neostore.relationshiptypestore.db | variable | Type name tokens (FRIENDS_WITH, WORKS_AT, etc.) |
Because node records are exactly 15 bytes, finding node #42 is a simple calculation: offset = 42 × 15. The relationship chain is a doubly-linked list, so adding or removing relationships is O(1) at the head. This architecture also means that a traversal step is at most two disk seeks (one for the relationship record, one for the target node)—and in practice, hot data lives in the page cache, making traversals sub-microsecond.
dbms.memory.pagecache.size setting should ideally be set to the total size of your store files. For a 50 GB graph, allocate at least 50 GB of page cache. Cold cache traversals are 100× slower because every pointer chase becomes a disk seek.
Cypher Query Language
Cypher is Neo4j’s declarative query language. Its syntax uses ASCII art to represent graph patterns: () for nodes, --> for relationships, and [] for relationship details. If SQL is designed for tabular data, Cypher is designed for connected data.
Basic Pattern Matching
// Find all of Alice's friends
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)
RETURN friend.name, friend.city
// Find friends of friends (2 hops)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*2]->(fof)
WHERE fof <> a
RETURN DISTINCT fof.name
// Variable-length paths (1 to 5 hops)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..5]->(distant)
WHERE distant <> a
RETURN DISTINCT distant.name, length(shortestPath(
(a)-[:FRIENDS_WITH*]-(distant)
)) AS distance
CRUD Operations
// CREATE — add a node with label and properties
CREATE (eve:Person:Analyst {name: "Eve", age: 29, city: "Paris"})
RETURN eve
// CREATE — add a relationship
MATCH (a:Person {name: "Alice"}), (e:Person {name: "Eve"})
CREATE (a)-[:FRIENDS_WITH {since: 2025, closeness: 0.5}]->(e)
// MERGE — create if not exists (idempotent upsert)
MERGE (p:Person {name: "Frank"})
ON CREATE SET p.age = 27, p.city = "Tokyo", p.createdAt = datetime()
ON MATCH SET p.lastSeen = datetime()
RETURN p
// SET — update properties
MATCH (a:Person {name: "Alice"})
SET a.age = 33, a.skills = ["go", "rust", "python"]
RETURN a
// DELETE — remove a node and all its relationships
MATCH (e:Person {name: "Eve"})
DETACH DELETE e
Aggregation & Projection
// Count friends per person
MATCH (p:Person)-[:FRIENDS_WITH]-(friend)
RETURN p.name, count(friend) AS friendCount
ORDER BY friendCount DESC
// Collect friends into a list
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(friend)
RETURN a.name, collect(friend.name) AS friends
// Pattern comprehension (inline sub-query)
MATCH (p:Person)
RETURN p.name,
[(p)-[:WORKS_AT]->(c) | c.name] AS companies,
size([(p)-[:FRIENDS_WITH]-() | 1]) AS friendCount
Advanced: Shortest Path & Graph Algorithms
// Shortest path between two people
MATCH path = shortestPath(
(a:Person {name: "Alice"})-[:FRIENDS_WITH*]-(d:Person {name: "Dave"})
)
RETURN [n IN nodes(path) | n.name] AS route,
length(path) AS hops
// All shortest paths
MATCH path = allShortestPaths(
(a:Person {name: "Alice"})-[:FRIENDS_WITH*]-(d:Person {name: "Dave"})
)
RETURN [n IN nodes(path) | n.name] AS route
// Weighted shortest path (Dijkstra via APOC)
CALL apoc.algo.dijkstra(
startNode, endNode, 'FRIENDS_WITH', 'closeness'
) YIELD path, weight
RETURN path, weight
Indexing in Cypher
// Create a B-tree index on Person.name for fast lookups
CREATE INDEX person_name_idx FOR (p:Person) ON (p.name)
// Composite index
CREATE INDEX person_city_age_idx FOR (p:Person) ON (p.city, p.age)
// Full-text search index (backed by Lucene)
CREATE FULLTEXT INDEX person_search FOR (p:Person) ON EACH [p.name, p.city]
// Use full-text search
CALL db.index.fulltext.queryNodes("person_search", "Berlin~")
YIELD node, score
RETURN node.name, score
ORDER BY score DESC
LIMIT 10
// Show all indexes
SHOW INDEXES
Graph Traversal Algorithms
Graph databases aren’t just about storing and querying data—they also provide powerful built-in algorithms for analyzing the structure of your graph. Neo4j’s Graph Data Science (GDS) library offers 60+ algorithms across several categories:
Pathfinding Algorithms
| Algorithm | Purpose | Time Complexity | Use Case |
|---|---|---|---|
| BFS (Breadth-First Search) | Find shortest unweighted path | O(V + E) | Social network degree of separation |
| Dijkstra | Shortest weighted path | O((V + E) log V) | Route optimization, lowest-cost path |
| A* | Heuristic shortest path | O(E) average | Geospatial routing with lat/lng heuristic |
| Yen’s K-Shortest | Top-K alternative routes | O(KV(V + E) log V) | Multi-path routing, backup routes |
| Random Walk | Probabilistic traversal | O(walk length) | Graph embedding, node2vec input |
Centrality Algorithms
Centrality measures identify the most important nodes in a graph:
// PageRank — identify influential nodes
CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
LIMIT 10
// Betweenness Centrality — find bridge nodes
CALL gds.betweennessCentrality.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
// Degree Centrality — most connected nodes
MATCH (p:Person)
RETURN p.name, size([(p)-[:FRIENDS_WITH]-() | 1]) AS degree
ORDER BY degree DESC
Community Detection
// Louvain — find communities/clusters
CALL gds.louvain.stream('myGraph')
YIELD nodeId, communityId
RETURN communityId,
collect(gds.util.asNode(nodeId).name) AS members,
count(*) AS size
ORDER BY size DESC
// Label Propagation — fast community detection
CALL gds.labelPropagation.stream('myGraph')
YIELD nodeId, communityId
RETURN communityId, count(*) AS size
ORDER BY size DESC
// Triangle Count — clustering coefficient
CALL gds.triangleCount.stream('myGraph')
YIELD nodeId, triangleCount
RETURN gds.util.asNode(nodeId).name, triangleCount
▶ Graph Traversal: Friends of Friends of Alice
Cypher: MATCH (a:Person {name:"Alice"})-[:FRIENDS_WITH*2]->(fof) RETURN fof
Step through the BFS traversal. Visited nodes turn green, active edges highlight.
Neo4j Architecture
Neo4j is the most mature and widely deployed graph database (first released in 2007, now at version 5.x). Its architecture reflects two decades of optimization for graph workloads.
Storage Engine
Neo4j uses a native graph storage engine—unlike some "graph databases" that layer a graph API on top of a relational or document store (which lose the index-free adjacency benefit). The storage layer consists of:
- Record files — Fixed-size records for nodes (15B), relationships (34B), properties (41B), and string/array overflow stores. Fixed sizes enable O(1) lookup by ID:
offset = id × recordSize. - Transaction log — Write-ahead log (WAL) for ACID compliance. All mutations are first written to the transaction log, then applied to the store files. This ensures crash recovery.
- Page cache — Memory-mapped file cache that keeps hot graph data in RAM. The most critical performance setting is
dbms.memory.pagecache.size. - ID allocation — Freed IDs from deleted nodes/relationships are recycled to prevent store file growth.
Query Engine
When a Cypher query arrives, it goes through:
- Parsing — Tokenize and parse the Cypher string into an AST.
- Semantic analysis — Validate labels, types, and property references.
- Planning — The cost-based optimizer generates multiple execution plans and picks the cheapest one based on statistics (node counts, relationship counts, index selectivity).
- Execution — The plan is executed as a pipeline of operators:
NodeByLabelScan,Expand(All),Filter,ProduceResults, etc.
// View the execution plan of a query
EXPLAIN MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..3]->(fof)
RETURN DISTINCT fof.name
// View actual execution statistics
PROFILE MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH*1..3]->(fof)
RETURN DISTINCT fof.name
Clustering & High Availability
Neo4j 5.x uses a Raft-based clustering architecture:
| Component | Role | Details |
|---|---|---|
| Primary | Read + Write | Accepts all write transactions. Replicates via Raft consensus to secondaries. Automatic leader election on failure. |
| Secondary | Read-only | Receives replicated transactions asynchronously. Scales read throughput linearly. Eventually consistent (lag typically < 100ms). |
| Raft Consensus | Consistency | Majority-based commitment. A 3-node cluster tolerates 1 failure. Write acknowledged when majority confirms. |
Neo4j Configuration for Production
# neo4j.conf — key production settings
# Memory
server.memory.heap.initial_size=8g
server.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g
dbms.memory.transaction.total.max=4g
# Network
server.default_listen_address=0.0.0.0
server.bolt.listen_address=:7687
server.http.listen_address=:7474
# Clustering (3-node Raft cluster)
dbms.cluster.discovery.endpoints=core1:5000,core2:5000,core3:5000
server.cluster.raft.leader_election_timeout=7s
server.cluster.catchup.upstream_strategy=leader-only
# Security
dbms.security.auth_enabled=true
dbms.security.procedures.unrestricted=gds.*,apoc.*
# Performance tuning
db.tx_log.rotation.retention_policy=2 days
dbms.logs.query.enabled=INFO
dbms.logs.query.threshold=1000ms
dbms.checkpoint.interval.time=15m
# GDS (Graph Data Science)
gds.enterprise.license_file=/etc/neo4j/gds.license
Amazon Neptune
Amazon Neptune is AWS’s managed graph database service. Unlike Neo4j, Neptune supports two query models: property graphs (via openCypher and Gremlin) and RDF graphs (via SPARQL). This makes it versatile but also means it doesn’t use native graph storage internally—it’s built on a distributed, replicated storage engine similar to Aurora.
| Feature | Neo4j | Amazon Neptune |
|---|---|---|
| Storage model | Native graph (index-free adjacency) | Distributed log-structured storage |
| Query languages | Cypher | openCypher, Gremlin, SPARQL |
| Hosting | Self-hosted, Aura (cloud) | Fully managed AWS service |
| Max graph size | 34B nodes (single instance) | ~64 TB storage (auto-scaling) |
| Replication | Raft-based clustering | 6-way replication across 3 AZs |
| Backups | Manual or scheduled | Continuous to S3, point-in-time restore |
| Graph algorithms | GDS library (60+ algorithms) | Neptune Analytics (newer, limited) |
| ACID transactions | Full ACID | Full ACID (serializable isolation) |
| Pricing model | License-based or Aura subscription | Pay-per-instance-hour + I/O + storage |
| Best for | Deep graph analytics, algorithm-heavy workloads | AWS-native apps, multi-model (RDF + property graph) |
// Neptune openCypher example (same syntax as Neo4j for basic queries)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)-[:WORKS_AT]->(company)
RETURN friend.name, company.name
// Neptune Gremlin example (for comparison)
g.V().has('Person', 'name', 'Alice')
.out('FRIENDS_WITH')
.out('WORKS_AT')
.valueMap('name')
Real-World Use Cases
1. Social Networks
Social graphs are the canonical use case. Facebook’s social graph has over 2 billion nodes (users) and hundreds of billions of edges (friendships, likes, comments, shares). Queries like "mutual friends," "people you may know," and "content from friends of friends" are all graph traversals.
// Mutual friends between Alice and Dave
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(mutual)-[:FRIENDS_WITH]-(d:Person {name: "Dave"})
RETURN mutual.name
// People You May Know (friends of friends, excluding existing friends)
MATCH (a:Person {name: "Alice"})-[:FRIENDS_WITH]-(friend)-[:FRIENDS_WITH]-(suggestion)
WHERE suggestion <> a
AND NOT (a)-[:FRIENDS_WITH]-(suggestion)
RETURN suggestion.name, count(friend) AS mutualFriends
ORDER BY mutualFriends DESC
LIMIT 10
2. Fraud Detection
Financial institutions use graph databases to detect fraud rings: groups of accounts connected by shared addresses, phone numbers, devices, or IP addresses that would be invisible in tabular data. A fraudster might use unique names and emails but share a physical address or device fingerprint with other synthetic identities.
// Detect potential fraud ring: accounts sharing phone + address
MATCH (a1:Account)-[:HAS_PHONE]->(phone)<-[:HAS_PHONE]-(a2:Account),
(a1)-[:HAS_ADDRESS]->(addr)<-[:HAS_ADDRESS]-(a2)
WHERE a1 <> a2
WITH a1, a2, phone, addr
MATCH ring = (a1)-[:TRANSFERRED_TO*1..4]-(a2)
RETURN a1.accountId, a2.accountId, phone.number, addr.street,
length(ring) AS ringSize
ORDER BY ringSize DESC
// Real-time fraud scoring: check transaction against known patterns
MATCH (sender:Account {id: $txSenderId})-[:TRANSFERRED_TO]->(receiver:Account)
WITH sender, count(receiver) AS txCountLast24h
WHERE txCountLast24h > 50
RETURN sender.id, txCountLast24h, "HIGH_VELOCITY" AS flag
3. Knowledge Graphs
Knowledge graphs represent structured information about the world: entities (people, places, concepts) and their relationships. Google’s Knowledge Graph powers the information panels you see in search results. Wikipedia’s Wikidata is an open knowledge graph with 100+ million items.
// Build a knowledge graph
CREATE (py:Technology {name: "Python", type: "Language", year: 1991})
CREATE (django:Technology {name: "Django", type: "Framework", year: 2005})
CREATE (flask:Technology {name: "Flask", type: "Framework", year: 2010})
CREATE (ml:Concept {name: "Machine Learning"})
CREATE (web:Concept {name: "Web Development"})
CREATE (django)-[:BUILT_WITH]->(py)
CREATE (flask)-[:BUILT_WITH]->(py)
CREATE (py)-[:USED_FOR]->(ml)
CREATE (py)-[:USED_FOR]->(web)
CREATE (django)-[:USED_FOR]->(web)
// Query: What can I build with Python?
MATCH (py:Technology {name: "Python"})-[:USED_FOR]->(concept)
OPTIONAL MATCH (framework)-[:BUILT_WITH]->(py)
RETURN concept.name AS useCase, collect(DISTINCT framework.name) AS frameworks
4. Recommendation Engines
Collaborative filtering via graph traversal: "Users who bought X also bought Y" becomes a 2-hop traversal through (:User)-[:PURCHASED]->(:Product)<-[:PURCHASED]-(:User)-[:PURCHASED]->(:Product). Graph-based recommendations capture transitive similarity that matrix factorization approaches miss.
// Collaborative filtering: recommend products
MATCH (user:User {id: $userId})-[:PURCHASED]->(product)<-[:PURCHASED]-(other)
-[:PURCHASED]->(recommendation)
WHERE NOT (user)-[:PURCHASED]->(recommendation)
RETURN recommendation.name,
count(DISTINCT other) AS score,
collect(DISTINCT product.name)[..3] AS becauseYouBought
ORDER BY score DESC
LIMIT 10
// Content-based + collaborative hybrid
MATCH (user:User {id: $userId})-[:PURCHASED]->(p:Product)-[:IN_CATEGORY]->(cat)
WITH user, cat, count(p) AS affinity
ORDER BY affinity DESC
LIMIT 3
MATCH (cat)<-[:IN_CATEGORY]-(rec:Product)
WHERE NOT (user)-[:PURCHASED]->(rec)
AND rec.rating >= 4.0
RETURN rec.name, rec.rating, cat.name AS category
ORDER BY rec.rating DESC
LIMIT 10
When Graph > Relational
Graph databases are not a universal replacement for relational databases. They excel at specific workloads and add overhead for others. Here’s a precise decision framework:
| Choose Graph When… | Stick With Relational When… |
|---|---|
| Queries involve 3+ JOIN levels | Queries are primarily single-table or 1-2 JOINs |
| Relationships are many-to-many and recursive | Relationships are simple foreign keys (1:N) |
| Schema evolves frequently (schema-optional) | Schema is stable and well-defined upfront |
| You need real-time traversal (< 10ms) | You need complex aggregations (SUM, GROUP BY) |
| Graph algorithms (PageRank, community detection) are needed | You need strong transactional guarantees across many tables |
| The data is inherently a network/graph | The data is inherently tabular (orders, invoices, inventory) |
Performance Comparison: Social Network Query
Let’s benchmark a specific query: “Find friends of friends of Alice who live in Berlin.”
SQL (PostgreSQL) — 3 Self-JOINs
SELECT DISTINCT u3.name
FROM users u1
JOIN friendships f1
ON u1.id = f1.user_id_1
JOIN users u2
ON f1.user_id_2 = u2.id
JOIN friendships f2
ON u2.id = f2.user_id_1
JOIN users u3
ON f2.user_id_2 = u3.id
WHERE u1.name = 'Alice'
AND u3.city = 'Berlin'
AND u3.id != u1.id;
10M users, 50M friendships:
Execution time: ~32 seconds
Rows scanned: ~4.2 million
Temp disk usage: ~180 MB
Plan: 3× Hash Join + Seq Scan
Cypher (Neo4j) — Single Traversal
MATCH (a:Person {name: "Alice"})
-[:FRIENDS_WITH*2]->(fof)
WHERE fof.city = "Berlin"
AND fof <> a
RETURN DISTINCT fof.name
10M users, 50M friendships:
Execution time: ~3 ms
Nodes visited: ~1,200 (only Alice’s neighborhood)
Memory: ~8 KB
Plan: NodeIndexSeek → Expand → Filter
The SQL query must scan the friendships table index twice (once per hop), joining against the users table each time. Even with optimal indexes, the hash join intermediate results grow exponentially with hop count. The graph query starts at Alice’s node and follows exactly her edges—the total work scales with Alice’s neighborhood size, not the total graph size.
▶ SQL JOINs vs Graph Traversal
Left: SQL needs 3 self-joins scanning millions of rows. Right: Graph traverses only the connected neighborhood. Watch the difference.
Practical Tips & Anti-Patterns
Data Modeling Best Practices
- Favor relationships over properties. If you find yourself storing an array of IDs in a node property, that should be a set of relationships instead. Relationships enable traversal; property arrays don’t.
- Use specific relationship types.
:PURCHASED,:REVIEWED,:WISHLISTEDare better than a generic:INTERACTED_WITHwith atypeproperty. Specific types allow the query planner to prune irrelevant edges early. - Model time with relationship properties. Instead of creating
:FRIENDS_2019,:FRIENDS_2020, etc., use a single:FRIENDS_WITH {since: 2019}. Cypher can filter on relationship properties efficiently. - Avoid super nodes. A node with millions of relationships (like a “Global” category node) becomes a bottleneck. Shard it:
:Category_US,:Category_EU, etc., or use a fanout pattern with intermediate nodes. - Index your lookup properties. If you always start queries with
MATCH (p:Person {name: $name}), create an index onPerson.name. Without it, Neo4j does a label scan (O(n) over all Person nodes).
Common Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
Unbounded variable-length paths * | Traverses entire graph, causes OOM | Always bound: *1..5 |
Loading CSVs without USING PERIODIC COMMIT | Huge transaction fills heap | Use CALL { ... } IN TRANSACTIONS OF 10000 ROWS |
| Cartesian products in MATCH | Two disconnected patterns create N×M combinations | Connect patterns or use WITH to pipeline |
| Treating graph DB as document store | Huge JSON blobs in node properties defeat purpose | Decompose into nodes + relationships |
| No indexes on starting nodes | Every query does a full label scan | Create indexes on frequently queried properties |
System Design Interview Guide
In a system design interview, graph databases come up in specific scenarios. Here’s when to reach for one and how to discuss it:
When to Propose a Graph Database
- “Design a social network” — The friend graph, feed generation based on connections, mutual friends, and “people you may know” are all graph-native problems.
- “Design a fraud detection system” — Detecting ring patterns (shared addresses, devices, phone numbers) across accounts is a graph pattern matching problem.
- “Design a recommendation engine” — Collaborative filtering via user–product–user traversal. Also: content-based via knowledge graph.
- “Design a knowledge base/search” — Entity relationships, ontology, concept hierarchies.
- “Design a network topology monitor” — Routers, switches, links, failure propagation.
Key Talking Points
// Your cheat sheet for the interview:
1. Data model: Property graph (nodes + edges + properties)
2. Key advantage: Index-free adjacency → O(1) edge traversal
3. Query language: Cypher (pattern matching with ASCII art syntax)
4. Storage: Neo4j uses fixed-size records (15B node, 34B relationship)
5. Scaling: Vertical + read replicas (graph sharding is HARD)
6. When NOT to use: Heavy aggregations, simple CRUD, tabular reports
7. Alternatives: Neptune (AWS managed), TigerGraph (distributed),
JanusGraph (open source + pluggable backend)
8. Graph sharding challenge: Cutting a graph into partitions
inevitably cuts edges, requiring cross-partition traversals
The Sharding Challenge
Sharding a graph database is fundamentally harder than sharding a key-value store or relational database. The reason: graph queries cross partition boundaries. If Alice is on Shard 1 and her friend Bob is on Shard 2, a friend-of-friend query must make a network hop between shards. This is why most graph databases (Neo4j included) favor vertical scaling + read replicas over horizontal sharding.
Approaches to graph partitioning:
- Hash partitioning — Distributes nodes evenly but randomly, maximizing cross-partition edges. Poor for traversals.
- Community-based partitioning — Uses community detection to place tightly-connected subgraphs on the same partition. Better but requires re-partitioning as the graph evolves.
- TigerGraph’s approach — Stores every edge on both partitions (the start node’s and the end node’s). Doubles storage but enables local traversals. TigerGraph claims linear horizontal scalability with this approach.
Key Takeaways
- Property graph model — Nodes with labels and properties, edges with types and properties. Simple but powerful enough to model any domain.
- Index-free adjacency — The killer feature. Each node physically points to its neighbors. Edge traversal is O(1) regardless of graph size.
- Cypher — Declarative, pattern-based query language. ASCII art syntax makes graph patterns intuitive:
(a)-[:REL]->(b). - Performance — Graph databases are 10–1000× faster than SQL for multi-hop traversals. The advantage grows with hop count.
- Neo4j architecture — Native graph storage, WAL for ACID, Raft clustering for HA, and the GDS library for 60+ graph algorithms.
- Amazon Neptune — Managed alternative on AWS. Supports openCypher, Gremlin, and SPARQL. Better for operational workloads on AWS.
- Use when — Social networks, fraud detection, knowledge graphs, recommendations, network topology, dependency analysis.
- Don’t use when — Simple CRUD, heavy aggregations, tabular reporting, or when relationships are simple 1:N foreign keys.
Quick Reference: Cypher Cheat Sheet
| Operation | Cypher |
|---|---|
| Create node | CREATE (n:Label {prop: val}) |
| Create relationship | CREATE (a)-[:TYPE {prop: val}]->(b) |
| Match pattern | MATCH (a)-[:TYPE]->(b) RETURN b |
| Variable hops | MATCH (a)-[:TYPE*2..5]->(b) |
| Shortest path | shortestPath((a)-[*]-(b)) |
| Update | MATCH (n) SET n.prop = val |
| Upsert | MERGE (n:Label {key: val}) |
| Delete | MATCH (n) DETACH DELETE n |
| Aggregation | RETURN count(n), collect(n.name) |
| Create index | CREATE INDEX FOR (n:Label) ON (n.prop) |
| Execution plan | EXPLAIN / PROFILE |