Neo4j storage

Neo4j stores Chive's knowledge graph: field taxonomy, authority records, citations, and collaboration networks.

Graph schema

Node types

| Label | Description | Key properties |
| --- | --- | --- |
| Node | Knowledge graph node | id, kind, subkind, label, status |
| GraphEdge | Relationship between nodes | sourceUri, targetUri, relationSlug, weight |
| WikidataEntity | Wikidata Q-IDs for linking | qid, label, description |
| Node:Object:Eprint | Eprint nodes for graph queries | uri, label, subkind |
| Node:Object:Person | Author nodes for collaboration | metadata.did, label, subkind |

GraphNode subkind values:

| subkind | Description |
| --- | --- |
| field | Hierarchical field taxonomy |
| facet | PMEST classification value |
| institution | Research organization |
| person | Individual authority record |
| concept | General concept |
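
The node shape and subkind values above can be modelled in TypeScript roughly as follows. This is a minimal sketch: `GraphNodeSketch`, `isSubkind`, `isAuthoritySubkind`, and `describeNode` are illustrative names rather than Chive's actual types, and treating `institution`, `person`, and `concept` as the authority-related subkinds is an assumption based on the authority search example later on this page.

```typescript
// Illustrative types only; Chive's real definitions may differ.
type Subkind = 'field' | 'facet' | 'institution' | 'person' | 'concept';

interface GraphNodeSketch {
  id: string;
  kind: string; // e.g. 'object', as used in the search examples
  subkind: Subkind;
  label: string;
  status: string;
}

const SUBKINDS: readonly Subkind[] = ['field', 'facet', 'institution', 'person', 'concept'];

// Narrow an untrusted string (for example a query parameter) to a known subkind.
function isSubkind(value: string): value is Subkind {
  return (SUBKINDS as readonly string[]).includes(value);
}

// Assumption: authority-type nodes are those with one of these three subkinds.
function isAuthoritySubkind(subkind: Subkind): boolean {
  return subkind === 'institution' || subkind === 'person' || subkind === 'concept';
}

// Small helper that shows the interface in use.
function describeNode(node: GraphNodeSketch): string {
  return `${node.subkind}:${node.id} (${node.label})`;
}
```

Validating `subkind` at the boundary keeps arbitrary strings out of search filters.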

Relationship types

| Type | From | To | Description |
| --- | --- | --- | --- |
| PARENT_OF | Field | Field | Hierarchy |
| RELATED_TO | Field | Field | Semantic similarity |
| MAPPED_TO | AuthorityRecord | WikidataEntity | External linking |
| TAGGED_WITH | Eprint | Field | Field classification |
| AUTHORED | Author | Eprint | Authorship |
| CITES | Eprint | Eprint | Citations |
| COLLABORATES_WITH | Author | Author | Co-authorship |

Constraints and indexes

Uniqueness constraints

CREATE CONSTRAINT node_id_unique
FOR (n:Node) REQUIRE n.id IS UNIQUE;

CREATE CONSTRAINT node_uri_unique
FOR (n:Node) REQUIRE n.uri IS UNIQUE;

CREATE CONSTRAINT wikidata_qid_unique
FOR (w:WikidataEntity) REQUIRE w.qid IS UNIQUE;

Performance indexes

// Field search
CREATE INDEX field_label_idx FOR (f:Field) ON (f.label);
CREATE TEXT INDEX field_label_text FOR (f:Field) ON (f.label);

// Authority search
CREATE INDEX authority_name_idx FOR (a:AuthorityRecord) ON (a.name);
CREATE INDEX authority_type_idx FOR (a:AuthorityRecord) ON (a.type);

// Eprint lookup
CREATE INDEX eprint_created_idx FOR (p:Eprint) ON (p.createdAt);
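
Schema statements like these fail if the constraint or index already exists, which matters when a setup script is run more than once. One possible approach, sketched below, is to add Cypher's `IF NOT EXISTS` clause before executing each statement. The `makeIdempotent` helper is hypothetical and not part of Chive's setup script. It only handles statements of the shape shown above, with a named constraint or index followed by `FOR`.

```typescript
// Hypothetical helper: insert IF NOT EXISTS after the constraint/index name.
// Statements that already contain the clause, or that do not match the
// expected shape, are returned unchanged.
function makeIdempotent(statement: string): string {
  return statement.replace(
    /^(CREATE\s+(?:TEXT\s+)?(?:CONSTRAINT|INDEX)\s+\w+)\s+(?=FOR\b)/i,
    '$1 IF NOT EXISTS ',
  );
}

const schemaStatements: string[] = [
  'CREATE CONSTRAINT node_id_unique FOR (n:Node) REQUIRE n.id IS UNIQUE',
  'CREATE TEXT INDEX field_label_text FOR (f:Field) ON (f.label)',
];

// Statements that are safe to re-run against an existing database.
const idempotentStatements: string[] = schemaStatements.map(makeIdempotent);
```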

Adapter usage

Connection

import { Neo4jAdapter } from '@/storage/neo4j/adapter.js';
import { createNeo4jConnection } from '@/storage/neo4j/connection.js';

const driver = await createNeo4jConnection({
  uri: process.env.NEO4J_URI,
  user: process.env.NEO4J_USER,
  password: process.env.NEO4J_PASSWORD,
});

const adapter = new Neo4jAdapter(driver, logger);

Node operations

// Get node by ID
const node = await adapter.getNode('cs.AI');

// Get children (nodes with narrower edge pointing from this node)
// Note: hierarchy uses 'narrower' edges for parent-to-child relationships
const children = await adapter.getNodeChildren('cs');

// Get ancestors (path to root via broader edges)
const ancestors = await adapter.getNodeAncestors('cs.AI.ML');

// Search nodes by label
const matches = await adapter.searchNodes('artificial intelligence', {
  kind: 'object',
  subkind: 'field',
  limit: 10,
  includeAlternateLabels: true,
});

// Get related nodes (via related edges)
const related = await adapter.getRelatedNodes('cs.AI', {
  limit: 5,
  minWeight: 0.5,
});

// Get edges for a node
const edges = await adapter.getEdges('cs.AI', {
  relationSlug: 'narrower', // for children, use 'narrower'
  direction: 'outgoing',
});

Edge direction conventions

The knowledge graph uses SKOS-style relationship naming:

| Relation | Direction | Description |
| --- | --- | --- |
| broader | Child to parent | Points to the broader/parent node |
| narrower | Parent to child | Points to the narrower/child node |
| related | Bidirectional | Associative relationship |
| sameAs | Node to external | Equivalence to external entity |

When querying for children, use narrower edges with outgoing direction. When querying for parents, use broader edges with outgoing direction.
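
The convention can be illustrated with a small in-memory model. This is a sketch, not the adapter's implementation: `EdgeSketch`, `childrenOf`, and `parentsOf` are hypothetical names, short IDs stand in for full node URIs, and storing both a `narrower` and a `broader` edge for each parent/child pair is an assumption made for the sample data.

```typescript
// Illustrative edge shape; field names mirror the GraphEdge key properties.
interface EdgeSketch {
  sourceUri: string;
  targetUri: string;
  relationSlug: 'broader' | 'narrower' | 'related' | 'sameAs';
}

// Children: follow outgoing 'narrower' edges from the node.
function childrenOf(uri: string, edges: EdgeSketch[]): string[] {
  return edges
    .filter((e) => e.relationSlug === 'narrower' && e.sourceUri === uri)
    .map((e) => e.targetUri);
}

// Parents: follow outgoing 'broader' edges from the node.
function parentsOf(uri: string, edges: EdgeSketch[]): string[] {
  return edges
    .filter((e) => e.relationSlug === 'broader' && e.sourceUri === uri)
    .map((e) => e.targetUri);
}

// Sample data (assumed): each hierarchy link is stored in both directions.
const sampleEdges: EdgeSketch[] = [
  { sourceUri: 'cs', targetUri: 'cs.AI', relationSlug: 'narrower' },
  { sourceUri: 'cs.AI', targetUri: 'cs', relationSlug: 'broader' },
  { sourceUri: 'cs.AI', targetUri: 'cs.AI.ML', relationSlug: 'narrower' },
  { sourceUri: 'cs.AI.ML', targetUri: 'cs.AI', relationSlug: 'broader' },
];
```

In both helpers the edge is outgoing from the node being queried; only the relation slug changes.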

Authority records

Authority records are GraphNode entries with authority-related subkind values:

import { NodeRepository } from '@/storage/neo4j/node-repository.js';

const repo = new NodeRepository(driver, logger);

// Get by ID
const node = await repo.findById('authority-123');

// Search authority-type nodes
const results = await repo.search('machine learning', {
  subkind: 'concept', // or 'institution', 'person'
  limit: 20,
});

// Get with external links
const withLinks = await repo.findWithExternalIds('authority-123');
console.log(withLinks.externalIds); // [{ source: 'wikidata', value: 'Q2539' }]

Citation graph

import { CitationGraph } from '@/storage/neo4j/citation-graph.js';

const graph = new CitationGraph(driver, logger);

// Add citation
await graph.addCitation(citingUri, citedUri);

// Get citations for an eprint
const citations = await graph.getCitations(eprintUri, {
  direction: 'outgoing', // or 'incoming'
  limit: 50,
});

// Get citation count
const count = await graph.getCitationCount(eprintUri);

// Find co-citation clusters
const clusters = await graph.findCoCitationClusters(eprintUri, {
  minSharedCitations: 3,
});
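
For readers unfamiliar with co-citation: two eprints are co-cited when a third eprint cites both of them. The sketch below counts, for one target eprint, how many citing eprints it shares with every other eprint, and keeps those at or above a threshold. It is an in-memory illustration of the idea only. How `findCoCitationClusters` actually groups results is not documented here, and `coCitedWith` is a hypothetical name.

```typescript
// Each pair is [citingUri, citedUri], i.e. one CITES edge.
type CitesPair = [string, string];

// Return eprints co-cited with `target` by at least `minShared` citing eprints,
// mapped to the number of shared citing eprints.
function coCitedWith(target: string, cites: CitesPair[], minShared: number): Map<string, number> {
  // Group cited eprints by the eprint that cites them.
  const citedByCiting = new Map<string, Set<string>>();
  for (const [citing, cited] of cites) {
    let citedSet = citedByCiting.get(citing);
    if (!citedSet) {
      citedSet = new Set<string>();
      citedByCiting.set(citing, citedSet);
    }
    citedSet.add(cited);
  }

  // For every citing eprint that cites the target, count its other references.
  const counts = new Map<string, number>();
  citedByCiting.forEach((citedSet) => {
    if (!citedSet.has(target)) return;
    citedSet.forEach((other) => {
      if (other !== target) counts.set(other, (counts.get(other) ?? 0) + 1);
    });
  });

  // Keep only eprints that meet the threshold.
  const result = new Map<string, number>();
  counts.forEach((shared, other) => {
    if (shared >= minShared) result.set(other, shared);
  });
  return result;
}

// Placeholder identifiers, not real eprint URIs.
const sampleCites: CitesPair[] = [
  ['p1', 'A'], ['p1', 'B'],
  ['p2', 'A'], ['p2', 'B'],
  ['p3', 'A'], ['p3', 'C'],
];
const coCited = coCitedWith('A', sampleCites, 2);
```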

Automated citation extraction

When GROBID extracts citations from an eprint PDF, matched citations create CITES edges:

// After GROBID extraction resolves a DOI to a Chive eprint
await graph.addCitation(citingUri, citedUri);

User-curated citations from pub.chive.eprint.citation records with a chiveUri field also create CITES edges during indexing.
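
The resolution step between extraction and `addCitation` might look like the sketch below. It assumes extracted citations carry an optional DOI and that a lookup from DOI to Chive eprint URI is available. `ExtractedCitation`, `CitesEdge`, `resolveCitations`, and the `doiIndex` map are hypothetical and stand in for whatever the indexing pipeline actually uses. The URIs are placeholders.

```typescript
interface ExtractedCitation {
  rawText: string;
  doi?: string;
}

interface CitesEdge {
  citingUri: string;
  citedUri: string;
}

// Turn extracted citations into CITES edge candidates.
// Citations without a DOI, or whose DOI is not a known Chive eprint, are skipped.
function resolveCitations(
  citingUri: string,
  extracted: ExtractedCitation[],
  doiIndex: Map<string, string>, // assumed: lowercase DOI -> eprint URI
): CitesEdge[] {
  const seen = new Set<string>();
  const edges: CitesEdge[] = [];
  for (const citation of extracted) {
    if (!citation.doi) continue;
    // DOIs are case-insensitive, so normalize before the lookup.
    const citedUri = doiIndex.get(citation.doi.toLowerCase());
    // Skip unmatched DOIs, self-citations, and duplicates.
    if (!citedUri || citedUri === citingUri || seen.has(citedUri)) continue;
    seen.add(citedUri);
    edges.push({ citingUri, citedUri });
  }
  return edges;
}

// Placeholder data for illustration.
const doiIndex = new Map<string, string>([['10.1234/example.1', 'at://example/eprint/cited-1']]);
const resolved = resolveCitations(
  'at://example/eprint/citing',
  [
    { rawText: 'Matched reference', doi: '10.1234/EXAMPLE.1' },
    { rawText: 'Duplicate of the same reference', doi: '10.1234/example.1' },
    { rawText: 'Reference without a DOI' },
    { rawText: 'DOI not in Chive', doi: '10.9999/unknown' },
  ],
  doiIndex,
);
```

Each resulting pair would then be passed to `graph.addCitation(citingUri, citedUri)`.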

Graph algorithms

The Neo4j Graph Data Science (GDS) library powers advanced queries:

import { GraphAlgorithms } from '@/storage/neo4j/graph-algorithms.js';

const algorithms = new GraphAlgorithms(driver, logger);

// PageRank for field importance
const ranked = await algorithms.fieldPageRank({
  dampingFactor: 0.85,
  maxIterations: 20,
});

// Louvain community detection
const communities = await algorithms.detectCommunities('Person', 'COLLABORATES_WITH');

// Shortest path between fields
const path = await algorithms.shortestPath('cs.AI', 'physics.comp-ph');

// Node similarity
const similar = await algorithms.nodeSimilarity('Eprint', 'TAGGED_WITH', {
  topK: 10,
});
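
To make the `dampingFactor` and `maxIterations` parameters concrete, here is a textbook PageRank by power iteration over an in-memory edge list. It is a sketch of the general technique, not of `fieldPageRank` or the GDS implementation, whose score scaling differs from this normalized variant. The `pageRank` function and the sample edges are illustrative.

```typescript
// Each pair is [fromId, toId]. Scores are normalized so they sum to 1.
function pageRank(
  edges: Array<[string, string]>,
  dampingFactor = 0.85,
  maxIterations = 20,
): Map<string, number> {
  // Collect every node and its outgoing links.
  const outLinks = new Map<string, string[]>();
  for (const [from, to] of edges) {
    if (!outLinks.has(from)) outLinks.set(from, []);
    if (!outLinks.has(to)) outLinks.set(to, []);
    outLinks.get(from)!.push(to);
  }
  const nodes = Array.from(outLinks.keys());
  const n = nodes.length;

  // Start from a uniform distribution.
  let rank = new Map<string, number>();
  for (const id of nodes) rank.set(id, 1 / n);

  for (let i = 0; i < maxIterations; i++) {
    // Every node receives the "teleport" share first.
    const next = new Map<string, number>();
    for (const id of nodes) next.set(id, (1 - dampingFactor) / n);

    for (const id of nodes) {
      const targets = outLinks.get(id)!;
      // A node without outgoing links spreads its rank over all nodes.
      const recipients = targets.length > 0 ? targets : nodes;
      const share = (dampingFactor * rank.get(id)!) / recipients.length;
      for (const target of recipients) next.set(target, next.get(target)! + share);
    }
    rank = next;
  }
  return rank;
}

// Sample graph (assumed): edges point from a field to a field it links to.
const ranks = pageRank([
  ['cs.AI.ML', 'cs.AI'],
  ['cs.AI', 'cs'],
  ['physics.comp-ph', 'cs'],
]);
```

A higher damping factor gives more weight to link structure and less to the uniform baseline. More iterations bring the scores closer to convergence.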

Wikidata integration

import { WikidataConnector } from '@/storage/neo4j/wikidata-connector.js';

const wikidata = new WikidataConnector(driver, sparqlClient, logger);

// Reconcile authority record with Wikidata
const match = await wikidata.reconcile('machine learning', {
  type: 'concept',
  threshold: 0.8,
});

if (match) {
  console.log(`Matched to ${match.qid}: ${match.label}`);
}

// Sync Wikidata properties
await wikidata.syncProperties('Q2539', ['description', 'aliases', 'sitelinks']);

// Get Wikidata hierarchy
const hierarchy = await wikidata.getHierarchy('Q2539', {
  relationshipType: 'P279', // subclass of
  depth: 3,
});

SPARQL queries

import { SparqlClient } from '@/storage/neo4j/sparql-client.js';

const sparql = new SparqlClient('https://query.wikidata.org/sparql');

// Find related Wikidata entities
const results = await sparql.query(`
  SELECT ?item ?itemLabel WHERE {
    ?item wdt:P31 wd:Q11862829 . # instance of academic discipline
    ?item wdt:P361 wd:Q21198 . # part of computer science
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 100
`);
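
The return type of `sparql.query` is not shown on this page. If it exposes the standard SPARQL 1.1 JSON results format, which the Wikidata endpoint can return, the Q-IDs and labels could be extracted as in the following sketch. The `extractEntities` helper and the sample response are illustrative assumptions.

```typescript
// Minimal types for the SPARQL 1.1 Query Results JSON Format.
interface SparqlTerm {
  type: string;
  value: string;
}
type SparqlBinding = Record<string, SparqlTerm | undefined>;
interface SparqlJsonResults {
  results: { bindings: SparqlBinding[] };
}

// Pull { qid, label } pairs out of ?item / ?itemLabel bindings.
function extractEntities(json: SparqlJsonResults): Array<{ qid: string; label: string }> {
  const entities: Array<{ qid: string; label: string }> = [];
  for (const binding of json.results.bindings) {
    const uri = binding['item']?.value;
    if (!uri) continue;
    // Entity URIs look like http://www.wikidata.org/entity/Q2539
    const qid = uri.substring(uri.lastIndexOf('/') + 1);
    if (!/^Q\d+$/.test(qid)) continue;
    // Fall back to the Q-ID when no label was returned.
    entities.push({ qid, label: binding['itemLabel']?.value ?? qid });
  }
  return entities;
}

// Hand-written sample in the standard format, not real query output.
const sampleResponse: SparqlJsonResults = {
  results: {
    bindings: [
      {
        item: { type: 'uri', value: 'http://www.wikidata.org/entity/Q2539' },
        itemLabel: { type: 'literal', value: 'machine learning' },
      },
      { item: { type: 'uri', value: 'http://example.org/not-an-entity' } },
    ],
  },
};
const entities = extractEntities(sampleResponse);
```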

Facet management

import { FacetManager } from '@/storage/neo4j/facet-manager.js';

const facets = new FacetManager(driver, logger);

// Get PMEST dimensions
const dimensions = await facets.getDimensions();

// Get facet values for a dimension
const values = await facets.getFacetValues('matter', {
  limit: 100,
  includeCount: true,
});

// Assign facets to eprint
await facets.assignFacets(eprintUri, [
  { dimension: 'matter', value: 'cs.AI' },
  { dimension: 'energy', value: 'empirical' },
]);

Tag management

import { TagManager } from '@/storage/neo4j/tag-manager.js';

const tags = new TagManager(driver, logger);

// Get trending tags
const trending = await tags.getTrending({
  period: '7d',
  limit: 20,
});

// Get tag suggestions for eprint
const suggestions = await tags.getSuggestions(eprintUri, {
  basedOn: 'similar-eprints',
  limit: 10,
});

// Check for spam tags
import { TagSpamDetector } from '@/storage/neo4j/tag-spam-detector.js';

const detector = new TagSpamDetector(driver, logger);
const isSpam = await detector.check(tagText, authorDid);

Proposal handling

import { ProposalHandler } from '@/storage/neo4j/proposal-handler.js';

const proposals = new ProposalHandler(driver, storage, logger);

// Create node proposal
const nodeProposal = await proposals.createNodeProposal({
  proposalType: 'create',
  kind: 'object',
  subkind: 'field',
  proposedNode: {
    id: 'cs.QML',
    label: 'Quantum Machine Learning',
    alternateLabels: ['QML'],
    description: 'Algorithms combining quantum computing with ML',
  },
  rationale: 'Emerging interdisciplinary field',
  proposerDid: userDid,
});

// Create edge proposal (for parent relationship)
const edgeProposal = await proposals.createEdgeProposal({
  proposalType: 'create',
  proposedEdge: {
    sourceUri:
      'at://did:plc:chive-governance/pub.chive.graph.node/c1d2e3f4-a5b6-7890-1234-567890abcdef',
    targetUri:
      'at://did:plc:chive-governance/pub.chive.graph.node/726c5017-723e-5ae5-a1e2-f12e636eb709',
    relationSlug: 'broader',
    weight: 1.0,
  },
  rationale: 'QML is a subfield of AI',
  proposerDid: userDid,
});

// Apply approved proposal
await proposals.applyProposal(proposalId);

// Revert proposal (if needed)
await proposals.revertProposal(proposalId);
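
Edge proposals reference nodes by AT-URI, so it can help to validate both URIs before submitting. The sketch below splits an AT-URI of the form `at://<authority>/<collection>/<rkey>` and applies two sanity checks. `parseAtUri` and `checkProposedEdge` are hypothetical helpers, and the specific checks (node collection, no self-loops) are assumptions rather than rules enforced by `ProposalHandler`.

```typescript
interface AtUriParts {
  authority: string; // DID or handle
  collection: string;
  rkey: string;
}

// Split at://authority/collection/rkey; returns null for any other shape.
function parseAtUri(uri: string): AtUriParts | null {
  if (!uri.startsWith('at://')) return null;
  const parts = uri.slice('at://'.length).split('/');
  if (parts.length !== 3 || parts.some((part) => part.length === 0)) return null;
  const [authority, collection, rkey] = parts as [string, string, string];
  return { authority, collection, rkey };
}

// Assumed checks: both ends are graph node records and the edge is not a self-loop.
function checkProposedEdge(sourceUri: string, targetUri: string): string[] {
  const problems: string[] = [];
  const source = parseAtUri(sourceUri);
  const target = parseAtUri(targetUri);
  if (!source) problems.push('sourceUri is not a valid AT-URI');
  if (!target) problems.push('targetUri is not a valid AT-URI');
  if (source && source.collection !== 'pub.chive.graph.node') {
    problems.push('sourceUri does not point at a graph node');
  }
  if (target && target.collection !== 'pub.chive.graph.node') {
    problems.push('targetUri does not point at a graph node');
  }
  if (sourceUri === targetUri) problems.push('edge is a self-loop');
  return problems;
}

const parsedNodeUri = parseAtUri(
  'at://did:plc:chive-governance/pub.chive.graph.node/c1d2e3f4-a5b6-7890-1234-567890abcdef',
);
```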

Configuration

Environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| NEO4J_URI | bolt://localhost:7687 | Bolt connection URI |
| NEO4J_USER | neo4j | Username |
| NEO4J_PASSWORD | Required | Password |
| NEO4J_DATABASE | neo4j | Database name |
| NEO4J_MAX_POOL_SIZE | 50 | Connection pool size |
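
A loader that applies these defaults and fails fast on a missing password could look like the sketch below. `Neo4jConfigSketch` and `loadNeo4jConfig` are hypothetical names, and Chive's own configuration code may be structured differently. The environment is passed in as a plain object so the function can be exercised without touching the real process environment.

```typescript
interface Neo4jConfigSketch {
  uri: string;
  user: string;
  password: string;
  database: string;
  maxPoolSize: number;
}

// Read Neo4j settings from an environment-like object, applying the defaults
// from the table above. Throws if the password is missing or the pool size is invalid.
function loadNeo4jConfig(env: Record<string, string | undefined>): Neo4jConfigSketch {
  const password = env['NEO4J_PASSWORD'];
  if (!password) {
    throw new Error('NEO4J_PASSWORD is required');
  }

  const maxPoolSize = Number(env['NEO4J_MAX_POOL_SIZE'] ?? '50');
  if (!Number.isInteger(maxPoolSize) || maxPoolSize <= 0) {
    throw new Error('NEO4J_MAX_POOL_SIZE must be a positive integer');
  }

  return {
    uri: env['NEO4J_URI'] ?? 'bolt://localhost:7687',
    user: env['NEO4J_USER'] ?? 'neo4j',
    password,
    database: env['NEO4J_DATABASE'] ?? 'neo4j',
    maxPoolSize,
  };
}

// Example with only the required variable set (placeholder value).
const defaultConfig = loadNeo4jConfig({ NEO4J_PASSWORD: 'example-password' });
```

In an application this would typically be called as `loadNeo4jConfig(process.env)`.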

Setup

# Initialize schema and bootstrap data
tsx scripts/db/setup-neo4j.ts

# Or via npm script
pnpm db:setup:neo4j

Bootstrap data

Initial data includes:

  • Root field node (id: 'root')
  • Top-level field categories (cs, physics, math, bio, etc.)
  • 10 PMEST facet dimension templates
  • Initial authority records from LCSH

Testing

# Integration tests
pnpm test tests/integration/storage/neo4j-operations.test.ts

# Citation graph tests
pnpm test tests/unit/storage/neo4j/citation-graph.test.ts

# Algorithm tests
pnpm test tests/integration/storage/neo4j-algorithms.test.ts

Monitoring

Query performance

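// Show currently running queries that have been executing for more than 1 second
// (dbms.listQueries() was removed in Neo4j 5; use SHOW TRANSACTIONS there)
CALL dbms.listQueries() YIELD query, elapsedTimeMillis
WHERE elapsedTimeMillis > 1000
RETURN query, elapsedTimeMillis;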

// Profile a query
PROFILE MATCH (f:Field)-[:PARENT_OF*]->(child)
WHERE f.id = 'cs'
RETURN child;

Database statistics

// Node counts by label set (nodes can carry several labels, e.g. Node:Object:Eprint)
MATCH (n)
RETURN labels(n) AS labels, count(*) AS count
ORDER BY count DESC;

// Relationship counts
MATCH ()-[r]->()
RETURN type(r) AS type, count(*) AS count
ORDER BY count DESC;

Next steps