Elasticsearch storage

Elasticsearch provides full-text search and faceted filtering for Chive. Like all Chive databases, it stores only index data, not source-of-truth records, so its contents can be rebuilt when needed (see Rebuilding).

Index architecture

Index naming

Preprint indexes use sequentially numbered names (the rollover convention) behind two aliases:

preprints-000001  ← Rolled-over index (no longer written to)
preprints-000002  ← Current write index
preprints         ← Write alias (points to the current write index)
preprints-read    ← Read alias (points to all indexes)
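
To make the naming scheme concrete, the sketch below derives the next index name in a rollover series. It follows the Elasticsearch convention of incrementing a six-digit, zero-padded numeric suffix. The helper name `nextRolloverIndexName` is hypothetical and is not part of Chive's codebase.

```typescript
// Hypothetical helper: compute the name the next index in a rollover
// series would receive, e.g. preprints-000001 -> preprints-000002.
function nextRolloverIndexName(current: string): string {
  const match = /^(.*)-(\d+)$/.exec(current);
  if (match === null) {
    throw new Error(`Index name "${current}" does not end in -<number>`);
  }
  const base = match[1] ?? '';
  const counter = match[2] ?? '0';
  const next = Number.parseInt(counter, 10) + 1;
  // Rollover suffixes are six digits wide and zero-padded.
  return `${base}-${String(next).padStart(6, '0')}`;
}
```

In practice Elasticsearch generates these names itself during rollover; a helper like this is mainly useful in tests or setup scripts that need to predict them.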

Index template

The preprints template applies to all preprints-* indexes:

{
  "index_patterns": ["preprints-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 2,
      "analysis": {
        "analyzer": {
          "preprint_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "porter_stem", "asciifolding"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "uri": { "type": "keyword" },
        "title": {
          "type": "text",
          "analyzer": "preprint_analyzer",
          "fields": {
            "keyword": { "type": "keyword" },
            "suggest": { "type": "completion" }
          }
        },
        "abstract": { "type": "text", "analyzer": "preprint_analyzer" },
        "keywords": { "type": "keyword" },
        "author_did": { "type": "keyword" },
        "author_name": { "type": "text" },
        "fields": { "type": "keyword" },
        "created_at": { "type": "date" },
        "indexed_at": { "type": "date" }
      }
    }
  }
}

Index lifecycle management

ILM policy

Indexes rotate through hot/warm/cold tiers:

| Phase | Duration | Actions |
| --- | --- | --- |
| Hot | 0-30 days | Rollover at 50GB or 30 days |
| Warm | 30-90 days | Force merge to 1 segment, reduce replicas |
| Cold | 90+ days | Move to cold storage, reduce priority |

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "allocate": { "number_of_replicas": 1 }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "allocate": { "number_of_replicas": 0 }
        }
      }
    }
  }
}
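
The policy's thresholds can be read as two small decision rules, sketched below. The function names are hypothetical and the logic is a simplification: Elasticsearch evaluates these conditions itself, rolls over when either rollover condition is met, and measures `min_age` from the rollover time once an index has rolled over.

```typescript
type IlmPhase = 'hot' | 'warm' | 'cold';

// Thresholds copied from the policy above.
const ROLLOVER_MAX_AGE_DAYS = 30;
const ROLLOVER_MAX_SIZE_GB = 50;
const WARM_MIN_AGE_DAYS = 30;
const COLD_MIN_AGE_DAYS = 90;

// Rollover triggers when EITHER condition is satisfied.
function shouldRollover(ageDays: number, sizeGb: number): boolean {
  return ageDays >= ROLLOVER_MAX_AGE_DAYS || sizeGb >= ROLLOVER_MAX_SIZE_GB;
}

// Phase of an index that has already rolled over, based on the number of
// days since its rollover.
function phaseAfterRollover(daysSinceRollover: number): IlmPhase {
  if (daysSinceRollover >= COLD_MIN_AGE_DAYS) return 'cold';
  if (daysSinceRollover >= WARM_MIN_AGE_DAYS) return 'warm';
  return 'hot';
}
```

One consequence worth noting: an index that rolls over early because it reached 50 GB still waits 30 days after that rollover before entering the warm phase.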

Adapter usage

Indexing documents

import { ElasticsearchAdapter } from '@/storage/elasticsearch/adapter.js';

const adapter = new ElasticsearchAdapter(client, logger);

// Index a preprint
await adapter.indexPreprint({
  uri: 'at://did:plc:abc/pub.chive.preprint.submission/xyz',
  title: 'Attention Is All You Need',
  abstract: 'We propose a new simple network architecture...',
  keywords: ['transformers', 'attention', 'neural networks'],
  authorDid: 'did:plc:abc',
  authorName: 'Vaswani et al.',
  fields: ['cs.AI', 'cs.CL'],
  createdAt: new Date('2017-06-12'),
});

// Bulk index
await adapter.bulkIndex(preprints, { chunkSize: 500 });

// Delete
await adapter.delete(uri);
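
The `chunkSize` option suggests that bulk indexing splits documents into batches before sending them. A minimal sketch of that idea follows, assuming each batch becomes one newline-delimited `_bulk` request body. The helpers `chunkDocs` and `toBulkBody`, and the choice of the document URI as `_id`, are assumptions for illustration, not the adapter's confirmed behavior.

```typescript
interface BulkDoc {
  uri: string;
  title: string;
}

// Split an array into consecutive chunks of at most `size` items.
function chunkDocs<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Build the newline-delimited body of one _bulk request: an action line
// followed by the document source, for every document.
function toBulkBody(index: string, docs: readonly BulkDoc[]): string {
  const lines: string[] = [];
  for (const doc of docs) {
    // Assumption: the AT URI doubles as the Elasticsearch document ID.
    lines.push(JSON.stringify({ index: { _index: index, _id: doc.uri } }));
    lines.push(JSON.stringify(doc));
  }
  // The bulk API requires the body to end with a newline.
  return lines.join('\n') + '\n';
}
```

Batching keeps each request body bounded, so one oversized payload does not fail the whole import.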

Searching

// Full-text search
const results = await adapter.search({
  query: 'attention mechanisms transformer',
  fields: ['title^2', 'abstract'],
  limit: 20,
  offset: 0,
});

// With filters
const filtered = await adapter.search({
  query: 'machine learning',
  filters: {
    fields: ['cs.AI', 'cs.LG'],
    dateRange: { from: '2024-01-01' },
    authors: ['did:plc:author1'],
  },
  sort: [{ created_at: 'desc' }],
});

Search query builder

The SearchQueryBuilder constructs Elasticsearch queries:

import { SearchQueryBuilder } from '@/storage/elasticsearch/search-query-builder.js';

const builder = new SearchQueryBuilder()
  .query('neural networks')
  .fields(['title^3', 'abstract^1', 'keywords^2'])
  .filter('fields', ['cs.AI'])
  .filter('dateRange', { from: '2024-01-01', to: '2024-12-31' })
  .highlight(['title', 'abstract'])
  .sort('created_at', 'desc')
  .paginate(20, 0);

const esQuery = builder.build();

Query types

| Method | Elasticsearch query |
| --- | --- |
| .query(text) | multi_match with cross_fields |
| .phrase(text) | match_phrase for exact sequences |
| .prefix(text) | prefix for autocomplete |
| .filter(field, value) | term in filter context |
| .range(field, opts) | range query |
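
To show how these pieces might combine, here is a sketch of a request body with a `multi_match` clause in the scoring context and `term`/`terms`/`range` clauses in the filter context. `buildSearchBodySketch` and its option names are hypothetical; the actual output of `SearchQueryBuilder.build()` may differ.

```typescript
type QueryClause = Record<string, unknown>;

interface SearchBodySketch {
  from: number;
  size: number;
  query: { bool: { must: QueryClause[]; filter: QueryClause[] } };
}

interface SketchOptions {
  text: string;
  fields: string[];
  termFilters?: Record<string, string | string[]>;
  dateRange?: { field: string; from?: string; to?: string };
  size?: number;
  from?: number;
}

function buildSearchBodySketch(opts: SketchOptions): SearchBodySketch {
  const filter: QueryClause[] = [];
  // Filter-context clauses restrict results without affecting the score.
  for (const [field, value] of Object.entries(opts.termFilters ?? {})) {
    filter.push(
      Array.isArray(value) ? { terms: { [field]: value } } : { term: { [field]: value } },
    );
  }
  if (opts.dateRange !== undefined) {
    const bounds: Record<string, string> = {};
    if (opts.dateRange.from !== undefined) bounds['gte'] = opts.dateRange.from;
    if (opts.dateRange.to !== undefined) bounds['lte'] = opts.dateRange.to;
    filter.push({ range: { [opts.dateRange.field]: bounds } });
  }
  return {
    from: opts.from ?? 0,
    size: opts.size ?? 20,
    query: {
      bool: {
        // The scoring clause: one multi_match across the boosted fields.
        must: [{ multi_match: { query: opts.text, fields: opts.fields, type: 'cross_fields' } }],
        filter,
      },
    },
  };
}
```

Keeping filters out of the scoring context lets Elasticsearch cache them and leaves relevance driven by the text match alone.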

Aggregations

import { AggregationsBuilder } from '@/storage/elasticsearch/aggregations-builder.js';

const aggs = new AggregationsBuilder()
  .terms('fields', { size: 20 })
  .terms('keywords', { size: 50 })
  .dateHistogram('created_at', { interval: 'month' })
  .build();

const result = await adapter.searchWithAggregations({
  query: 'machine learning',
  aggregations: aggs,
});

// Access facet counts
for (const bucket of result.aggregations.fields.buckets) {
  console.log(`${bucket.key}: ${bucket.doc_count}`);
}
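
For reference, the sketch below shows the kind of aggregation body such a builder could emit, plus a helper that flattens response buckets into a lookup table. The helper names are hypothetical. The sketch uses `calendar_interval`, which newer Elasticsearch versions expect in place of the deprecated `interval` option; whether Chive's builder does the same is an assumption.

```typescript
// Request side: aggregation clauses keyed by the name used in the response.
function termsAggSketch(field: string, size: number) {
  return { terms: { field, size } };
}

function monthlyHistogramSketch(field: string) {
  return { date_histogram: { field, calendar_interval: 'month' } };
}

const aggsBodySketch = {
  fields: termsAggSketch('fields', 20),
  keywords: termsAggSketch('keywords', 50),
  created_at: monthlyHistogramSketch('created_at'),
};

// Response side: turn a bucket list into a key -> count record.
interface FacetBucket {
  key: string;
  doc_count: number;
}

function toFacetCounts(buckets: readonly FacetBucket[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bucket of buckets) {
    counts[bucket.key] = bucket.doc_count;
  }
  return counts;
}
```

Terms aggregations need a `keyword` field, which is why `fields` and `keywords` are mapped that way in the index template.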

PMEST facets

Chive uses PMEST classification for faceted navigation:

| Dimension | Examples |
| --- | --- |
| Personality | Author, institution, funder |
| Matter | Subject field, methodology |
| Energy | Research type (theoretical, empirical) |
| Space | Geographic focus, language |
| Time | Publication date, era studied |

Autocomplete

import { AutocompleteService } from '@/storage/elasticsearch/autocomplete-service.js';

const autocomplete = new AutocompleteService(client, logger);

// Title suggestions
const suggestions = await autocomplete.suggest('atten', {
  field: 'title.suggest',
  size: 8,
});

// Keyword suggestions
const keywords = await autocomplete.suggestKeywords('mach', { size: 10 });

// Author suggestions
const authors = await autocomplete.suggestAuthors('vas', { size: 5 });
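
Since `title.suggest` is mapped as a `completion` field, title suggestions presumably go through the Elasticsearch completion suggester. The sketch below builds such a request and extracts the suggested texts from a response. The suggester name `title-suggest` and both helper functions are hypothetical, not taken from `AutocompleteService`.

```typescript
const SUGGESTER_NAME = 'title-suggest';

// Build a completion-suggester request for a typed prefix.
function buildCompletionRequest(prefix: string, field: string, size: number) {
  return {
    suggest: {
      [SUGGESTER_NAME]: {
        prefix,
        completion: { field, size, skip_duplicates: true },
      },
    },
  };
}

// Minimal shape of the part of the response this sketch reads.
interface CompletionResponse {
  suggest?: Record<string, Array<{ options: Array<{ text: string }> }>>;
}

// Collect the suggested texts in the order Elasticsearch returned them.
function extractSuggestions(res: CompletionResponse): string[] {
  const texts: string[] = [];
  const entries = res.suggest?.[SUGGESTER_NAME] ?? [];
  for (const entry of entries) {
    for (const option of entry.options) {
      texts.push(option.text);
    }
  }
  return texts;
}
```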

Query caching

import { QueryCache } from '@/storage/elasticsearch/query-cache.js';

const cache = new QueryCache(redis, {
  ttl: 300, // 5 minutes
  keyPrefix: 'es:cache:',
});

// Cached search
const results = await cache.getOrFetch(
  { query: 'neural networks', limit: 20 },
  () => adapter.search({ query: 'neural networks', limit: 20 })
);
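
A cache keyed on query parameters needs a key that does not depend on property order. The sketch below illustrates one way to get that, together with a cache-aside `getOrFetch`-style function written against a generic store interface. `CacheStore`, `stableStringify`, and `cachedFetch` are hypothetical names; how `QueryCache` actually derives its keys is not stated in this document.

```typescript
// Serialize plain JSON-like values with object keys sorted, so that
// { a: 1, b: 2 } and { b: 2, a: 1 } produce the same string.
function stableStringify(value: unknown): string {
  if (value === undefined) return 'null';
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .filter((key) => obj[key] !== undefined)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${parts.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Assumed minimal store interface; a Redis client could be wrapped to fit it.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Cache-aside lookup: return the cached value if present, otherwise call
// the fetcher, store its result with a TTL, and return it.
async function cachedFetch<T>(
  store: CacheStore,
  keyPrefix: string,
  ttlSeconds: number,
  params: unknown,
  fetcher: () => Promise<T>,
): Promise<T> {
  const key = keyPrefix + stableStringify(params);
  const hit = await store.get(key);
  if (hit !== null) return JSON.parse(hit) as T;
  const fresh = await fetcher();
  await store.set(key, JSON.stringify(fresh), ttlSeconds);
  return fresh;
}
```

A production version would likely hash the serialized parameters to keep keys short.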

Index management

import { IndexManager } from '@/storage/elasticsearch/index-manager.js';

const manager = new IndexManager(client, logger);

// Create index with template
await manager.createIndex('preprints-000001');

// Apply template updates
await manager.updateTemplate('preprints', templateDefinition);

// Reindex (e.g., after mapping changes)
await manager.reindex('preprints-000001', 'preprints-000002');

// Force merge for read optimization
await manager.forceMerge('preprints-000001', { maxSegments: 1 });

Pipelines

Ingest pipeline

Pre-process documents before indexing:

{
  "description": "Preprint ingest pipeline",
  "processors": [
    {
      "set": {
        "field": "indexed_at",
        "value": "{{_ingest.timestamp}}"
      }
    },
    {
      "lowercase": {
        "field": "keywords"
      }
    }
  ]
}

Usage

await adapter.indexPreprint(doc, { pipeline: 'preprint-ingest' });
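
The effect of this pipeline on a document can be approximated locally, which is handy in unit tests that run without a cluster. The function below is a hypothetical stand-in, not Elasticsearch's implementation. It mirrors one detail of the real `lowercase` processor: without `ignore_missing`, a document lacking the target field is rejected.

```typescript
interface IncomingDoc {
  keywords?: string[];
  indexed_at?: string;
  [field: string]: unknown;
}

// Approximate the two processors above: stamp indexed_at, then lowercase
// every entry of the keywords array.
function simulatePreprintIngest(doc: IncomingDoc, now: Date): IncomingDoc {
  if (doc.keywords === undefined) {
    // The pipeline as written does not set ignore_missing, so a missing
    // keywords field is treated as an error here too.
    throw new Error('field [keywords] not present');
  }
  return {
    ...doc,
    indexed_at: now.toISOString(),
    keywords: doc.keywords.map((keyword) => keyword.toLowerCase()),
  };
}
```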

Configuration

Environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| ELASTICSEARCH_URL | http://localhost:9200 | Cluster URL |
| ELASTICSEARCH_USER | None | Username (optional) |
| ELASTICSEARCH_PASSWORD | None | Password (optional) |
| ELASTICSEARCH_INDEX_PREFIX | chive | Index name prefix |
| ELASTICSEARCH_SHARDS | 3 | Number of shards |
| ELASTICSEARCH_REPLICAS | 2 | Number of replicas |
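
A loader for these variables might look like the sketch below, which applies the defaults from the table and validates the numeric settings. The `ElasticsearchConfig` shape and the function names are assumptions for illustration; Chive's own configuration code may be organized differently.

```typescript
interface ElasticsearchConfig {
  url: string;
  username: string | undefined;
  password: string | undefined;
  indexPrefix: string;
  shards: number;
  replicas: number;
}

type EnvVars = Record<string, string | undefined>;

// Parse an integer setting, falling back to a default when unset and
// rejecting values below the given minimum.
function parseCount(env: EnvVars, name: string, fallback: number, min: number): number {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < min) {
    throw new Error(`${name} must be an integer >= ${min}, got "${raw}"`);
  }
  return parsed;
}

function loadElasticsearchConfig(env: EnvVars): ElasticsearchConfig {
  return {
    url: env['ELASTICSEARCH_URL'] ?? 'http://localhost:9200',
    username: env['ELASTICSEARCH_USER'],
    password: env['ELASTICSEARCH_PASSWORD'],
    indexPrefix: env['ELASTICSEARCH_INDEX_PREFIX'] ?? 'chive',
    shards: parseCount(env, 'ELASTICSEARCH_SHARDS', 3, 1), // at least one shard
    replicas: parseCount(env, 'ELASTICSEARCH_REPLICAS', 2, 0), // zero replicas is valid
  };
}
```

In a Node.js process this would be called as `loadElasticsearchConfig(process.env)`.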

Setup

# Apply templates and create initial index
tsx scripts/db/setup-elasticsearch.ts

# Or via npm script
pnpm db:setup:elasticsearch

Monitoring

Cluster health

curl 'http://localhost:9200/_cluster/health?pretty'

Index stats

curl 'http://localhost:9200/preprints/_stats?pretty'

Slow queries

Enable slow query logging by applying these settings to the index (for example through the update index settings API):

{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}

Rebuilding

If search indexes need rebuilding:

# 1. Create new index
tsx scripts/db/create-index.ts preprints-new

# 2. Reindex from PostgreSQL
tsx scripts/db/reindex-from-pg.ts --target preprints-new

# 3. Switch alias
tsx scripts/db/switch-alias.ts preprints preprints-new

# 4. Delete the old index (preprints-old stands in for its actual name)
curl -X DELETE http://localhost:9200/preprints-old
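
Step 3 is the critical one: the alias should move in a single request so searches never see a moment with no index behind it. The Elasticsearch aliases API applies all actions in one call atomically, and the sketch below builds such a request body. `buildAliasSwap` is a hypothetical helper, and marking the new index with `is_write_index` is an assumption; the contents of `switch-alias.ts` are not shown in this document.

```typescript
interface AliasAction {
  add?: { index: string; alias: string; is_write_index?: boolean };
  remove?: { index: string; alias: string };
}

// Build the body for POST /_aliases that detaches the alias from the old
// index and attaches it to the new one in the same request.
function buildAliasSwap(
  alias: string,
  oldIndex: string,
  newIndex: string,
): { actions: AliasAction[] } {
  return {
    actions: [
      { remove: { index: oldIndex, alias } },
      { add: { index: newIndex, alias, is_write_index: true } },
    ],
  };
}
```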

Testing

# Integration tests
pnpm test tests/integration/storage/elasticsearch-search.test.ts

# Search relevance tests
pnpm test tests/integration/storage/elasticsearch-relevance.test.ts