Elasticsearch storage

Elasticsearch provides full-text search and faceted filtering for Chive. Like all Chive databases, it stores indexes only, so its contents can be rebuilt at any time (see Rebuilding).

Index architecture

Index naming

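Preprint indexes use numbered rollover names with aliases. A new index is created each time the rollover conditions in the ILM policy are met:

preprints-000001  ← Initial write index
preprints-000002  ← Created by rollover; becomes the write index
preprints         ← Write alias (points to the current write index)
preprints-read    ← Read alias (points to all preprints-* indexes)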
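The numbering scheme above can be sketched as a small function. This is only an illustration of how a rollover increments the zero-padded suffix; `nextRolloverName` is a hypothetical helper, not part of the Chive codebase, and Elasticsearch computes the name itself during rollover.

```typescript
// Hypothetical helper: mirrors how a rollover increments the numeric suffix.
function nextRolloverName(current: string): string {
  const match = /^(.*-)(\d+)$/.exec(current);
  if (!match) {
    throw new Error(`Index name has no numeric suffix: ${current}`);
  }
  const [, prefix, digits] = match;
  // Keep the original width so 000001 becomes 000002, not 2.
  const next = String(Number(digits) + 1).padStart(digits.length, '0');
  return `${prefix}${next}`;
}

console.log(nextRolloverName('preprints-000001')); // preprints-000002
```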

Index template

The preprints template applies to all preprints-* indexes:

{
  "index_patterns": ["preprints-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 2,
      "analysis": {
        "analyzer": {
          "preprint_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "porter_stem", "asciifolding"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "uri": { "type": "keyword" },
        "title": {
          "type": "text",
          "analyzer": "preprint_analyzer",
          "fields": {
            "keyword": { "type": "keyword" },
            "suggest": { "type": "completion" }
          }
        },
        "abstract": { "type": "text", "analyzer": "preprint_analyzer" },
        "keywords": { "type": "keyword" },
        "author_did": { "type": "keyword" },
        "author_name": { "type": "text" },
        "fields": { "type": "keyword" },
        "created_at": { "type": "date" },
        "indexed_at": { "type": "date" }
      }
    }
  }
}

Index lifecycle management

ILM policy

Indexes move through hot, warm, and cold phases. Because the hot phase uses rollover, the warm and cold ages are measured from the time an index rolls over:

Phase	Duration	Actions
Hot	0-30 days	Rollover at 50 GB or 30 days
Warm	30-90 days	Force merge to 1 segment, reduce replicas to 1
Cold	90+ days	Reduce replicas to 0

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "allocate": { "number_of_replicas": 1 }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "allocate": { "number_of_replicas": 0 }
        }
      }
    }
  }
}
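As a quick way to reason about the policy, the phase thresholds can be written as a lookup. This is a simplified sketch: `expectedPhase` is a hypothetical helper, and real ILM transitions also depend on the poll interval and on earlier actions completing.

```typescript
type Phase = 'hot' | 'warm' | 'cold';

// Thresholds copied from the min_age values in the policy above, in days.
const WARM_MIN_AGE_DAYS = 30;
const COLD_MIN_AGE_DAYS = 90;

// Hypothetical helper: the phase an index of a given age should have reached.
// For rolled-over indexes, age is counted from the rollover time.
function expectedPhase(ageDays: number): Phase {
  if (ageDays >= COLD_MIN_AGE_DAYS) return 'cold';
  if (ageDays >= WARM_MIN_AGE_DAYS) return 'warm';
  return 'hot';
}

console.log(expectedPhase(10), expectedPhase(45), expectedPhase(120)); // hot warm cold
```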

Adapter usage

Indexing documents

import { ElasticsearchAdapter } from '@/storage/elasticsearch/adapter.js';

// Assumes `client` (an Elasticsearch client) and `logger` are constructed elsewhere.
const adapter = new ElasticsearchAdapter(client, logger);

// Index a preprint
await adapter.indexPreprint({
  uri: 'at://did:plc:abc/pub.chive.preprint.submission/xyz',
  title: 'Attention Is All You Need',
  abstract: 'We propose a new simple network architecture...',
  keywords: ['transformers', 'attention', 'neural networks'],
  authorDid: 'did:plc:abc',
  authorName: 'Vaswani et al.',
  fields: ['cs.AI', 'cs.CL'],
  createdAt: new Date('2017-06-12'),
});

// Bulk index
await adapter.bulkIndex(preprints, { chunkSize: 500 });

// Delete
await adapter.delete(uri);
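How `bulkIndex` batches documents internally is not shown here. The sketch below illustrates one plausible approach: split the documents into chunks and build one NDJSON body per chunk in the format the Elasticsearch bulk API expects. The helper names, the use of the `uri` as the document `_id`, and the target index argument are assumptions.

```typescript
interface IndexDoc {
  uri: string;
  [field: string]: unknown;
}

// Split items into fixed-size chunks; the last chunk may be shorter.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Build one NDJSON body per chunk: an action line followed by the document line.
function buildBulkBodies(docs: readonly IndexDoc[], index: string, chunkSize: number): string[] {
  return chunk(docs, chunkSize).map((group) => {
    const lines = group.flatMap((doc) => [
      JSON.stringify({ index: { _index: index, _id: doc.uri } }), // assumed: uri as _id
      JSON.stringify(doc),
    ]);
    return lines.join('\n') + '\n'; // the bulk API requires a trailing newline
  });
}
```

With `chunkSize: 500`, 1,200 documents would produce three bodies of 500, 500, and 200 documents.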

Searching

// Full-text search
const results = await adapter.search({
  query: 'attention mechanisms transformer',
  fields: ['title^2', 'abstract'],
  limit: 20,
  offset: 0,
});

// With filters
const filtered = await adapter.search({
  query: 'machine learning',
  filters: {
    fields: ['cs.AI', 'cs.LG'],
    dateRange: { from: '2024-01-01' },
    authors: ['did:plc:author1'],
  },
  sort: [{ created_at: 'desc' }],
});
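The `filters` option presumably maps onto filter-context clauses, which restrict results without affecting relevance scores. The following is a sketch of such a translation, assuming the mapped field names `fields`, `author_did`, and `created_at` from the index template; `toFilterClauses` is a hypothetical function, not the adapter's actual code.

```typescript
interface SearchFilters {
  fields?: string[];
  dateRange?: { from?: string; to?: string };
  authors?: string[];
}

type Clause = Record<string, unknown>;

// Hypothetical translation of the `filters` option into bool filter clauses.
function toFilterClauses(filters: SearchFilters): Clause[] {
  const result: Clause[] = [];
  if (filters.fields && filters.fields.length > 0) {
    result.push({ terms: { fields: filters.fields } });
  }
  if (filters.authors && filters.authors.length > 0) {
    result.push({ terms: { author_did: filters.authors } });
  }
  const { dateRange } = filters;
  if (dateRange && (dateRange.from || dateRange.to)) {
    const range: Record<string, string> = {};
    if (dateRange.from) range.gte = dateRange.from;
    if (dateRange.to) range.lte = dateRange.to;
    result.push({ range: { created_at: range } });
  }
  return result;
}

// The clauses go under bool.filter, next to the scored full-text query.
const exampleBody = {
  query: {
    bool: {
      must: [{ multi_match: { query: 'machine learning', fields: ['title^2', 'abstract'] } }],
      filter: toFilterClauses({
        fields: ['cs.AI', 'cs.LG'],
        dateRange: { from: '2024-01-01' },
        authors: ['did:plc:author1'],
      }),
    },
  },
  sort: [{ created_at: 'desc' }],
};
console.log(JSON.stringify(exampleBody, null, 2));
```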

Search query builder

The SearchQueryBuilder assembles an Elasticsearch request body through chained method calls; `build()` returns the finished query:

import { SearchQueryBuilder } from '@/storage/elasticsearch/search-query-builder.js';

const builder = new SearchQueryBuilder()
  .query('neural networks')
  .fields(['title^3', 'abstract^1', 'keywords^2'])
  .filter('fields', ['cs.AI'])
  .filter('dateRange', { from: '2024-01-01', to: '2024-12-31' })
  .highlight(['title', 'abstract'])
  .sort('created_at', 'desc')
  .paginate(20, 0);

const esQuery = builder.build();

Query types

Method	Elasticsearch Query
`.query(text)`	`multi_match` with cross_fields
`.phrase(text)`	`match_phrase` for exact sequences
`.prefix(text)`	`prefix` for autocomplete
`.filter(field, value)`	`term` in filter context
`.range(field, opts)`	`range` query
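To make the table concrete, here is a sketch of the Elasticsearch clause each method is described as producing. The `clauses` object is illustrative and its signatures are assumptions (for example, it takes the target field explicitly for `phrase` and `prefix`); it is not the real SearchQueryBuilder.

```typescript
type Clause = Record<string, unknown>;

// One function per row of the table above.
const clauses = {
  query: (text: string, fields: string[]): Clause => ({
    multi_match: { query: text, fields, type: 'cross_fields' },
  }),
  phrase: (field: string, text: string): Clause => ({
    match_phrase: { [field]: text },
  }),
  prefix: (field: string, text: string): Clause => ({
    prefix: { [field]: { value: text } },
  }),
  filter: (field: string, value: string): Clause => ({
    term: { [field]: value },
  }),
  range: (field: string, opts: { gte?: string; lte?: string }): Clause => ({
    range: { [field]: opts },
  }),
};

// Scored clauses go in `must`; term and range clauses go in `filter`.
const sketchQuery = {
  query: {
    bool: {
      must: [clauses.query('neural networks', ['title^3', 'abstract', 'keywords^2'])],
      filter: [
        clauses.filter('fields', 'cs.AI'),
        clauses.range('created_at', { gte: '2024-01-01', lte: '2024-12-31' }),
      ],
    },
  },
};
console.log(JSON.stringify(sketchQuery, null, 2));
```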

Faceted search

Aggregations

import { AggregationsBuilder } from '@/storage/elasticsearch/aggregations-builder.js';

const aggs = new AggregationsBuilder()
  .terms('fields', { size: 20 })
  .terms('keywords', { size: 50 })
  .dateHistogram('created_at', { interval: 'month' })
  .build();

const result = await adapter.searchWithAggregations({
  query: 'machine learning',
  aggregations: aggs,
});

// Access facet counts
for (const bucket of result.aggregations.fields.buckets) {
  console.log(`${bucket.key}: ${bucket.doc_count}`);
}
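For reference, the builder calls above would correspond to an aggregations body along these lines. `AggsSketch` is a hypothetical stand-in for AggregationsBuilder. It assumes each aggregation is named after its field and that `interval` is passed to Elasticsearch as `calendar_interval`, the parameter current versions use for calendar-aware buckets.

```typescript
type AggBody = Record<string, Record<string, unknown>>;

// Hypothetical stand-in for AggregationsBuilder; names each aggregation after its field.
class AggsSketch {
  private readonly aggs: AggBody = {};

  terms(field: string, opts: { size: number }): this {
    this.aggs[field] = { terms: { field, size: opts.size } };
    return this;
  }

  dateHistogram(field: string, opts: { interval: 'day' | 'week' | 'month' | 'year' }): this {
    this.aggs[field] = {
      date_histogram: { field, calendar_interval: opts.interval },
    };
    return this;
  }

  build(): AggBody {
    return { ...this.aggs };
  }
}

const aggsBody = new AggsSketch()
  .terms('fields', { size: 20 })
  .dateHistogram('created_at', { interval: 'month' })
  .build();
console.log(JSON.stringify(aggsBody, null, 2));
```

Naming each aggregation after its field is what makes `result.aggregations.fields.buckets` addressable in the loop above.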

PMEST facets

Chive uses PMEST classification for faceted navigation:

Dimension	Examples
Personality	Author, institution, funder
Matter	Subject field, methodology
Energy	Research type (theoretical, empirical)
Space	Geographic focus, language
Time	Publication date, era studied

Autocomplete

import { AutocompleteService } from '@/storage/elasticsearch/autocomplete-service.js';

const autocomplete = new AutocompleteService(client, logger);

// Title suggestions
const suggestions = await autocomplete.suggest('atten', {
  field: 'title.suggest',
  size: 8,
});

// Keyword suggestions
const keywords = await autocomplete.suggestKeywords('mach', { size: 10 });

// Author suggestions
const authors = await autocomplete.suggestAuthors('vas', { size: 5 });
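Title suggestions rely on the `title.suggest` completion field defined in the index template. A sketch of the request body a completion suggester query takes follows. `buildSuggestBody` and the suggester name `suggestions` are illustrative choices, and whether AutocompleteService sets `skip_duplicates` or `_source` this way is an assumption.

```typescript
interface SuggestOptions {
  field: string;
  size: number;
}

// Hypothetical helper: builds a completion suggester request body.
function buildSuggestBody(prefix: string, opts: SuggestOptions) {
  return {
    _source: false, // only the suggestion text is needed, not full documents
    suggest: {
      suggestions: {
        prefix,
        completion: {
          field: opts.field,
          size: opts.size,
          skip_duplicates: true, // collapse identical titles
        },
      },
    },
  };
}

console.log(JSON.stringify(buildSuggestBody('atten', { field: 'title.suggest', size: 8 }), null, 2));
```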

Query caching

import { QueryCache } from '@/storage/elasticsearch/query-cache.js';

const cache = new QueryCache(redis, {
  ttl: 300, // 5 minutes
  keyPrefix: 'es:cache:',
});

// Cached search
const params = { query: 'neural networks', limit: 20 };
const results = await cache.getOrFetch(params, () => adapter.search(params));
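How QueryCache derives its Redis keys is not documented here. One common approach, sketched below, is to hash a canonical serialization of the query parameters so that logically equal queries share a cache entry regardless of property order. `stableStringify` and `cacheKey` are hypothetical helpers.

```typescript
import { createHash } from 'node:crypto';

// Serialize with object keys sorted so property order does not change the output.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(',')}}`;
  }
  // JSON.stringify returns undefined for undefined input; fall back to null.
  return JSON.stringify(value) ?? 'null';
}

// Hypothetical key scheme: prefix plus a SHA-256 digest of the canonical parameters.
function cacheKey(params: unknown, keyPrefix = 'es:cache:'): string {
  const digest = createHash('sha256').update(stableStringify(params)).digest('hex');
  return `${keyPrefix}${digest}`;
}

console.log(cacheKey({ query: 'neural networks', limit: 20 }));
```

Hashing keeps keys at a fixed length even for large filter objects, at the cost of keys that are not human-readable.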

Index management

import { IndexManager } from '@/storage/elasticsearch/index-manager.js';

const manager = new IndexManager(client, logger);

// Create index with template
await manager.createIndex('preprints-000001');

// Apply template updates
await manager.updateTemplate('preprints', templateDefinition);

// Reindex (e.g., after mapping changes)
await manager.reindex('preprints-000001', 'preprints-000002');

// Force merge for read optimization
await manager.forceMerge('preprints-000001', { maxSegments: 1 });

Pipelines

Ingest pipeline

Pre-process documents before indexing:

{
  "description": "Preprint ingest pipeline",
  "processors": [
    {
      "set": {
        "field": "indexed_at",
        "value": "{{_ingest.timestamp}}"
      }
    },
    {
      "lowercase": {
        "field": "keywords"
      }
    }
  ]
}

Usage

await adapter.indexPreprint(doc, { pipeline: 'preprint-ingest' });
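To see the pipeline's effect without a cluster, the two processors can be approximated locally. This sketch is an illustration only: `simulatePreprintIngest` is a hypothetical function, and unlike the real `lowercase` processor, which fails on a missing field unless `ignore_missing` is set, it silently skips documents without `keywords`.

```typescript
interface RawDoc {
  keywords?: string[];
  [field: string]: unknown;
}

// Local approximation of the `set` and `lowercase` processors shown above.
function simulatePreprintIngest(doc: RawDoc, now: Date = new Date()): RawDoc {
  const result: RawDoc = { ...doc, indexed_at: now.toISOString() };
  if (doc.keywords) {
    // The lowercase processor converts every member of a string array.
    result.keywords = doc.keywords.map((keyword) => keyword.toLowerCase());
  }
  return result;
}

console.log(simulatePreprintIngest({ title: 'Example', keywords: ['Transformers', 'NLP'] }));
```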

Configuration

Environment variables:

Variable	Default	Description
`ELASTICSEARCH_URL`	`http://localhost:9200`	Cluster URL
`ELASTICSEARCH_USER`	None	Username (optional)
`ELASTICSEARCH_PASSWORD`	None	Password (optional)
`ELASTICSEARCH_INDEX_PREFIX`	`chive`	Index name prefix
`ELASTICSEARCH_SHARDS`	`3`	Number of shards
`ELASTICSEARCH_REPLICAS`	`2`	Number of replicas
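A minimal sketch of reading these variables with the defaults from the table. `loadElasticsearchConfig` and the `ElasticsearchConfig` shape are assumptions for illustration; Chive's actual configuration loader may differ.

```typescript
interface ElasticsearchConfig {
  url: string;
  user?: string;
  password?: string;
  indexPrefix: string;
  shards: number;
  replicas: number;
}

// Parse an integer setting, falling back to the default when it is unset.
function parseCount(raw: string | undefined, fallback: number, name: string, min: number): number {
  if (raw === undefined || raw === '') return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < min) {
    throw new Error(`${name} must be an integer >= ${min}, got "${raw}"`);
  }
  return parsed;
}

// Hypothetical loader: applies the defaults listed in the table above.
function loadElasticsearchConfig(env: Record<string, string | undefined>): ElasticsearchConfig {
  return {
    url: env.ELASTICSEARCH_URL ?? 'http://localhost:9200',
    user: env.ELASTICSEARCH_USER,
    password: env.ELASTICSEARCH_PASSWORD,
    indexPrefix: env.ELASTICSEARCH_INDEX_PREFIX ?? 'chive',
    shards: parseCount(env.ELASTICSEARCH_SHARDS, 3, 'ELASTICSEARCH_SHARDS', 1),
    replicas: parseCount(env.ELASTICSEARCH_REPLICAS, 2, 'ELASTICSEARCH_REPLICAS', 0),
  };
}

console.log(loadElasticsearchConfig({ ELASTICSEARCH_REPLICAS: '1' }));
```

In an application this would typically be called once at startup with `process.env`.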

Setup

# Apply templates and create initial index
tsx scripts/db/setup-elasticsearch.ts

# Or via npm script
pnpm db:setup:elasticsearch

Monitoring

Cluster health

curl 'http://localhost:9200/_cluster/health?pretty'

Index stats

curl 'http://localhost:9200/preprints/_stats?pretty'

Slow queries

Enable slow query logging by applying these index settings (for example with PUT /preprints/_settings):

{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}

Rebuilding

If search indexes need rebuilding:

# 1. Create new index
tsx scripts/db/create-index.ts preprints-new

# 2. Reindex from PostgreSQL
tsx scripts/db/reindex-from-pg.ts --target preprints-new

# 3. Switch alias
tsx scripts/db/switch-alias.ts preprints preprints-new

# 4. Delete old index
curl -X DELETE http://localhost:9200/preprints-old
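What `switch-alias.ts` does internally is not shown here. A zero-downtime switch is usually done with a single request to the `_aliases` endpoint, which applies all of its actions atomically so searches never see the alias pointing at nothing. The sketch below builds that request body; `buildAliasSwap` is a hypothetical helper.

```typescript
interface AliasTarget {
  index: string;
  alias: string;
}

type AliasAction = { remove: AliasTarget } | { add: AliasTarget };

// Hypothetical helper: body for POST /_aliases that moves an alias in one atomic step.
function buildAliasSwap(alias: string, oldIndex: string, newIndex: string): { actions: AliasAction[] } {
  return {
    actions: [
      { remove: { index: oldIndex, alias } },
      { add: { index: newIndex, alias } },
    ],
  };
}

console.log(JSON.stringify(buildAliasSwap('preprints', 'preprints-old', 'preprints-new'), null, 2));
```

Only after the swap succeeds is it safe to delete the old index, which is why deletion is the last step above.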

Testing

# Integration tests
pnpm test tests/integration/storage/elasticsearch-search.test.ts

# Search relevance tests
pnpm test tests/integration/storage/elasticsearch-relevance.test.ts

Related documentation

PostgreSQL Storage: Primary index storage
SearchService: Search service layer
API Layer: Search endpoints
