
Troubleshooting

Common issues and their solutions when operating Chive.

Quick diagnostics

Health check

# API health
curl -s https://api.chive.pub/health | jq

# Detailed health
curl -s https://api.chive.pub/health/ready | jq

Service status

# Kubernetes pods
kubectl get pods -n chive

# Docker containers
docker compose ps

API issues

5xx errors

Symptoms: API returning 500 errors, high error rate in metrics.

Diagnosis:

# Check API logs
kubectl logs -f deploy/chive-api -n chive | grep -i error

# Check error rate
curl -s localhost:9090/metrics | grep http_requests_total | grep status=\"5

Common causes:

Cause                          Solution
Database connection exhausted  Increase pool size, add PgBouncer
Memory exhaustion              Increase memory limits, check for leaks
External API failures          Check circuit breaker status
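For the memory-exhaustion row, restart counts and OOMKilled events are the quickest confirmation. A minimal sketch, assuming the API pods carry an app=chive-api label (the label is not named elsewhere in this guide):

# Look for restarts and OOM kills on the API pods (the app=chive-api label is an assumption)
kubectl get pods -n chive -l app=chive-api
kubectl describe pods -n chive -l app=chive-api | grep -i oomkilled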

Slow responses

Symptoms: P95 latency above 500ms.

Diagnosis:

# Check slow query log
kubectl logs deploy/chive-api -n chive | grep "slow query"

# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

Solutions:

  1. Add missing database indexes (see the EXPLAIN sketch below)
  2. Enable query caching
  3. Scale API replicas horizontally
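For the first item, EXPLAIN ANALYZE shows whether a hot query is falling back to a sequential scan. A sketch; preprints_index appears elsewhere in this guide, while the did column is purely illustrative:

# Look for "Seq Scan" in the plan; an index scan should replace it once the index exists
psql -c "EXPLAIN ANALYZE SELECT * FROM preprints_index WHERE did = 'did:plc:example';"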

Rate limiting issues

Symptoms: Users receiving 429 errors.

Diagnosis:

# Check rate limit counters (--scan avoids blocking Redis the way KEYS can)
redis-cli --scan --pattern 'ratelimit:*' | head -20

# Check specific user
redis-cli GET "ratelimit:api:did:plc:example"

Solutions:

  • Increase rate limits for authenticated users
  • Add user to higher tier
  • Check for misbehaving clients (see the counter scan below)
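To spot the heaviest consumers, the counters can be ranked directly. A sketch that assumes the rate-limit keys hold plain integer counters, as the key naming above suggests:

# Rank the busiest rate-limit counters; assumes values are plain integer counters
for key in $(redis-cli --scan --pattern 'ratelimit:*' | head -100); do
  echo "$(redis-cli GET "$key") $key"
done | sort -rn | head -10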

Firehose issues

Processing lag

Symptoms: firehose_lag_seconds metric increasing.

Diagnosis:

# Check indexer logs
kubectl logs deploy/chive-indexer -n chive

# Check cursor position
redis-cli GET firehose:cursor

Solutions:

Cause                Solution
Slow event handlers  Optimize or offload to workers
Database bottleneck  Scale database, add indexes
Network issues       Check relay connectivity
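After applying any of these, watch whether the lag metric from the Symptoms above actually drains (the :9090 metrics port appears earlier in this guide):

# Watch whether lag drains or keeps growing after a fix
watch -n 5 'curl -s localhost:9090/metrics | grep firehose_lag_seconds'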

Connection drops

Symptoms: Frequent reconnects in indexer logs.

Diagnosis:

# Check connection errors
kubectl logs deploy/chive-indexer -n chive | grep -i "disconnect\|reconnect"

Solutions:

  1. Check network connectivity to relay
  2. Increase connection timeout
  3. Verify relay URL is correct (see the probe below)
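A direct subscription attempt separates relay-side problems from indexer-side ones. A sketch using wscat; the relay host is a placeholder, and com.atproto.sync.subscribeRepos is the standard AT Protocol firehose endpoint:

# Probe the relay directly; steady binary frames mean the relay side is healthy
wscat -c 'wss://relay.example.com/xrpc/com.atproto.sync.subscribeRepos?cursor=0'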

Dead letter queue growth

Symptoms: Events accumulating in DLQ.

Diagnosis:

SELECT error, count(*)
FROM firehose_dlq
GROUP BY error
ORDER BY count(*) DESC;

Solutions:

  1. Fix the error causing failures (sample the failing events as shown below)
  2. Retry DLQ events: pnpm dlq:retry
  3. Clear stale events: pnpm dlq:clear --older-than 7d
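To act on the first item, pull full rows for the top error from the diagnosis query; the LIKE pattern here is only an example:

# Sample the most common failure from firehose_dlq; '%timeout%' is an example pattern
psql -c "SELECT * FROM firehose_dlq WHERE error LIKE '%timeout%' LIMIT 5;"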

Database issues

PostgreSQL

Connection exhaustion

Symptoms: "too many connections" errors.

-- Check active connections
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;

-- Kill connections idle for more than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';
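To judge how close to the ceiling you are, compare in-use connections with the server's max_connections setting:

# Compare in-use connections with the configured ceiling
psql -c "SELECT count(*) AS in_use, setting::int AS max_connections
         FROM pg_stat_activity, pg_settings
         WHERE pg_settings.name = 'max_connections'
         GROUP BY setting;"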

Slow queries

-- Find slow queries (on PostgreSQL 12 and older, the columns
-- are named mean_time / total_time instead)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'preprints_index';
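Once a missing index is confirmed, CREATE INDEX CONCURRENTLY adds it without blocking writes. A sketch; the created_at column and index name are illustrative:

# Add the missing index without blocking writes; the created_at column is illustrative
psql -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS preprints_index_created_at_idx
         ON preprints_index (created_at);"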

Lock contention

-- Find blocked queries
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype;

Elasticsearch

Cluster health

# Check cluster status
curl -s localhost:9200/_cluster/health?pretty

# Check shard allocation
curl -s localhost:9200/_cat/shards?v

Unassigned shards

# Find unassigned shards
curl -s localhost:9200/_cat/shards | grep UNASSIGNED

# Explain allocation
curl -s localhost:9200/_cluster/allocation/explain?pretty

Solutions:

  1. Check disk space
  2. Raise the disk watermarks (cluster.routing.allocation.disk.watermark.*; example below)
  3. Add more nodes
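For the watermark fix, the cluster settings API applies a transient override; the percentage is illustrative and should match your actual disk headroom:

# Temporarily raise the high watermark so shards can allocate; 92% is illustrative
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{ "transient": { "cluster.routing.allocation.disk.watermark.high": "92%" } }'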

Index corruption

# Rebuild index from PostgreSQL
tsx scripts/db/reindex-from-pg.ts --target preprints-new

# Switch alias
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "remove": { "index": "preprints-old", "alias": "preprints" }},
    { "add":    { "index": "preprints-new", "alias": "preprints" }}
  ]
}'

Neo4j

Memory issues

// Check page cache faults and evictions (Neo4j 4.x; dbms.queryJmx was removed in 5.x)
CALL dbms.queryJmx("org.neo4j:name=Page cache")
YIELD name, attributes
RETURN attributes.Faults.value AS faults,
       attributes.Evictions.value AS evictions;

Solution: Increase dbms.memory.pagecache.size (renamed to server.memory.pagecache.size in Neo4j 5).
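A sketch of the corresponding neo4j.conf line; the 4g value is illustrative and should be sized to hold the graph's store files:

# neo4j.conf (server.memory.pagecache.size in Neo4j 5); 4g is illustrative
dbms.memory.pagecache.size=4g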

Slow queries

// Profile query
PROFILE MATCH (f:Field)-[:PARENT_OF*]->(child)
WHERE f.id = 'cs'
RETURN child;

Redis

Memory pressure

# Check memory usage
redis-cli INFO memory

# Find big keys
redis-cli --bigkeys

Solutions:

  1. Increase maxmemory (example below)
  2. Set an appropriate maxmemory-policy (the Redis default is noeviction, which fails writes rather than evicting)
  3. Reduce TTLs on cached data
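Both knobs can be changed at runtime and then persisted in redis.conf; the values here are illustrative:

# Apply at runtime, then persist the same values in redis.conf; sizes are illustrative
redis-cli CONFIG SET maxmemory 2gb
redis-cli CONFIG SET maxmemory-policy volatile-lru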

Connection issues

# Check client list
redis-cli CLIENT LIST

# Check connection count
redis-cli INFO clients

Worker issues

Job failures

Diagnosis:

# Check failed jobs (Bull keeps the failed set as a sorted set, so use ZRANGE)
redis-cli ZRANGE bull:indexing:failed 0 10

Solutions:

  1. Check worker logs for specific errors (or inspect a failed job directly, as below)
  2. Retry failed jobs: pnpm queue:retry indexing
  3. Clear stale jobs: pnpm queue:clean indexing
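To inspect one failure in detail, Bull keeps each job's state in a hash keyed by job id; the exact field names (failedReason, stacktrace) depend on the Bull version deployed, so treat this as a sketch:

# Inspect one failed job; Bull keeps per-job state in a hash keyed by job id
JOB_ID=$(redis-cli ZRANGE bull:indexing:failed 0 0)
redis-cli HGETALL "bull:indexing:${JOB_ID}"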

Queue backup

Symptoms: Jobs accumulating faster than processed.

# Check queue depth
redis-cli LLEN bull:indexing:wait

Solutions:

  1. Scale worker replicas (example below)
  2. Increase concurrency setting
  3. Optimize job processing
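For the first option, scaling is one command; the chive-worker deployment name is an assumption (this guide only names chive-api and chive-indexer):

# Scale out workers; the chive-worker deployment name is an assumption
kubectl scale deploy/chive-worker --replicas=4 -n chive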

Frontend issues

Build failures

# Clear cache and rebuild
rm -rf .next
pnpm build

Hydration errors

Check for:

  1. Server/client content mismatch
  2. Date/time rendering differences
  3. Browser extension interference

Recovery procedures

Full index rebuild

If all indexes are corrupted:

# 1. Stop indexer
kubectl scale deploy/chive-indexer --replicas=0 -n chive

# 2. Truncate PostgreSQL indexes
psql -c "TRUNCATE preprints_index, reviews_index, endorsements_index CASCADE;"

# 3. Reset firehose cursor
psql -c "UPDATE firehose_cursor SET cursor = 0;"

# 4. Clear Elasticsearch
curl -X DELETE "localhost:9200/preprints-*"
tsx scripts/db/setup-elasticsearch.ts

# 5. Clear Neo4j graph data (keep schema)
cypher-shell "MATCH (n) DETACH DELETE n;"

# 6. Restart indexer
kubectl scale deploy/chive-indexer --replicas=1 -n chive

# 7. Monitor rebuild progress
watch 'redis-cli GET firehose:cursor'

Database restore

# PostgreSQL
pg_restore -d chive backup.dump

# Elasticsearch
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1/_restore"

# Neo4j
neo4j-admin restore --from=/backup/neo4j --database=neo4j  # Neo4j 4.x; Neo4j 5 renamed this to "neo4j-admin database restore"

Getting help

Logs to collect

When reporting issues, include the following (the sketch after this list gathers them):

  1. Service logs (last 100 lines)
  2. Metrics snapshot
  3. Database query stats
  4. Error traces
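A sketch that collects those items using commands from elsewhere in this guide; the output paths are illustrative:

# Gather the basics into one directory before filing a report
mkdir -p /tmp/chive-report && cd /tmp/chive-report
kubectl logs deploy/chive-api -n chive --tail=100 > api.log
kubectl logs deploy/chive-indexer -n chive --tail=100 > indexer.log
curl -s localhost:9090/metrics > metrics.txt
psql -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements
         ORDER BY mean_exec_time DESC LIMIT 20;" > slow-queries.txt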

Support channels