Troubleshooting
Common issues and their solutions when operating Chive.
Quick diagnostics
Health check
# API health
curl -s https://api.chive.pub/health | jq
# Detailed health
curl -s https://api.chive.pub/health/ready | jq
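For scripted checks, the two endpoints above can be wrapped in a small helper. This is a minimal sketch and not part of Chive: the function names classify_status and check_endpoint are hypothetical, and the mapping from status codes to labels is an assumption about how you might triage responses.

```shell
# classify_status: map an HTTP status code to a coarse label (hypothetical helper).
classify_status() {
  case "$1" in
    2??) echo healthy ;;
    429) echo rate_limited ;;
    5??) echo failing ;;
    *)   echo unknown ;;
  esac
}

# check_endpoint: fetch a URL and print "<url> <code> <label>".
# curl reports code 000 when the request itself fails (DNS, timeout, refused),
# which classify_status labels as "unknown".
check_endpoint() {
  local url=$1 code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  echo "$url $code $(classify_status "$code")"
}
```

Calling check_endpoint https://api.chive.pub/health/ready prints one line per call, which is easy to grep from a cron job or CI step.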
Service status
# Kubernetes pods
kubectl get pods -n chive
# Docker containers
docker compose ps
API issues
5xx errors
Symptoms: API returning 500 errors, high error rate in metrics.
Diagnosis:
# Check API logs
kubectl logs -f deploy/chive-api -n chive | grep -i error
# Check error rate
curl -s localhost:9090/metrics | grep http_requests_total | grep status=\"5
Common causes:
| Cause | Solution |
|---|---|
| Database connection exhausted | Increase pool size, add PgBouncer |
| Memory exhaustion | Increase memory limits, check for leaks |
| External API failures | Check circuit breaker status |
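The metrics check in the diagnosis step can be reduced to a single number. The sketch below assumes the counter is exposed in the Prometheus text format with a status label, as the grep above implies, and that samples carry no trailing timestamp. The function name error_ratio is hypothetical. Because counters are cumulative, the result is the ratio since the process started, not a recent rate.

```shell
# error_ratio: read Prometheus text-format metrics on stdin and print the
# fraction of http_requests_total samples whose status label is 5xx.
error_ratio() {
  awk '
    /^http_requests_total[{]/ {
      total += $NF                                   # sample value is the last field
      if ($0 ~ /status="5[0-9][0-9]"/) errors += $NF
    }
    END {
      if (total > 0) printf "%.4f\n", errors / total
      else print "0.0000"
    }
  '
}
```

Usage: curl -s localhost:9090/metrics | error_ratio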
Slow responses
Symptoms: P95 latency above 500ms.
Diagnosis:
# Check slow query log
kubectl logs deploy/chive-api -n chive | grep "slow query"
# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
Solutions:
- Add missing database indexes
- Enable query caching
- Scale API replicas horizontally
Rate limiting issues
Symptoms: Users receiving 429 errors.
Diagnosis:
# Check rate limit counters (SCAN-based; KEYS blocks Redis on large keyspaces)
redis-cli --scan --pattern "ratelimit:*" | head -20
# Check specific user
redis-cli GET "ratelimit:api:did:plc:example"
Solutions:
- Increase rate limits for authenticated users
- Add user to higher tier
- Check for misbehaving clients
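To spot misbehaving clients, it can help to rank the counters by value. The sketch below assumes each ratelimit:* key holds a plain integer readable with GET, as the diagnosis commands suggest. The function names dump_counters and top_counters are hypothetical.

```shell
# dump_counters: print "<key> <value>" for every key matching a pattern.
# Uses SCAN under the hood so it does not block Redis the way KEYS can.
dump_counters() {
  redis-cli --scan --pattern "$1" | while IFS= read -r key; do
    printf '%s %s\n' "$key" "$(redis-cli GET "$key")"
  done
}

# top_counters: read "<key> <value>" lines on stdin and print the N largest
# (default 10), highest value first.
top_counters() {
  sort -k2,2nr | head -n "${1:-10}"
}
```

Usage: dump_counters "ratelimit:*" | top_counters 5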
Firehose issues
Processing lag
Symptoms: firehose_lag_seconds metric increasing.
Diagnosis:
# Check indexer logs
kubectl logs deploy/chive-indexer -n chive
# Check cursor position
redis-cli GET firehose:cursor
Solutions:
| Cause | Solution |
|---|---|
| Slow event handlers | Optimize or offload to workers |
| Database bottleneck | Scale database, add indexes |
| Network issues | Check relay connectivity |
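To tell a slow indexer from a stalled one, sample the cursor twice and compute the processing rate. This sketch assumes firehose:cursor holds a monotonically increasing integer sequence number. The function names cursor_rate and sample_cursor_rate are hypothetical.

```shell
# cursor_rate: events processed per second between two cursor samples.
# Args: <earlier cursor> <later cursor> <seconds between samples>
cursor_rate() {
  local old=$1 new=$2 secs=$3
  if [ "$secs" -le 0 ]; then
    echo "interval must be positive" >&2
    return 1
  fi
  echo $(( (new - old) / secs ))   # integer division, rounds toward zero
}

# sample_cursor_rate: read firehose:cursor twice, <secs> apart (default 30).
sample_cursor_rate() {
  local secs=${1:-30} a b
  a=$(redis-cli GET firehose:cursor)
  sleep "$secs"
  b=$(redis-cli GET firehose:cursor)
  cursor_rate "$a" "$b" "$secs"
}
```

A rate of 0 while firehose_lag_seconds keeps rising points to a stalled indexer rather than a slow one.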
Connection drops
Symptoms: Frequent reconnects in indexer logs.
Diagnosis:
# Check connection errors
kubectl logs deploy/chive-indexer -n chive | grep -i "disconnect\|reconnect"
Solutions:
- Check network connectivity to relay
- Increase connection timeout
- Verify relay URL is correct
Dead letter queue growth
Symptoms: Events accumulating in DLQ.
Diagnosis:
SELECT error, count(*)
FROM firehose_dlq
GROUP BY error
ORDER BY count(*) DESC;
Solutions:
- Fix the error causing failures
- Retry DLQ events:
  pnpm dlq:retry
- Clear stale events:
  pnpm dlq:clear --older-than 7d
Database issues
PostgreSQL
Connection exhaustion
Symptoms: "too many connections" errors.
-- Check active connections
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;
-- Kill connections that have been idle for more than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes'
AND pid <> pg_backend_pid();
Slow queries
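-- Find slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the columns are mean_time and total_time)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;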
-- Inspect column statistics to identify index candidates
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'preprints_index';
Lock contention
-- Find blocked queries and the sessions blocking them (PostgreSQL 9.6+)
SELECT blocked.pid AS blocked_pid,
blocking.pid AS blocking_pid,
blocked.query AS blocked_query,
blocking.query AS blocking_query
FROM pg_catalog.pg_stat_activity blocked
JOIN pg_catalog.pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
Elasticsearch
Cluster health
# Check cluster status (URL quoted so the shell does not glob the "?")
curl -s "localhost:9200/_cluster/health?pretty"
# Check shard allocation
curl -s "localhost:9200/_cat/shards?v"
Unassigned shards
# Find unassigned shards
curl -s localhost:9200/_cat/shards | grep UNASSIGNED
# Explain allocation
curl -s "localhost:9200/_cluster/allocation/explain?pretty"
Solutions:
- Check disk space
- Raise the disk watermarks (cluster.routing.allocation.disk.watermark.low and .high)
- Add more nodes
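When many shards are unassigned, a per-index count shows where to look first. The sketch below relies on the default _cat/shards column order (index, shard, prirep, state, ...). The function name unassigned_by_index is hypothetical.

```shell
# unassigned_by_index: read _cat/shards output on stdin and print
# "<index> <count>" for every index with unassigned shards, sorted by name.
unassigned_by_index() {
  awk '$4 == "UNASSIGNED" { n[$1]++ } END { for (i in n) print i, n[i] }' | sort
}
```

Usage: curl -s localhost:9200/_cat/shards | unassigned_by_index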
Index corruption
# Rebuild index from PostgreSQL
tsx scripts/db/reindex-from-pg.ts --target preprints-new
# Switch alias
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
"actions": [
{ "remove": { "index": "preprints-old", "alias": "preprints" }},
{ "add": { "index": "preprints-new", "alias": "preprints" }}
]
}'
Neo4j
Memory issues
// Check memory usage
CALL dbms.queryJmx("org.neo4j:name=Page cache")
YIELD name, attributes
RETURN attributes.Faults.value AS faults,
attributes.Evictions.value AS evictions;
Solution: Increase dbms.memory.pagecache.size (named server.memory.pagecache.size in Neo4j 5 and later).
Slow queries
// Profile query
PROFILE MATCH (f:Field)-[:PARENT_OF*]->(child)
WHERE f.id = 'cs'
RETURN child;
Redis
Memory pressure
# Check memory usage
redis-cli INFO memory
# Find big keys
redis-cli --bigkeys
Solutions:
- Increase maxmemory
- Set an appropriate maxmemory-policy (default: volatile-lru)
- Reduce TTLs on cached data
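Before changing maxmemory, it helps to know how close Redis is to the current limit. The sketch below parses the used_memory and maxmemory fields from INFO memory. The function name memory_usage_pct is hypothetical.

```shell
# memory_usage_pct: read `redis-cli INFO memory` output on stdin and print
# used_memory as a whole-number percentage of maxmemory.
# Prints "unlimited" when maxmemory is 0 (no limit configured) or absent.
memory_usage_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 > 0) printf "%d\n", used * 100 / max
      else print "unlimited"
    }
  '
}
```

Usage: redis-cli INFO memory | memory_usage_pct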
Connection issues
# Check client list
redis-cli CLIENT LIST
# Check connection count
redis-cli INFO clients
Worker issues
Job failures
Diagnosis:
# Check failed jobs (Bull keeps failed job IDs in a sorted set, not a list)
redis-cli ZRANGE bull:indexing:failed 0 10
Solutions:
- Check worker logs for specific errors
- Retry failed jobs:
  pnpm queue:retry indexing
- Clear stale jobs:
  pnpm queue:clean indexing
Queue backup
Symptoms: Jobs accumulating faster than processed.
# Check queue depth
redis-cli LLEN bull:indexing:wait
Solutions:
- Scale worker replicas
- Increase concurrency setting
- Optimize job processing
Frontend issues
Build failures
# Clear cache and rebuild
rm -rf .next
pnpm build
Hydration errors
Check for:
- Server/client content mismatch
- Date/time rendering differences
- Browser extension interference
Recovery procedures
Full index rebuild
If all indexes are corrupted:
# 1. Stop indexer
kubectl scale deploy/chive-indexer --replicas=0 -n chive
# 2. Truncate PostgreSQL indexes
psql -c "TRUNCATE preprints_index, reviews_index, endorsements_index CASCADE;"
# 3. Reset firehose cursor
psql -c "UPDATE firehose_cursor SET cursor = 0;"
# 4. Clear Elasticsearch
curl -X DELETE "localhost:9200/preprints-*"
tsx scripts/db/setup-elasticsearch.ts
# 5. Clear Neo4j graph data (keep schema)
cypher-shell "MATCH (n) DETACH DELETE n;"
# 6. Restart indexer
kubectl scale deploy/chive-indexer --replicas=1 -n chive
# 7. Monitor rebuild progress
watch 'redis-cli GET firehose:cursor'
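Steps 2 through 5 above destroy data, so if you script this procedure a confirmation guard is worth considering. The following is a minimal sketch. The function name confirm_destructive is hypothetical, and requiring the operator to type the namespace is one possible convention.

```shell
# confirm_destructive: prompt on stderr and succeed only if the operator
# types the expected word exactly. Reads the answer from stdin.
confirm_destructive() {
  local expected=$1 answer
  printf 'Type "%s" to continue: ' "$expected" >&2
  read -r answer
  [ "$answer" = "$expected" ]
}
```

Usage: confirm_destructive chive && kubectl scale deploy/chive-indexer --replicas=0 -n chive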
Database restore
# PostgreSQL
pg_restore -d chive backup.dump
# Elasticsearch
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1/_restore"
# Neo4j
neo4j-admin restore --from=/backup/neo4j --database=neo4j
Getting help
Logs to collect
When reporting issues, include:
- Service logs (last 100 lines)
- Metrics snapshot
- Database query stats
- Error traces
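The service logs in the list above can be gathered in one step. This sketch covers only the two deployments named earlier in this guide (chive-api and chive-indexer). The function name collect_logs and the output layout are hypothetical.

```shell
# collect_logs: save the last 100 log lines of each deployment, plus the pod
# list, into a directory that can be attached to an issue.
# Args: <output dir> [namespace, default "chive"]
collect_logs() {
  local outdir=$1 ns=${2:-chive} d
  mkdir -p "$outdir"
  for d in chive-api chive-indexer; do
    kubectl logs "deploy/$d" -n "$ns" --tail=100 > "$outdir/$d.log" 2>&1
  done
  kubectl get pods -n "$ns" > "$outdir/pods.txt" 2>&1
}
```

Usage: collect_logs ./chive-report. Review the files for secrets before sharing them.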
Support channels
- GitHub Issues: https://github.com/chive-pub/chive/issues
- Discussions: https://github.com/chive-pub/chive/discussions
Related documentation
- Deployment: Setup reference
- Monitoring: Observability tools
- Scaling: Performance tuning