Monitoring

Chive uses OpenTelemetry for observability with Prometheus metrics and Grafana dashboards.

Architecture

Metrics

Prometheus configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'chive-api'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: chive-api
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: '9090'
        action: keep
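
The two `keep` rules above can be modelled in a few lines. This is a hypothetical sketch of the matching logic, not Prometheus code: the names `KeepRule`, `matchesKeepRule`, and `isScraped` are invented for illustration. It relies on two documented Prometheus behaviours: source label values are joined with `;`, and the regex is anchored at both ends. Prometheus uses RE2 syntax, which agrees with JavaScript regexes for simple patterns like these.

```typescript
// Hypothetical model of Prometheus "keep" relabeling; names are illustrative.
type TargetLabels = Record<string, string>;

interface KeepRule {
  sourceLabels: string[];
  regex: string;
}

// Join the source label values with ";" and match against a fully anchored regex.
function matchesKeepRule(labels: TargetLabels, rule: KeepRule): boolean {
  const joined = rule.sourceLabels.map((name) => labels[name] ?? "").join(";");
  return new RegExp(`^(?:${rule.regex})$`).test(joined);
}

// A discovered pod is scraped only if every keep rule matches.
function isScraped(labels: TargetLabels, rules: KeepRule[]): boolean {
  return rules.every((rule) => matchesKeepRule(labels, rule));
}

// The two rules from the chive-api job.
const chiveApiRules: KeepRule[] = [
  { sourceLabels: ["__meta_kubernetes_pod_label_app"], regex: "chive-api" },
  { sourceLabels: ["__meta_kubernetes_pod_container_port_number"], regex: "9090" },
];
```

Because the regex is anchored, a pod labelled `app=chive-api-worker` would not match `chive-api` and would be dropped.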

Key metrics

API metrics

| Metric | Type | Description |
| --- | --- | --- |
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_flight | Gauge | Current active requests |

Firehose metrics

| Metric | Type | Description |
| --- | --- | --- |
| firehose_events_total | Counter | Events received |
| firehose_events_processed | Counter | Events successfully processed |
| firehose_lag_seconds | Gauge | Processing lag |
| firehose_cursor | Gauge | Current cursor position |

Database metrics

| Metric | Type | Description |
| --- | --- | --- |
| db_pool_connections | Gauge | Active connections |
| db_query_duration_seconds | Histogram | Query latency |
| db_errors_total | Counter | Query errors |

Cache metrics

| Metric | Type | Description |
| --- | --- | --- |
| cache_hits_total | Counter | Cache hits |
| cache_misses_total | Counter | Cache misses |
| cache_size_bytes | Gauge | Cache memory usage |

Custom metrics

Add custom metrics in your code:

import { metrics } from '@/observability/metrics.js';

// Counter
metrics.incrementCounter('custom_events_total', { type: 'example' });

// Histogram
metrics.observeHistogram('processing_duration_seconds', durationMs / 1000);

// Gauge
metrics.setGauge('queue_depth', queueLength);
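
To make the three calls above concrete, here is a minimal in-memory stand-in with the same method names and argument order. It is a hypothetical sketch for tests or local experiments: the class `InMemoryMetrics` and its read-back helpers are assumptions, and the real `@/observability/metrics.js` module may behave differently (for example, by bucketing histogram samples).

```typescript
// Hypothetical in-memory stand-in; not the actual Chive metrics module.
type MetricLabels = Record<string, string>;

class InMemoryMetrics {
  private counters = new Map<string, number>();
  private gauges = new Map<string, number>();
  private histograms = new Map<string, number[]>();

  // Build a stable series key such as: custom_events_total{type="example"}
  private key(name: string, labels: MetricLabels = {}): string {
    const parts = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`);
    return parts.length > 0 ? `${name}{${parts.join(",")}}` : name;
  }

  // Counters only go up; each call adds `by` (default 1).
  incrementCounter(name: string, labels?: MetricLabels, by = 1): void {
    const k = this.key(name, labels);
    this.counters.set(k, (this.counters.get(k) ?? 0) + by);
  }

  // Histograms record every observation; this sketch keeps raw samples.
  observeHistogram(name: string, value: number, labels?: MetricLabels): void {
    const k = this.key(name, labels);
    const samples = this.histograms.get(k) ?? [];
    samples.push(value);
    this.histograms.set(k, samples);
  }

  // Gauges hold the most recent value.
  setGauge(name: string, value: number, labels?: MetricLabels): void {
    this.gauges.set(this.key(name, labels), value);
  }

  // Read-back helpers (assumed, for inspection in tests).
  counterValue(name: string, labels?: MetricLabels): number {
    return this.counters.get(this.key(name, labels)) ?? 0;
  }

  gaugeValue(name: string, labels?: MetricLabels): number | undefined {
    return this.gauges.get(this.key(name, labels));
  }

  histogramSamples(name: string, labels?: MetricLabels): number[] {
    return [...(this.histograms.get(this.key(name, labels)) ?? [])];
  }
}
```

Note that histogram observations are in seconds, which is why the example above divides `durationMs` by 1000 before recording.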

Grafana dashboards

Pre-built dashboards

| Dashboard | ID | Description |
| --- | --- | --- |
| Chive Overview | 1 | High-level system health |
| API Performance | 2 | Request latency, error rates |
| Firehose Status | 3 | Event processing, lag |
| Database Health | 4 | Connections, query performance |

Dashboard configuration

Grafana dashboards are configured via Kubernetes ConfigMaps in k8s/monitoring/grafana-dashboards.yaml.

Key panels

API overview

  • Requests per second
  • Error rate (%)
  • P50/P95/P99 latency
  • Active connections

Firehose status

  • Events per second
  • Processing lag
  • Error rate
  • Queue depth

Alerting

Prometheus alerting rules

# alerts.yml
groups:
  - name: chive
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: Error rate is {{ $value | humanizePercentage }}

      - alert: FirehoseLag
        expr: firehose_lag_seconds > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Firehose processing lag
          description: Lag is {{ $value | humanizeDuration }}

      - alert: DatabaseConnectionPoolExhausted
        expr: db_pool_connections / db_pool_max_connections > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Database connection pool near capacity
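
The `for:` clause means a rule fires only after its expression has stayed true for the whole duration; until then the alert is pending. The sketch below models that behaviour for the `HighErrorRate` rule. It is an illustrative model, not Prometheus code: `errorRatio`, `AlertTracker`, and the millisecond timestamps are assumptions made for the example.

```typescript
// Hypothetical model of a Prometheus alert with a `for` clause.
type AlertState = "inactive" | "pending" | "firing";

// Ratio of 5xx responses to all responses, as in the HighErrorRate expression.
// With zero traffic this returns 0; in PromQL 0/0 is NaN, which also fails
// the "> 0.01" comparison, so neither version alerts.
function errorRatio(errorsPerSecond: number, requestsPerSecond: number): number {
  return requestsPerSecond === 0 ? 0 : errorsPerSecond / requestsPerSecond;
}

class AlertTracker {
  private readonly forMs: number;
  private activeSinceMs: number | null = null;

  constructor(forMs: number) {
    this.forMs = forMs;
  }

  // Call once per evaluation with the current result of the rule expression.
  evaluate(conditionTrue: boolean, nowMs: number): AlertState {
    if (!conditionTrue) {
      this.activeSinceMs = null; // any false evaluation resets the timer
      return "inactive";
    }
    if (this.activeSinceMs === null) {
      this.activeSinceMs = nowMs;
    }
    return nowMs - this.activeSinceMs >= this.forMs ? "firing" : "pending";
  }
}
```

A single evaluation below the threshold resets the timer, so brief error spikes shorter than five minutes never page anyone.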

Alert destinations

Configure in Alertmanager:

# alertmanager.yml
route:
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#chive-alerts'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
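
The routing tree above sends critical alerts to PagerDuty and everything else to Slack. The sketch below models that decision under the assumption that no route sets `continue: true`: the first child route whose `match` labels all equal the alert's labels wins, and otherwise the root receiver is used. The type and function names are hypothetical.

```typescript
// Hypothetical model of Alertmanager route selection (no `continue`, one level deep).
type AlertLabels = Record<string, string>;

interface ChildRoute {
  match: AlertLabels;
  receiver: string;
}

interface RootRoute {
  receiver: string;
  routes: ChildRoute[];
}

// Return the receiver of the first matching child route, or the root default.
function resolveReceiver(root: RootRoute, labels: AlertLabels): string {
  for (const child of root.routes) {
    const allMatch = Object.keys(child.match).every(
      (name) => labels[name] === child.match[name],
    );
    if (allMatch) {
      return child.receiver;
    }
  }
  return root.receiver;
}

// Mirrors the alertmanager.yml shown above.
const chiveRoute: RootRoute = {
  receiver: "slack-notifications",
  routes: [{ match: { severity: "critical" }, receiver: "pagerduty" }],
};
```

With these rules, `HighErrorRate` (severity `critical`) reaches PagerDuty, while `FirehoseLag` and `DatabaseConnectionPoolExhausted` (severity `warning`) go to the `#chive-alerts` Slack channel.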

Distributed tracing

OpenTelemetry traces are exported via the OTLP protocol. Trace context propagates automatically through:

  • HTTP requests (via W3C Trace Context headers)
  • Queue jobs (via metadata)
  • Database queries (via spans)
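
For HTTP requests, W3C Trace Context carries the trace in a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below parses and formats version `00` headers only. It is meant to show what is on the wire, and the function names are hypothetical; in practice the OpenTelemetry SDK handles propagation, so application code should not need this.

```typescript
// Minimal sketch of the W3C traceparent header (version 00 only).
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span id)
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers or all-zero ids, which the spec forbids.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) {
    return null;
  }
  const [, traceId = "", parentId = "", flags = "00"] = m;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Serialises a context back into a header value. Only the sampled flag is kept.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}
```

A downstream service keeps the same `traceId` and substitutes its own span id as `parentId` when it makes further calls, which is how spans from different services join into one trace.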

Key operations

| Operation | Description |
| --- | --- |
| HTTP GET /xrpc/* | XRPC requests |
| firehose.process | Event processing |
| db.query | Database queries |
| cache.get/set | Cache operations |
| external.*.call | External API calls |

Logging

Log aggregation

# fluent-bit config
[INPUT]
    Name tail
    Path /var/log/containers/chive-*.log
    Parser docker
    Tag chive.*

[OUTPUT]
    Name loki
    Match chive.*
    Host loki.monitoring.svc
    Port 3100
    Labels app=chive

Log levels

| Level | When to use |
| --- | --- |
| error | Unexpected errors requiring attention |
| warn | Degraded performance or retries |
| info | Normal operations (requests, events) |
| debug | Detailed debugging (not in production) |

Structured logging

logger.info('Eprint indexed', {
  uri: eprint.uri,
  authorDid: eprint.authorDid,
  duration: indexDurationMs,
});
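
The call above passes a message plus a flat object of fields, which a structured logger typically serialises as one JSON object per line so that Loki can filter on individual keys. The following is a minimal sketch of such a logger with level filtering. The class `JsonLogger`, its constructor arguments, and the output field names `level` and `message` are assumptions; Chive's actual logger may emit a different shape.

```typescript
// Hypothetical minimal structured logger; not Chive's actual logger.
type LogLevel = "error" | "warn" | "info" | "debug";
type LogFields = Record<string, unknown>;

// Lower rank means more severe.
const LEVEL_RANK: Record<LogLevel, number> = { error: 0, warn: 1, info: 2, debug: 3 };

class JsonLogger {
  private readonly minLevel: LogLevel;
  private readonly sink: (line: string) => void;

  constructor(minLevel: LogLevel, sink: (line: string) => void) {
    this.minLevel = minLevel;
    this.sink = sink;
  }

  // Drop entries less severe than minLevel; emit the rest as one JSON line.
  private write(level: LogLevel, message: string, fields: LogFields): void {
    if (LEVEL_RANK[level] > LEVEL_RANK[this.minLevel]) {
      return;
    }
    this.sink(JSON.stringify({ level, message, ...fields }));
  }

  error(message: string, fields: LogFields = {}): void {
    this.write("error", message, fields);
  }

  warn(message: string, fields: LogFields = {}): void {
    this.write("warn", message, fields);
  }

  info(message: string, fields: LogFields = {}): void {
    this.write("info", message, fields);
  }

  debug(message: string, fields: LogFields = {}): void {
    this.write("debug", message, fields);
  }
}
```

Constructing the logger with `"info"` as the minimum level matches the guidance that `debug` output stays out of production.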

SLOs and SLIs

Service level indicators

| SLI | Measurement |
| --- | --- |
| Availability | sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Latency | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) |
| Throughput | sum(rate(http_requests_total[5m])) |
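
The Latency SLI depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below follows that approach for classic histograms with positive bucket bounds. It is an approximation for illustration, not the Prometheus implementation, and it assumes 0 < q < 1; the `Bucket` type and function name are hypothetical.

```typescript
// Sketch of quantile estimation from cumulative histogram buckets.
interface Bucket {
  le: number;    // upper bound of the bucket; Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  let total = 0;
  for (const b of sorted) {
    total = b.count; // counts are cumulative, so the last bucket holds the total
  }
  if (sorted.length < 2 || total === 0) {
    return NaN;
  }
  const rank = q * total;
  let prev: Bucket = { le: 0, count: 0 }; // the first bucket is assumed to start at 0
  for (const b of sorted) {
    if (b.count >= rank) {
      if (b.le === Infinity) {
        return prev.le; // rank falls in +Inf: report the highest finite bound
      }
      const inBucket = b.count - prev.count;
      if (inBucket === 0) {
        return prev.le;
      }
      // Assume observations are spread evenly within the bucket.
      return prev.le + (b.le - prev.le) * ((rank - prev.count) / inBucket);
    }
    prev = b;
  }
  return NaN;
}
```

Because of the interpolation, the reported P95 can never be more precise than the bucket boundaries. If the 500 ms target matters, it helps to have a bucket boundary at or near 0.5 seconds.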

Service level objectives

| SLO | Target |
| --- | --- |
| API availability | 99.9% |
| P95 latency | < 500ms |
| Firehose lag | < 5 minutes |
| Error rate | < 0.1% |
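
An availability target translates directly into an error budget. The sketch below shows the arithmetic, assuming a 30-day evaluation window; the window length is not stated above, and the function names are hypothetical.

```typescript
// Error budget arithmetic for an availability SLO.

// Minutes of full downtime allowed within the window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloTarget);
}

// Fraction of the request-based error budget still unspent (negative if overspent).
function budgetRemaining(sloTarget: number, totalRequests: number, failedRequests: number): number {
  const allowedFailures = totalRequests * (1 - sloTarget);
  if (allowedFailures === 0) {
    return failedRequests === 0 ? 1 : -Infinity;
  }
  return 1 - failedRequests / allowedFailures;
}
```

For the 99.9% target, a 30-day window contains 43,200 minutes, so the budget is about 43.2 minutes of downtime. Measured by requests, 500 failures out of 1,000,000 would consume roughly half of the budget.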

Health monitoring

Kubernetes monitoring

# View pod status
kubectl get pods -n chive

# View events
kubectl get events -n chive --sort-by=.lastTimestamp

# View resource usage
kubectl top pods -n chive

Database monitoring

# PostgreSQL
psql -c "SELECT * FROM pg_stat_activity;"

# Elasticsearch (URL quoted so the shell does not interpret "?")
curl "http://localhost:9200/_cluster/health?pretty"

# Neo4j
cypher-shell "CALL dbms.queryJmx('org.neo4j:*');"

# Redis
redis-cli INFO

Next steps