Monitoring

Chive uses OpenTelemetry for observability, with metrics in Prometheus, dashboards in Grafana, and distributed tracing in Jaeger.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Chive     │────▶│    OTel     │────▶│ Prometheus  │
│  Services   │     │  Collector  │     │             │
└─────────────┘     └──────┬──────┘     └──────┬──────┘
                           │                   │
                           ▼                   ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Jaeger    │     │   Grafana   │
                    │  (Traces)   │     │ (Dashboards)│
                    └─────────────┘     └─────────────┘
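
Each service initializes the OpenTelemetry Node SDK at startup, sending traces to the collector and exposing a Prometheus scrape endpoint. The sketch below is a minimal example; the collector URL and the metrics port (9090, matching the scrape config in the next section) are assumptions, not fixed values.

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// Traces go to the OTel Collector; metrics are served on :9090 for Prometheus to scrape.
const sdk = new NodeSDK({
  serviceName: 'chive-api',
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  metricReader: new PrometheusExporter({ port: 9090 }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();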

Metrics

Prometheus configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'chive-api'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: chive-api
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: '9090'
        action: keep
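
These relabel rules keep only pods labeled app: chive-api and scrape the container port numbered 9090, where each service exposes its Prometheus metrics endpoint.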

Key metrics

API metrics

Metric                          Type        Description
http_requests_total             Counter     Total HTTP requests
http_request_duration_seconds   Histogram   Request latency
http_requests_in_flight         Gauge       Current active requests
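
These are typically recorded in HTTP middleware. Below is a rough sketch that uses the custom metrics helper described later in this page and assumes an Express-style request pipeline; the middleware name is illustrative.

import type { Request, Response, NextFunction } from 'express';
import { metrics } from '@/observability/metrics.js';

let inFlight = 0;

// Records the three API metrics for every request.
export function requestMetrics(req: Request, res: Response, next: NextFunction) {
  const start = process.hrtime.bigint();
  metrics.setGauge('http_requests_in_flight', ++inFlight);

  res.on('finish', () => {
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    metrics.incrementCounter('http_requests_total', {
      method: req.method,
      status: String(res.statusCode),
    });
    metrics.observeHistogram('http_request_duration_seconds', seconds);
    metrics.setGauge('http_requests_in_flight', --inFlight);
  });

  next();
}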

Firehose metrics

Metric                       Type      Description
firehose_events_total        Counter   Events received
firehose_events_processed    Counter   Events successfully processed
firehose_lag_seconds         Gauge     Processing lag
firehose_cursor              Gauge     Current cursor position
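
The lag and cursor gauges are usually updated once per event in the consumer loop. A sketch follows; the event shape (seq, time) and function name are assumptions about the firehose payload, not the actual consumer code.

import { metrics } from '@/observability/metrics.js';

// Call after each firehose event has been handled.
function recordFirehoseEvent(evt: { seq: number; time: string }, processedOk: boolean) {
  metrics.incrementCounter('firehose_events_total', {});
  if (processedOk) metrics.incrementCounter('firehose_events_processed', {});
  metrics.setGauge('firehose_cursor', evt.seq);
  metrics.setGauge('firehose_lag_seconds', (Date.now() - Date.parse(evt.time)) / 1000);
}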

Database metrics

Metric                       Type        Description
db_pool_connections          Gauge       Active connections
db_query_duration_seconds    Histogram   Query latency
db_errors_total              Counter     Query errors

Cache metrics

Metric                Type      Description
cache_hits_total      Counter   Cache hits
cache_misses_total    Counter   Cache misses
cache_size_bytes      Gauge     Cache memory usage

Custom metrics

Add custom metrics in your code:

import { metrics } from '@/observability/metrics.js';

// Counter
metrics.incrementCounter('custom_events_total', { type: 'example' });

// Histogram
metrics.observeHistogram('processing_duration_seconds', durationMs / 1000);

// Gauge
metrics.setGauge('queue_depth', queueLength);
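
Under the hood, a helper like this can wrap the OpenTelemetry Metrics API so that custom instruments flow through the same Prometheus exporter. The following is a sketch of one possible implementation, not the actual module.

import { metrics as otel, type Counter, type Histogram } from '@opentelemetry/api';

const meter = otel.getMeter('chive');
const counters = new Map<string, Counter>();
const histograms = new Map<string, Histogram>();

export const metrics = {
  // Lazily create instruments so callers only pass a name and labels.
  incrementCounter(name: string, labels: Record<string, string> = {}) {
    if (!counters.has(name)) counters.set(name, meter.createCounter(name));
    counters.get(name)!.add(1, labels);
  },
  observeHistogram(name: string, value: number) {
    if (!histograms.has(name)) histograms.set(name, meter.createHistogram(name));
    histograms.get(name)!.record(value);
  },
  // setGauge would typically be backed by an observable gauge
  // (meter.createObservableGauge) reporting the last stored value.
};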

Grafana dashboards

Pre-built dashboards

Dashboard          ID   Description
Chive Overview     1    High-level system health
API Performance    2    Request latency, error rates
Firehose Status    3    Event processing, lag
Database Health    4    Connections, query performance

Dashboard JSON

Import dashboards from:

charts/chive/dashboards/
├── overview.json
├── api-performance.json
├── firehose-status.json
└── database-health.json

Key panels

API overview

  • Requests per second
  • Error rate (%)
  • P50/P95/P99 latency
  • Active connections

Firehose status

  • Events per second
  • Processing lag
  • Error rate
  • Queue depth

Alerting

Prometheus alerting rules

# alerts.yml
groups:
  - name: chive
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: Error rate is {{ $value | humanizePercentage }}

      - alert: FirehoseLag
        expr: firehose_lag_seconds > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Firehose processing lag
          description: Lag is {{ $value | humanizeDuration }}

      - alert: DatabaseConnectionPoolExhausted
        expr: db_pool_connections / db_pool_max_connections > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Database connection pool near capacity
Alert destinations

Configure in Alertmanager:

# alertmanager.yml
route:
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#chive-alerts'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}

Distributed tracing

Jaeger setup

# jaeger.yml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: chive-jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200

Trace context

Traces propagate automatically through:

  • HTTP requests (via headers)
  • Queue jobs (via metadata)
  • Database queries (via spans)
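
Work that falls outside these auto-instrumented paths can be wrapped in a manual span. The sketch below uses the OpenTelemetry tracing API; processEvent is a hypothetical placeholder for the actual handler.

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('chive');

// Hypothetical handler; stands in for whatever work the span covers.
declare function processEvent(event: unknown): Promise<void>;

export async function handleWithSpan(event: unknown) {
  // startActiveSpan makes the span current, so nested spans attach to it.
  await tracer.startActiveSpan('firehose.process', async (span) => {
    try {
      await processEvent(event);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}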

Viewing traces

  1. Open Jaeger UI at https://jaeger.chive.pub
  2. Select service: chive-api
  3. Search by trace ID or operation name
  4. View waterfall diagram

Key operations

Operation           Description
HTTP GET /xrpc/*    XRPC requests
firehose.process    Event processing
db.query            Database queries
cache.get/set       Cache operations
external.*.call     External API calls

Logging

Log aggregation

# fluent-bit config
[INPUT]
    Name    tail
    Path    /var/log/containers/chive-*.log
    Parser  docker
    Tag     chive.*

[OUTPUT]
    Name    loki
    Match   chive.*
    Host    loki.monitoring.svc
    Port    3100
    Labels  app=chive

Log levels

Level   When to use
error   Unexpected errors requiring attention
warn    Degraded performance or retries
info    Normal operations (requests, events)
debug   Detailed debugging (not in production)

Structured logging

logger.info('Preprint indexed', {
  uri: preprint.uri,
  authorDid: preprint.authorDid,
  duration: indexDurationMs,
});
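
Including the active trace ID in log fields makes it possible to jump from a log line to the corresponding trace in Jaeger. A sketch using the OpenTelemetry API alongside the same logger; it assumes a span is active in the current request context.

import { trace } from '@opentelemetry/api';

// Attach the current trace ID (if a span is active) to the log entry.
const activeSpan = trace.getActiveSpan();

logger.info('Preprint indexed', {
  uri: preprint.uri,
  authorDid: preprint.authorDid,
  traceId: activeSpan?.spanContext().traceId,
});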

SLOs and SLIs

Service Level Indicators

SLI           Measurement
Availability  sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Latency       histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Throughput    sum(rate(http_requests_total[5m]))

Service Level Objectives

SLO               Target
API availability  99.9%
P95 latency       < 500ms
Firehose lag      < 5 minutes
Error rate        < 0.1%
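
For context, the 99.9% availability target implies a small, concrete error budget; the arithmetic below assumes a 30-day window.

// Error budget implied by a 99.9% availability SLO over a 30-day window.
const slo = 0.999;
const windowMinutes = 30 * 24 * 60;                   // 43,200 minutes
const errorBudgetMinutes = windowMinutes * (1 - slo); // ~43.2 minutes of downtime
console.log(`Error budget: ${errorBudgetMinutes.toFixed(1)} minutes/month`);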

Health monitoring

Kubernetes monitoring

# View pod status
kubectl get pods -n chive

# View events
kubectl get events -n chive --sort-by=.lastTimestamp

# View resource usage
kubectl top pods -n chive

Database monitoring

# PostgreSQL
psql -c "SELECT * FROM pg_stat_activity;"

# Elasticsearch
curl http://localhost:9200/_cluster/health?pretty

# Neo4j
cypher-shell "CALL dbms.queryJmx('org.neo4j:*');"

# Redis
redis-cli INFO