Observability and Monitoring Developer Guide
Version: 1.0.0 | Phase: 15 - Observability & Monitoring | Last Updated: December 2025
Overview
Chive uses the three pillars of observability: logs, metrics, and traces. These tools help you understand system behavior, diagnose issues, and monitor performance.
This guide covers:
- Explanation: How observability works in Chive
- Tutorial: Adding observability to your service
- How-to guides: Common tasks and configurations
- Reference: Environment variables and defaults
Understanding observability
The three pillars
Logs record discrete events with context. Chive writes structured JSON logs to stdout. Promtail collects logs and sends them to Loki.
Metrics track numerical measurements over time. Prometheus scrapes the /metrics endpoint. Counters, gauges, and histograms capture request rates, queue sizes, and latencies.
Traces follow requests across services. OpenTelemetry sends spans to the OTEL Collector. Tempo stores traces for querying in Grafana.
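To make the pillars concrete, the sketch below shows a single operation emitting all three signals with the helpers covered later in this guide (createLogger, createMetrics, withSpan); the handleWork function and the work_completed_total counter are purely illustrative.
// Illustrative only: one operation producing a log, a metric, and a span.
import { createLogger, createMetrics, withSpan } from '@/observability/index.js';

const logger = createLogger({ level: 'info', service: 'example-service' });
const metrics = createMetrics({ prefix: 'chive_' });

async function handleWork(id: string): Promise<void> {
  return withSpan('handleWork', async () => {
    // Log: a discrete event with context
    logger.info('Work started', { id });

    // ... do the work ...

    // Metric: a counter scraped by Prometheus
    metrics.incrementCounter('work_completed_total', { status: 'success' });
  });
}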
Architecture
This architecture keeps telemetry collection outside the application. Your code writes to stdout and exposes metrics. The infrastructure handles collection and storage.
Why this approach?
Stdout logging is the twelve-factor app standard. It works everywhere: Docker, Kubernetes, local development. The application does not manage log files or network connections to log aggregators.
OTEL Collector decouples your application from backends. You can switch from Tempo to Jaeger without code changes. The collector handles batching, retries, and routing.
Prometheus pull model gives operators control. They configure what to scrape and how often. Applications expose metrics without knowing the collection infrastructure.
Tutorial: Adding observability to your service
This tutorial shows you how to add logging, metrics, and tracing to a new service.
Time: 15 minutes
Prerequisites:
- Chive development environment set up
- Familiarity with TypeScript
Step 1: Initialize telemetry
Call initTelemetry() before your application starts:
// src/index.ts
import { initTelemetry } from '@/observability/index.js';
// Initialize telemetry before anything else
initTelemetry({
  serviceName: 'my-service',
  serviceVersion: '0.1.0',
  environment: process.env.NODE_ENV || 'development',
});
// Now start your application
import { createServer } from '@/api/server.js';
const app = createServer(/* ... */);
The SDK automatically instruments HTTP requests, PostgreSQL queries, and Redis operations.
Step 2: Create a logger
Create a logger instance for your service:
// src/services/preprint-indexer.ts
import { createLogger } from '@/observability/index.js';
import type { ILogger } from '@/types/interfaces/logger.interface.js';
export class PreprintIndexer {
  private readonly logger: ILogger;

  constructor() {
    this.logger = createLogger({
      level: process.env.LOG_LEVEL || 'info',
      service: 'preprint-indexer',
    });
  }

  async indexPreprint(uri: string): Promise<void> {
    this.logger.info('Indexing preprint', { uri });
    const start = performance.now();

    try {
      // ... indexing logic
      this.logger.info('Preprint indexed', { uri, duration: Math.round(performance.now() - start) });
    } catch (error) {
      this.logger.error('Failed to index preprint', error as Error, { uri });
      throw error;
    }
  }
}
Logs include trace context automatically. When you view logs in Grafana, you can jump to the related trace.
Step 3: Add metrics
Track request counts and latencies:
// src/services/preprint-indexer.ts
import { createMetrics } from '@/observability/index.js';
import type { IMetrics } from '@/types/interfaces/metrics.interface.js';
export class PreprintIndexer {
  private readonly metrics: IMetrics;

  constructor() {
    this.metrics = createMetrics({ prefix: 'chive_' });
  }

  async indexPreprint(uri: string): Promise<void> {
    const endTimer = this.metrics.startTimer('preprint_indexing_duration_seconds', {
      operation: 'index',
    });

    try {
      // ... indexing logic
      this.metrics.incrementCounter('preprints_indexed_total', {
        status: 'success',
      });
    } catch (error) {
      this.metrics.incrementCounter('preprints_indexed_total', {
        status: 'error',
      });
      throw error;
    } finally {
      endTimer();
    }
  }
}
View your metrics at http://localhost:3000/metrics.
Step 4: Add custom spans
Wrap important operations in spans:
// src/services/preprint-indexer.ts
import { withSpan, addSpanAttributes, SpanAttributes } from '@/observability/index.js';
export class PreprintIndexer {
  async indexPreprint(uri: string): Promise<void> {
    return withSpan('indexPreprint', async () => {
      addSpanAttributes({
        [SpanAttributes.PREPRINT_URI]: uri,
      });

      // Nested spans for sub-operations
      const metadata = await withSpan('fetchMetadata', async () => {
        return this.fetchFromPDS(uri);
      });

      await withSpan('storeInPostgres', async () => {
        return this.storage.store(metadata);
      });

      await withSpan('indexInElasticsearch', async () => {
        return this.search.index(metadata);
      });
    });
  }
}
Each span shows timing and can include attributes for filtering.
Step 5: Verify it works
Start your service and make a request:
# Terminal 1: Start the service
npm run dev
# Terminal 2: Make a request
curl http://localhost:3000/api/v1/preprints
# Terminal 3: Check logs (JSON format)
# You should see structured logs with trace_id and span_id
# Terminal 4: Check metrics
curl http://localhost:3000/metrics | grep chive_
Expected output:
# HELP chive_http_requests_total Total HTTP requests
# TYPE chive_http_requests_total counter
chive_http_requests_total{method="GET",route="/api/v1/preprints",status="200"} 1
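If you prefer a scripted check over eyeballing curl output, a minimal sketch is shown below. It assumes Node 18+ (global fetch) and the service listening on port 3000; the verify-metrics script is illustrative and not part of the Chive codebase.
// verify-metrics.ts - hypothetical helper script, not part of the Chive codebase
async function verifyMetrics(): Promise<void> {
  const response = await fetch('http://localhost:3000/metrics');
  if (!response.ok) {
    throw new Error(`Metrics endpoint returned ${response.status}`);
  }

  const body = await response.text();
  if (!body.includes('chive_http_requests_total')) {
    throw new Error('Expected chive_http_requests_total in the metrics output');
  }

  console.log('Metrics endpoint looks healthy');
}

verifyMetrics().catch((error) => {
  console.error(error);
  process.exit(1);
});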
How-to guides
How to add a custom metric
Define metrics close to the code that uses them:
import type { IMetrics } from '@/types/interfaces/metrics.interface.js';
export class FirehoseConsumer {
  constructor(private readonly metrics: IMetrics) {}

  processEvent(event: RepoEvent): void {
    // lagSeconds and duration are computed during event handling; omitted here for brevity.

    // Counter: increment by 1
    this.metrics.incrementCounter('firehose_events_total', {
      event_type: event.type,
    });

    // Gauge: set absolute value
    this.metrics.setGauge('firehose_cursor_lag_seconds', lagSeconds, {
      relay: event.relay,
    });

    // Histogram: observe duration distribution
    this.metrics.observeHistogram('firehose_event_processing_seconds', duration, {
      event_type: event.type,
    });
  }
}
Naming conventions:
- Use snake_case
- End counters with _total
- End durations with _seconds
- Prefix with chive_
Avoid high cardinality labels:
// Bad: user_id creates unbounded cardinality
this.metrics.incrementCounter('requests', { user_id: userId });
// Good: aggregate by user type
this.metrics.incrementCounter('requests', { user_type: 'authenticated' });
How to add a custom span
Use withSpan for async operations:
import { withSpan, addSpanAttributes, recordSpanError } from '@/observability/index.js';
async function processDocument(doc: Document): Promise<Result> {
  return withSpan(
    'processDocument',
    async () => {
      // Add attributes for filtering and debugging
      addSpanAttributes({
        'document.id': doc.id,
        'document.size': doc.size,
      });

      const result = await transform(doc);

      // Add result attributes
      addSpanAttributes({
        'result.status': result.status,
      });

      return result;
    },
    {
      attributes: {
        'document.type': doc.type,
      },
    }
  );
}
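The recordSpanError helper imported above is useful when you handle an error locally but still want the span to show the failure. A minimal sketch, assuming recordSpanError(error) attaches the error to the active span (check the observability module for the exact signature):
async function processWithFallback(doc: Document): Promise<Result | null> {
  return withSpan('processDocument', async () => {
    try {
      return await transform(doc);
    } catch (error) {
      // The error is handled here, so without this call the span would look successful.
      recordSpanError(error as Error);
      return null;
    }
  });
}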
Use withSpanSync for synchronous operations:
import { withSpanSync } from '@/observability/index.js';
function parseMetadata(raw: string): Metadata {
  return withSpanSync('parseMetadata', () => {
    return JSON.parse(raw);
  });
}
How to configure log levels
Set the log level via environment variable:
# Options: trace, debug, info, warn, error, fatal
LOG_LEVEL=debug npm run dev
Or configure programmatically:
const logger = createLogger({
  level: 'debug', // Show debug and above
  service: 'my-service',
});
Log level guidelines:
| Level | Use for |
|---|---|
| trace | Very detailed debugging (e.g., inside loops) |
| debug | Debugging information |
| info | Normal operations, state changes |
| warn | Unexpected but handled conditions |
| error | Errors that need attention |
| fatal | Unrecoverable errors before exit |
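As a rough illustration of how the levels map onto one operation (refreshCache, fetchValue, and cache are hypothetical; debug and warn are assumed to take the same (message, context) arguments as info):
async function refreshCache(key: string): Promise<void> {
  logger.debug('Cache refresh started', { key });

  const value = await fetchValue(key);
  if (value === null) {
    logger.warn('Upstream returned no value, keeping stale entry', { key });
    return;
  }

  await cache.set(key, value);
  logger.info('Cache entry refreshed', { key });
}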
How to create child loggers
Add context that persists across log calls:
const requestLogger = logger.child({
  requestId: 'req_abc123',
  userId: 'did:plc:user',
});
requestLogger.info('Processing request'); // Includes requestId, userId
requestLogger.info('Fetching data'); // Includes requestId, userId
requestLogger.info('Request complete'); // Includes requestId, userId
How to redact sensitive data
The logger automatically redacts these fields:
password, secret, token, apiKey, api_key, accessToken, refreshToken, authorization, privateKey, sessionToken, credential
To log sensitive context safely:
// Bad: password appears in logs
logger.info('Login attempt', { username, password });
// Good: password is redacted automatically
logger.info('Login attempt', { username, password: userInput.password });
// Output: { username: "alice", password: "[REDACTED]" }
// Better: don't log sensitive fields at all
logger.info('Login attempt', { username });
How to query traces by ID
Extract trace context for cross-service correlation:
import { getTraceContext } from '@/observability/index.js';
async function handleRequest(req: Request, res: Response): Promise<void> {
  const traceContext = getTraceContext();

  if (traceContext) {
    // Include in response headers for client debugging
    res.setHeader('X-Trace-Id', traceContext.trace_id);

    // Include in logs for correlation
    logger.info('Request handled', {
      trace_id: traceContext.trace_id,
      span_id: traceContext.span_id,
    });
  }
}
In Grafana, search for traces using the trace ID:
- Open Grafana
- Go to Explore
- Select Tempo data source
- Enter trace ID in search
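Because the handler above echoes the trace ID in the X-Trace-Id response header, callers can surface it in bug reports and you can search for it directly in Tempo. A minimal client-side sketch (assuming the fetch API; fetchPreprints is hypothetical):
async function fetchPreprints(): Promise<unknown> {
  const response = await fetch('http://localhost:3000/api/v1/preprints');
  const traceId = response.headers.get('X-Trace-Id');

  if (!response.ok) {
    // Including the trace ID lets operators find the exact failing request in Tempo.
    throw new Error(`Request failed with ${response.status}${traceId ? ` (trace ID: ${traceId})` : ''}`);
  }

  return response.json();
}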
How to create Grafana alerts
Alert rules are defined in k8s/monitoring/alert-rules.yaml:
groups:
  - name: chive-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(chive_http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(chive_http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'High error rate detected'
          description: 'Error rate is {{ $value | humanizePercentage }}'
Test alerts locally:
# Port-forward Prometheus
kubectl port-forward svc/prometheus 9090:9090
# View active alerts
open http://localhost:9090/alerts
How to add health check dependencies
Extend the health check to include new services:
// In your health check handler
const checks: Record<string, HealthCheckResult> = {};
// Add your service check
try {
  const start = performance.now();
  await myService.ping();
  checks.myService = {
    status: 'pass',
    latencyMs: Math.round(performance.now() - start),
  };
} catch (error) {
  checks.myService = {
    status: 'fail',
    error: error instanceof Error ? error.message : 'Unknown error',
  };
}
Reference
Environment variables
| Variable | Default | Description |
|---|---|---|
| LOG_LEVEL | info | Minimum log level to output |
| OTEL_SERVICE_NAME | chive-appview | Service name in telemetry |
| OTEL_SERVICE_VERSION | from package.json | Service version |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://otel-collector:4318 | OTEL Collector HTTP endpoint |
| NODE_ENV | development | Environment name |
Default metrics
The following metrics are registered automatically:
HTTP metrics
| Metric | Type | Labels |
|---|---|---|
| chive_http_requests_total | Counter | method, route, status |
| chive_http_request_duration_seconds | Histogram | method, route |
Preprint metrics
| Metric | Type | Labels |
|---|---|---|
| chive_preprints_indexed_total | Counter | field, status |
| chive_preprint_indexing_duration_seconds | Histogram | operation |
Firehose metrics
| Metric | Type | Labels |
|---|---|---|
| chive_firehose_events_total | Counter | event_type |
| chive_firehose_cursor_lag_seconds | Gauge | relay |
| chive_firehose_active_connections | Gauge | relay |
Database metrics
| Metric | Type | Labels |
|---|---|---|
| chive_db_connections_active | Gauge | database |
| chive_db_query_duration_seconds | Histogram | database, operation |
Histogram buckets
Default histogram buckets (in seconds):
[0.01, 0.05, 0.1, 0.5, 1, 5, 10]
Customize buckets when creating metrics:
const metrics = createMetrics({
  prefix: 'chive_',
  defaultBuckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
});
Span attributes
Use these constants for consistent span naming:
import { SpanAttributes } from '@/observability/index.js';
// HTTP attributes
SpanAttributes.HTTP_METHOD; // 'http.method'
SpanAttributes.HTTP_ROUTE; // 'http.route'
SpanAttributes.HTTP_STATUS_CODE; // 'http.status_code'
// Database attributes
SpanAttributes.DB_SYSTEM; // 'db.system'
SpanAttributes.DB_NAME; // 'db.name'
SpanAttributes.DB_OPERATION; // 'db.operation'
// Chive-specific attributes
SpanAttributes.PREPRINT_URI; // 'chive.preprint.uri'
SpanAttributes.USER_DID; // 'chive.user.did'
SpanAttributes.REQUEST_ID; // 'chive.request.id'
Log format
Logs are JSON objects with these fields:
{
  "level": 30,
  "time": 1702300800000,
  "msg": "Request completed",
  "service": "chive-appview",
  "environment": "production",
  "version": "0.1.0",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "requestId": "req_xyz",
  "duration": 150
}
| Field | Type | Description |
|---|---|---|
| level | number | Pino log level (30=info) |
| time | number | Unix timestamp (milliseconds) |
| msg | string | Log message |
| service | string | Service name |
| environment | string | Deployment environment |
| version | string | Service version |
| trace_id | string | OpenTelemetry trace ID (if any) |
| span_id | string | OpenTelemetry span ID (if any) |
Alert severity levels
| Severity | Response time | Example |
|---|---|---|
| critical | Immediate | Service down, data loss risk |
| warning | Within 1 hour | Degraded performance, high lag |
| info | Next business day | Anomalies, capacity planning |
Troubleshooting
Traces not appearing in Tempo
- Check that telemetry is initialized before any imports:

  // This must be first
  initTelemetry({ serviceName: 'my-service' });

  // Then import application code
  import { createServer } from '@/api/server.js';

- Verify the OTEL Collector is running:

  kubectl get pods -n monitoring | grep otel

- Check collector logs for export errors:

  kubectl logs -n monitoring deployment/otel-collector
Metrics missing from Prometheus
- Check that the /metrics endpoint returns data:

  curl http://localhost:3000/metrics | grep chive_

- Verify Prometheus is scraping your service:

  # In Prometheus UI (Status > Targets)
  open http://localhost:9090/targets

- Check for metric name collisions. Each metric name must be unique and registered with a single type.
Logs not appearing in Loki
- Verify logs are going to stdout:

  kubectl logs deployment/chive-appview | head

- Check Promtail is running:

  kubectl get pods -n monitoring | grep promtail

- Verify Promtail labels match your Loki queries:

  {app="chive-appview"}
High cardinality warning
If you see "Cardinality limit reached for metric":
- Review label values. Each unique combination creates a time series.
- Remove high-cardinality labels like user IDs or request IDs.
- Use histograms instead of per-request metrics, as sketched below.
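For the last point, a minimal sketch that keeps cardinality bounded by labelling on the route rather than the individual request (RequestRecorder is hypothetical):
import type { IMetrics } from '@/types/interfaces/metrics.interface.js';

export class RequestRecorder {
  constructor(private readonly metrics: IMetrics) {}

  record(route: string, durationSeconds: number): void {
    // Bad: a label per request ID creates one time series per request.
    // this.metrics.incrementCounter('requests_total', { request_id: requestId });

    // Good: a histogram keyed by route keeps cardinality bounded while still
    // capturing the latency distribution.
    this.metrics.observeHistogram('request_duration_seconds', durationSeconds, {
      route,
    });
  }
}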