Observability and monitoring developer guide
Overview
Chive uses the three pillars of observability: logs, metrics, and traces. These tools help you understand system behavior, diagnose issues, and monitor performance.
This guide covers:
- Explanation: How observability works in Chive
- Tutorial: Adding observability to your service
- How-to guides: Common tasks and configurations
- Reference: Environment variables and defaults
Understanding observability
The three pillars
Logs record discrete events with context. Chive writes structured JSON logs to stdout. Promtail collects logs and sends them to Loki.
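To make "structured JSON logs" concrete, here is a minimal sketch of how one log line might be built by hand. The field names (`level`, `msg`, `service`, `time`), the `formatLogLine` helper, and the sample URI are assumptions for illustration and may not match Chive's actual log schema.

```typescript
// Hypothetical log entry shape; Chive's real field names may differ.
interface LogEntry {
  level: 'debug' | 'info' | 'warn' | 'error';
  msg: string;
  service: string;
  time: string; // ISO 8601 timestamp
  context: Record<string, unknown>;
}

// Serialize one entry as a single line of JSON (one event per line),
// which is the shape line-oriented collectors read from stdout.
function formatLogLine(entry: LogEntry): string {
  const { context, ...base } = entry;
  // Base fields are spread last so context keys cannot overwrite them.
  return JSON.stringify({ ...context, ...base });
}

const line = formatLogLine({
  level: 'info',
  msg: 'Indexing eprint',
  service: 'eprint-indexer',
  time: new Date(0).toISOString(),
  context: { uri: 'at://example/eprint/1' }, // placeholder URI
});

console.log(line);
```

Keeping each event on one line matters because stdout collectors generally treat a newline as the boundary between log records.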
Metrics track numerical measurements over time. Prometheus scrapes the /metrics endpoint. Counters, gauges, and histograms capture request rates, queue sizes, and latencies.
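Of the three metric types, histograms are the least obvious. The sketch below illustrates the Prometheus convention of cumulative buckets keyed by an upper bound, plus a running sum and count. The `LatencyHistogram` class and the bucket bounds are hypothetical and are not part of Chive's API.

```typescript
// Minimal sketch of a Prometheus-style histogram: cumulative bucket counts
// keyed by upper bound ("le"), plus a running sum and count.
class LatencyHistogram {
  private readonly upperBounds: number[];
  private readonly counts: number[];
  private sum = 0;
  private total = 0;

  // upperBounds must be sorted ascending; an implicit +Inf bucket is added.
  constructor(upperBounds: number[]) {
    this.upperBounds = upperBounds;
    this.counts = new Array<number>(upperBounds.length + 1).fill(0);
  }

  observe(value: number): void {
    this.sum += value;
    this.total += 1;
    // Cumulative: a value increments every bucket whose bound is >= value.
    for (let i = 0; i < this.upperBounds.length; i++) {
      if (value <= this.upperBounds[i]) this.counts[i] += 1;
    }
    this.counts[this.upperBounds.length] += 1; // +Inf bucket sees everything
  }

  snapshot(): { buckets: Record<string, number>; sum: number; count: number } {
    const buckets: Record<string, number> = {};
    this.upperBounds.forEach((bound, i) => {
      buckets[String(bound)] = this.counts[i];
    });
    buckets['+Inf'] = this.counts[this.upperBounds.length];
    return { buckets, sum: this.sum, count: this.total };
  }
}

const latency = new LatencyHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.3, 2].forEach((seconds) => latency.observe(seconds));
console.log(latency.snapshot().buckets);
// { '0.1': 1, '0.5': 3, '1': 3, '+Inf': 4 }
```

Because buckets are cumulative, the `+Inf` bucket always equals the total observation count, which is what lets Prometheus estimate quantiles from bucket counts alone.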
Traces follow requests across services. OpenTelemetry sends spans to the OTEL Collector. Tempo stores traces for querying in Grafana.
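Spans from different services join into one trace because the trace context travels with each request. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header. Assuming Chive keeps that default, the sketch below shows the header's structure. The `parseTraceparent` and `formatTraceparent` helpers are hypothetical, handle only version `00`, and are shown for understanding, since the SDK does this for you.

```typescript
// Parsed form of a W3C Trace Context `traceparent` header.
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// version "00" - trace-id - parent span id - flags
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed headers or the all-zero IDs the spec forbids.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(incoming);
```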
Architecture
This architecture keeps telemetry collection outside the application. Your code writes to stdout and exposes metrics. The infrastructure handles collection and storage.
Why this approach?
Stdout logging follows the twelve-factor app methodology, which treats logs as an event stream. It works the same way in Docker, Kubernetes, and local development. The application does not manage log files or network connections to log aggregators.
OTEL Collector decouples your application from backends. You can switch from Tempo to Jaeger without code changes. The collector handles batching, retries, and routing.
Prometheus pull model gives operators control. They configure what to scrape and how often. Applications expose metrics without knowing the collection infrastructure.
Tutorial: Adding observability to your service
This tutorial shows you how to add logging, metrics, and tracing to a new service.
Time: 15 minutes
Prerequisites:
- Chive development environment set up
- Familiarity with TypeScript
Step 1: Initialize telemetry
Call initTelemetry() before your application starts:
// src/index.ts
import { initTelemetry } from '@/observability/index.js';
// Initialize telemetry before anything else
initTelemetry({
  serviceName: 'my-service',
  serviceVersion: '0.4.0',
  environment: process.env.NODE_ENV || 'development',
});
// Now start your application. In ES modules, static imports are hoisted and
// evaluated before initTelemetry() runs, so load the server dynamically.
const { createServer } = await import('@/api/server.js');
const app = createServer(/* ... */);
The SDK automatically instruments HTTP requests, PostgreSQL queries, and Redis operations.
Step 2: Create a logger
Create a logger instance for your service:
// src/services/eprint-indexer.ts
import { createLogger } from '@/observability/index.js';
import type { ILogger } from '@/types/interfaces/logger.interface.js';
export class EprintIndexer {
  private readonly logger: ILogger;
  constructor() {
    this.logger = createLogger({
      level: process.env.LOG_LEVEL || 'info',
      service: 'eprint-indexer',
    });
  }
  async indexEprint(uri: string): Promise<void> {
    const startedAt = Date.now();
    this.logger.info('Indexing eprint', { uri });
    try {
      // ... indexing logic
      const elapsed = Date.now() - startedAt; // milliseconds
      this.logger.info('Eprint indexed', { uri, duration: elapsed });
    } catch (error) {
      this.logger.error('Failed to index eprint', error as Error, { uri });
      throw error;
    }
  }
}
Logs include trace context automatically. When you view logs in Grafana, you can jump to the related trace.
Step 3: Add metrics
Track request counts and latencies:
// src/services/eprint-indexer.ts
import { createMetrics } from '@/observability/index.js';
import type { IMetrics } from '@/types/interfaces/metrics.interface.js';
export class EprintIndexer {
  private readonly metrics: IMetrics;
  constructor() {
    this.metrics = createMetrics({ prefix: 'chive_' });
  }
  async indexEprint(uri: string): Promise<void> {
    const endTimer = this.metrics.startTimer('eprint_indexing_duration_seconds', {
      operation: 'index',
    });
    try {
      // ... indexing logic
      this.metrics.incrementCounter('eprints_indexed_total', {
        status: 'success',
      });
    } catch (error) {
      this.metrics.incrementCounter('eprints_indexed_total', {
        status: 'error',
      });
      throw error;
    } finally {
      endTimer();
    }
  }
}
View your metrics at http://localhost:3000/metrics.
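The `startTimer` call in this step returns a function that stops the timer. The guide does not show how that works internally, so the following is only a sketch of one plausible design. The `makeStartTimer` helper, the `observe` callback, and the injectable clock are hypothetical names introduced for illustration.

```typescript
type LabelSet = Record<string, string>;
type Observe = (metric: string, seconds: number, labels: LabelSet) => void;

// Build a startTimer function on top of an "observe" callback.
// The clock (in milliseconds) is injectable so behavior is deterministic in tests.
function makeStartTimer(observe: Observe, nowMs: () => number = Date.now) {
  return (metric: string, labels: LabelSet = {}): (() => number) => {
    const startedAt = nowMs();
    // The returned function records the elapsed time in seconds and returns it.
    return () => {
      const seconds = (nowMs() - startedAt) / 1000;
      observe(metric, seconds, labels);
      return seconds;
    };
  };
}

// Usage with a fake clock that advances 250 ms on every read.
let fakeNow = 1000;
const recorded: Array<{ metric: string; seconds: number }> = [];
const startTimer = makeStartTimer(
  (metric, seconds) => {
    recorded.push({ metric, seconds });
  },
  () => {
    const current = fakeNow;
    fakeNow += 250;
    return current;
  },
);

const endTimer = startTimer('eprint_indexing_duration_seconds', { operation: 'index' });
endTimer();
console.log(recorded); // one observation of 0.25 seconds
```

Calling the end function in a `finally` block, as the step above does, ensures the duration is recorded for failed operations too.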
Step 4: Add custom spans
Wrap important operations in spans:
// src/services/eprint-indexer.ts
import { withSpan, addSpanAttributes, SpanAttributes } from '@/observability/index.js';
export class EprintIndexer {
  async indexEprint(uri: string): Promise<void> {
    return withSpan('indexEprint', async () => {
      addSpanAttributes({
        [SpanAttributes.EPRINT_URI]: uri,
      });
      // Nested spans for sub-operations
      const metadata = await withSpan('fetchMetadata', async () => {
        return this.fetchFromPDS(uri);
      });
      await withSpan('storeInPostgres', async () => {
        return this.storage.store(metadata);
      });
      await withSpan('indexInElasticsearch', async () => {
        return this.search.index(metadata);
      });
    });
  }
}
Each span shows timing and can include attributes for filtering.
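The implementation of `withSpan` is not shown in this guide. To illustrate what a helper of this kind has to guarantee (timing, parent/child nesting, and ending the span even on failure), here is a simplified in-memory stand-in. `withSpanSketch` and `RecordedSpan` are hypothetical, and the plain stack used for nesting only works for sequentially awaited spans. Real tracers rely on async context propagation instead.

```typescript
interface RecordedSpan {
  label: string;
  parentLabel: string | null;
  ok: boolean;
  durationMs: number;
}

const finishedSpans: RecordedSpan[] = [];
const activeStack: string[] = [];

// Times the callback, records success or failure, and always ends the span,
// even when the callback throws.
async function withSpanSketch<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const parentLabel = activeStack.length > 0 ? activeStack[activeStack.length - 1] : null;
  const startedAt = Date.now();
  activeStack.push(label);
  let ok = true;
  try {
    return await fn();
  } catch (error) {
    ok = false;
    throw error; // rethrow so callers still see the failure
  } finally {
    activeStack.pop();
    finishedSpans.push({ label, parentLabel, ok, durationMs: Date.now() - startedAt });
  }
}

async function demo(): Promise<void> {
  await withSpanSketch('indexEprint', async () => {
    await withSpanSketch('fetchMetadata', async () => 'metadata');
    await withSpanSketch('storeInPostgres', async () => undefined);
  });
  // Children finish (and are recorded) before their parent.
  console.log(finishedSpans.map((s) => `${s.parentLabel ?? 'root'} > ${s.label}`));
}

const demoRun = demo();
```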
Step 5: Verify it works
Start your service and make a request:
# Terminal 1: Start the service
npm run dev
# Terminal 2: Make a request
curl http://localhost:3000/api/v1/eprints
# Terminal 3: Check logs (JSON format)
# You should see structured logs with trace_id and span_id
# Terminal 4: Check metrics
curl http://localhost:3000/metrics | grep chive_
Expected output:
# HELP chive_http_requests_total Total HTTP requests
# TYPE chive_http_requests_total counter
chive_http_requests_total{method="GET",route="/api/v1/eprints",status="200"} 1
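The expected output follows the Prometheus text exposition format. As a sketch of how one sample line is put together, including the escaping that format requires for label values, consider the following. The `renderSample` and `escapeLabelValue` helpers are hypothetical. In practice a metrics library produces this output for you.

```typescript
// Escape a label value for the Prometheus text format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render one sample line: metric name, optional {label="value",...}, then the value.
function renderSample(metric: string, labels: Record<string, string>, value: number): string {
  const pairs = Object.entries(labels).map(
    ([key, val]) => `${key}="${escapeLabelValue(val)}"`,
  );
  const labelPart = pairs.length > 0 ? `{${pairs.join(',')}}` : '';
  return `${metric}${labelPart} ${value}`;
}

const sample = renderSample(
  'chive_http_requests_total',
  { method: 'GET', route: '/api/v1/eprints', status: '200' },
  1,
);
console.log(sample);
// chive_http_requests_total{method="GET",route="/api/v1/eprints",status="200"} 1
```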
How-to guides
How to add a custom metric
Define metrics close to the code that uses them:
import type { IMetrics } from '@/types/interfaces/metrics.interface.js';
export class FirehoseConsumer {
  constructor(private readonly metrics: IMetrics) {}
  processEvent(event: RepoEvent): void {
    // lagSeconds and duration are assumed to be computed earlier (omitted here)
    // Counter: increment by 1
    this.metrics.incrementCounter('firehose_events_total', {
      event_type: event.type,
    });
    // Gauge: set absolute value
    this.metrics.setGauge('firehose_cursor_lag_seconds', lagSeconds, {
      relay: event.relay,
    });
    // Histogram: observe duration distribution
    this.metrics.observeHistogram('firehose_event_processing_seconds', duration, {
      event_type: event.type,
    });
  }
}
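Because `FirehoseConsumer` receives its metrics object through the constructor, a test can pass in an in-memory double and assert on what was recorded. The sketch below assumes a hypothetical `MetricsLike` interface inferred from the three calls shown above. The real `IMetrics` interface may have more methods or different signatures, and the `commit` event type is a placeholder.

```typescript
type MetricLabels = Record<string, string>;

// Hypothetical subset of the metrics interface, inferred from the calls above.
interface MetricsLike {
  incrementCounter(metric: string, labels?: MetricLabels): void;
  setGauge(metric: string, value: number, labels?: MetricLabels): void;
  observeHistogram(metric: string, value: number, labels?: MetricLabels): void;
}

// In-memory test double keyed by metric name plus sorted labels.
class InMemoryMetrics implements MetricsLike {
  readonly counters = new Map<string, number>();
  readonly gauges = new Map<string, number>();
  readonly observations = new Map<string, number[]>();

  private key(metric: string, labels: MetricLabels = {}): string {
    const parts = Object.keys(labels).sort().map((k) => `${k}=${labels[k]}`);
    return `${metric}{${parts.join(',')}}`;
  }

  incrementCounter(metric: string, labels?: MetricLabels): void {
    const k = this.key(metric, labels);
    this.counters.set(k, (this.counters.get(k) ?? 0) + 1);
  }

  setGauge(metric: string, value: number, labels?: MetricLabels): void {
    this.gauges.set(this.key(metric, labels), value); // gauges overwrite
  }

  observeHistogram(metric: string, value: number, labels?: MetricLabels): void {
    const k = this.key(metric, labels);
    this.observations.set(k, [...(this.observations.get(k) ?? []), value]);
  }
}

const fake = new InMemoryMetrics();
fake.incrementCounter('firehose_events_total', { event_type: 'commit' });
fake.incrementCounter('firehose_events_total', { event_type: 'commit' });
fake.setGauge('firehose_cursor_lag_seconds', 3);
fake.setGauge('firehose_cursor_lag_seconds', 1.5);
console.log(fake.counters.get('firehose_events_total{event_type=commit}')); // 2
console.log(fake.gauges.get('firehose_cursor_lag_seconds{}')); // 1.5
```

The double also makes the difference between the types visible: counters accumulate, gauges keep only the latest value, and histograms keep every observation.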
Naming conventions:
- Use snake_case
- End counters with _total
- End durations with _seconds
- Prefix with chive_
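These conventions are easy to check mechanically, for example in a unit test or lint step. The sketch below uses a hypothetical `checkMetricName` helper and applies the rules to the full exported name, assuming the prefix passed to `createMetrics` is prepended to the names used in code.

```typescript
// Check a full metric name against the naming conventions.
// `kind` tells the checker which suffix rule applies.
function checkMetricName(metric: string, kind: 'counter' | 'duration' | 'other'): string[] {
  const problems: string[] = [];
  if (!/^[a-z][a-z0-9]*(_[a-z0-9]+)*$/.test(metric)) {
    problems.push('not snake_case');
  }
  if (!metric.startsWith('chive_')) {
    problems.push('missing chive_ prefix');
  }
  if (kind === 'counter' && !metric.endsWith('_total')) {
    problems.push('counter must end with _total');
  }
  if (kind === 'duration' && !metric.endsWith('_seconds')) {
    problems.push('duration must end with _seconds');
  }
  return problems;
}

console.log(checkMetricName('chive_eprints_indexed_total', 'counter')); // []
console.log(checkMetricName('eprintsIndexed', 'counter'));
// [ 'not snake_case', 'missing chive_ prefix', 'counter must end with _total' ]
```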
Avoid high cardinality labels:
// Bad: user_id creates unbounded cardinality
this.metrics.incrementCounter('requests_total', { user_id: userId });
// Good: aggregate by user type
this.metrics.incrementCounter('requests_total', { user_type: 'authenticated' });
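The same idea applies to any label derived from request data: map the raw value onto a small, fixed set before using it as a label. The sketch below shows two such mappings. The `userTypeLabel` and `normalizeRoute` helpers, the `anonymous` label value, and the `:id` and `:uuid` placeholders are hypothetical choices for illustration.

```typescript
// Collapse an unbounded user ID into a two-value label.
function userTypeLabel(userId: string | undefined): 'authenticated' | 'anonymous' {
  return userId ? 'authenticated' : 'anonymous';
}

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Collapse variable path segments so the `route` label stays bounded.
function normalizeRoute(path: string): string {
  return path
    .split('?')[0] // drop the query string
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (UUID_PATTERN.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

console.log(userTypeLabel('did:example:alice')); // authenticated
console.log(userTypeLabel(undefined)); // anonymous
console.log(normalizeRoute('/api/v1/eprints/12345?limit=10')); // /api/v1/eprints/:id
```

Each distinct combination of label values creates a separate time series, so bounding every label keeps the total series count predictable.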