Logging and Monitoring Infrastructure Design

How do I design a logging and monitoring infrastructure?

TL;DR

Bottom line: A production-grade observability stack requires three pillars — structured logs, time-series metrics, and distributed traces — unified through OpenTelemetry and stored in purpose-built backends (ELK/Loki for logs, Prometheus/Mimir for metrics, Jaeger/Tempo for traces).
Key tool/command: otelcol (the OpenTelemetry Collector binary; otelcol-contrib for the contrib distribution) as the universal telemetry pipeline: it receives, processes, and exports all three signal types.
Watch out for: Logging everything at DEBUG level in production — it will overwhelm storage, spike costs, and obscure real issues in noise.
Works with: Any language with OpenTelemetry SDK support (Go, Java, Python, Node.js, .NET, Rust, C++, Ruby, PHP, Swift, Erlang); Kubernetes and VM deployments; all major cloud providers.

Constraints

Never log secrets, tokens, passwords, PII, or credit card numbers — sanitize all sensitive fields before emission
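Retention policies must comply with applicable data regulations (GDPR sets no fixed retention period but requires that personal data be kept no longer than necessary; SOC 2 prescribes no fixed period either, though retaining audit-relevant logs for about a year is a common practice to cover the audit window)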
Always use structured (JSON) logging in production — unstructured plaintext breaks automated parsing, indexing, and alerting
Instrument with OpenTelemetry SDK when possible — it is vendor-neutral and prevents backend lock-in
Set per-service log-level controls — global DEBUG across all services will cost 5-10x more in storage and processing
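
The first constraint can be illustrated with a small sanitizing step that runs before a log event is emitted. This is a minimal sketch, not a complete PII scrubber: the SENSITIVE_KEYS deny-list and the sanitize helper are hypothetical names, and the approach assumes sensitive values can be identified by field name alone.

```python
import json

# Hypothetical deny-list; extend it to match your own log schema.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn", "card_number"}

def sanitize(event):
    """Return a copy of event with sensitive values masked.

    Recurses into nested dicts and lists; the input is not modified.
    """
    if isinstance(event, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else sanitize(value)
            for key, value in event.items()
        }
    if isinstance(event, list):
        return [sanitize(item) for item in event]
    return event

record = {
    "event": "user_login",
    "user_id": "u-123",
    "password": "hunter2",
    "headers": {"Authorization": "Bearer abc"},
}
print(json.dumps(sanitize(record)))
```

Name-based masking misses secrets embedded in free-text messages, so it complements, rather than replaces, keeping sensitive values out of log calls in the first place.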

Quick Reference

Component	Role	Technology Options	Scaling Strategy
Log Collection Agent	Ships logs from hosts/containers to aggregator	Fluent Bit, Filebeat, Vector, OTel Collector	DaemonSet per node (K8s) or sidecar
Log Aggregation	Centralizes, parses, enriches log streams	Logstash, Fluentd, OTel Collector	Horizontal replicas behind buffer (Kafka)
Log Storage & Search	Indexes and queries log data	Elasticsearch, Grafana Loki, ClickHouse	Elasticsearch: shard-per-index; Loki: label-based partitioning
Metrics Collection	Scrapes/receives numeric time-series data	Prometheus, OTel Collector, Telegraf	Federation or Thanos/Mimir for multi-cluster
Metrics Storage	Long-term time-series persistence	Prometheus TSDB, Thanos, Mimir, VictoriaMetrics	Remote-write to durable store; compaction + downsampling
Trace Collection	Captures distributed request spans	OTel SDK + Collector, Jaeger Agent	Tail-based sampling at Collector tier
Trace Storage	Stores and indexes span data	Jaeger, Grafana Tempo, Zipkin	Tempo: object storage (S3/GCS); Jaeger: Elasticsearch/Cassandra
Visualization	Dashboards, exploration, correlation	Grafana, Kibana, Datadog UI	Read replicas; CDN for static assets
Alerting	Evaluates rules, routes notifications	Alertmanager, Grafana Alerting, PagerDuty	HA pairs with deduplication
Buffer/Queue	Decouples producers from consumers	Apache Kafka, Redis Streams, Amazon Kinesis	Kafka: partition-per-topic scaling
Service Mesh Telemetry	Auto-instruments inter-service traffic	Istio, Linkerd, Envoy (built-in)	Sidecar proxy per pod
Pipeline Orchestrator	Unified telemetry routing and processing	OpenTelemetry Collector, Vector	Gateway mode for centralized processing

Decision Tree

START: Choose your logging backend
├── Budget < $500/month AND log volume < 50 GB/day?
│   ├── YES → Grafana Loki + Promtail (low resource, label-indexed)
│   └── NO ↓
├── Need full-text search across all log fields?
│   ├── YES → Elasticsearch (ELK Stack) — powerful query language (KQL/Lucene)
│   └── NO ↓
├── Want zero operational overhead?
│   ├── YES → Managed SaaS (Datadog, Splunk Cloud, AWS CloudWatch)
│   └── NO ↓
├── Running on Kubernetes?
│   ├── YES → Loki + OTel Collector DaemonSet (native K8s metadata enrichment)
│   └── NO ↓
├── Log volume > 1 TB/day AND need SQL analytics?
│   ├── YES → ClickHouse + Vector (columnar storage, fast aggregations)
│   └── NO ↓
└── DEFAULT → ELK Stack (most documentation, largest community)

METRICS BACKEND:
├── Already using Grafana?
│   ├── YES → Prometheus + Grafana (native integration)
│   └── NO ↓
├── Multi-cluster / global federation needed?
│   ├── YES → Thanos or Grafana Mimir (long-term, multi-cluster Prometheus)
│   └── NO ↓
└── DEFAULT → Prometheus (pull-based, CNCF graduated, industry standard)

TRACES BACKEND:
├── Want minimal infrastructure?
│   ├── YES → Grafana Tempo (object storage, no indexing required)
│   └── NO ↓
├── Need deep trace analytics and search?
│   ├── YES → Jaeger with Elasticsearch backend
│   └── NO ↓
└── DEFAULT → OTel Collector → Tempo (simplest path)
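
The logging branch of the tree above can also be read as an ordered series of checks, where the first match wins. The sketch below encodes it that way; the function name and parameters are hypothetical, and 1 TB/day is treated as 1000 GB/day for simplicity.

```python
def choose_log_backend(budget_usd_month, volume_gb_day, needs_full_text,
                       wants_managed, on_kubernetes, needs_sql):
    """Walk the logging decision tree top to bottom; the first match wins."""
    if budget_usd_month < 500 and volume_gb_day < 50:
        return "Grafana Loki + Promtail"
    if needs_full_text:
        return "Elasticsearch (ELK Stack)"
    if wants_managed:
        return "Managed SaaS"
    if on_kubernetes:
        return "Loki + OTel Collector DaemonSet"
    if volume_gb_day > 1000 and needs_sql:
        return "ClickHouse + Vector"
    return "ELK Stack"

# A Kubernetes team with a moderate budget and no full-text requirement.
print(choose_log_backend(
    budget_usd_month=2000, volume_gb_day=200, needs_full_text=False,
    wants_managed=False, on_kubernetes=True, needs_sql=False,
))
```

Because the checks are ordered, a team that needs full-text search lands on Elasticsearch even when running on Kubernetes; reorder the checks if your priorities differ.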

Step-by-Step Guide

1. Define your telemetry signals and data model

Establish which of the three pillars — logs, metrics, traces — you need from day one. Most production systems need all three. Define a consistent naming convention for metrics (service_name_operation_unit), log fields (timestamp, level, service, trace_id, message), and trace attributes (service.name, deployment.environment). [src1]

# OpenTelemetry resource attributes (define once per service)
resource:
  attributes:
    service.name: "payment-service"
    service.version: "2.1.0"
    deployment.environment: "production"
    service.namespace: "checkout"

Verify: All services emit service.name and deployment.environment in every telemetry signal.
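
One way to automate that check is a small lint step over each service's resource attributes. The sketch below assumes the attributes are available as a plain dict (for example, parsed from a service's configuration); REQUIRED_ATTRIBUTES and missing_attributes are hypothetical names.

```python
# Attributes every service is expected to set, per the convention above.
REQUIRED_ATTRIBUTES = ("service.name", "deployment.environment")

def missing_attributes(resource_attributes):
    """Return the required keys that are absent or empty in the given dict."""
    return [key for key in REQUIRED_ATTRIBUTES if not resource_attributes.get(key)]

payment = {
    "service.name": "payment-service",
    "service.version": "2.1.0",
    "deployment.environment": "production",
    "service.namespace": "checkout",
}
legacy = {"service.name": "legacy-batch"}

print(missing_attributes(payment))  # []
print(missing_attributes(legacy))   # ['deployment.environment']
```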

2. Instrument applications with OpenTelemetry SDKs

Add the OpenTelemetry SDK to each service. Use auto-instrumentation for common frameworks (Express, Flask, Spring Boot) and add manual spans for business-critical paths. [src1]

# Python: install OpenTelemetry packages
pip install opentelemetry-api==1.29.0 \
            opentelemetry-sdk==1.29.0 \
            opentelemetry-exporter-otlp==1.29.0 \
            opentelemetry-instrumentation-flask==0.50b0 \
            opentelemetry-instrumentation-requests==0.50b0

# Node.js: install OpenTelemetry packages
# NOTE: assumed package set, left unpinned; pin versions to match your project
npm install @opentelemetry/api \
            @opentelemetry/sdk-node \
            @opentelemetry/auto-instrumentations-node \
            @opentelemetry/exporter-trace-otlp-http

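Verify: curl -s -o /dev/null -w "%{http_code}\n" -X POST -H "Content-Type: application/json" -d '{}' http://localhost:4318/v1/traces prints 200 from the local OTel Collector (the OTLP/HTTP endpoint accepts POST only, so a plain GET is rejected).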

3. Deploy the OpenTelemetry Collector as the central pipeline

The OTel Collector acts as a vendor-neutral proxy that receives telemetry from all services, processes it (batching, sampling, enrichment), and exports to your chosen backends. Deploy as a DaemonSet in Kubernetes or a sidecar/gateway in VM environments. [src6]

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
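  prometheusremotewrite:
    # Assumes a Mimir service named "mimir". To target a Prometheus server instead,
    # start Prometheus with --web.enable-remote-write-receiver and use
    # http://prometheus:9090/api/v1/write
    endpoint: "http://mimir:9009/api/v1/push"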
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

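# The health check endpoint on port 13133 only exists when this extension is enabled
extensions:
  health_check:
    endpoint: "0.0.0.0:13133"

service:
  extensions: [health_check]
  pipelines: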
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]

Verify: curl -s http://localhost:13133/ returns a JSON body containing "status":"Server available" from the Collector's health_check extension.

4. Set up log storage backend

Choose between Elasticsearch (full-text search), Loki (label-indexed, low cost), or a managed service. If you choose Elasticsearch, configure index lifecycle management (ILM) to roll over and delete old indices automatically; in Loki, set a retention period instead. [src3] [src4]

// Elasticsearch ILM policy example
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm":   { "min_age": "7d",  "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
      "cold":   { "min_age": "30d", "actions": { "searchable_snapshot": { "snapshot_repository": "s3-repo" } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

Verify: curl -s http://localhost:9200/_ilm/policy/logs-policy | jq . returns the stored policy with all four phases.
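
To make the phase thresholds concrete, the sketch below maps an index's age to the phase it would be in under the example policy. It is only an illustration of the thresholds (7d, 30d, 90d): the ilm_phase helper is hypothetical, and when rollover is used Elasticsearch measures min_age from the rollover time rather than from index creation.

```python
# (min_age in days, phase), checked from the oldest threshold down.
PHASES = [(90, "delete"), (30, "cold"), (7, "warm"), (0, "hot")]

def ilm_phase(age_days):
    """Return the phase an index of the given age falls into."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    for min_age, phase in PHASES:
        if age_days >= min_age:
            return phase

for age in (0, 7, 29, 30, 90):
    print(age, ilm_phase(age))
```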

5. Configure metrics collection with Prometheus

Deploy Prometheus with service discovery for your environment (Kubernetes annotations, Consul, or file-based targets). Define recording rules for pre-aggregation and alerting rules for SLO-based alerts. [src2]

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to <pod IP>:<annotated port>
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Verify: curl http://localhost:9090/api/v1/targets shows all expected targets as "health": "up".

6. Build Grafana dashboards for visualization and alerting

Create dashboards following the RED method (Rate, Errors, Duration) for services and USE method (Utilization, Saturation, Errors) for infrastructure. Set up alert rules with meaningful thresholds and route to appropriate channels. [src2] [src5]

# Deploy Grafana with provisioned datasources
docker run -d --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=changeme \
  -v "$(pwd)/grafana/provisioning:/etc/grafana/provisioning" \
  grafana/grafana:11.4.0

# Verify datasource connectivity (quote the URL so the shell does not glob the "?")
curl -u admin:changeme \
  "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"

Verify: In the Grafana UI at http://localhost:3000, each data source passes "Save & test" without errors.
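
For readers new to the RED method, the sketch below computes the three numbers from raw request records. In practice these come from PromQL queries over counters and histograms; this plain-Python version is only an illustration, with red_summary and the record fields as hypothetical names, and with HTTP 5xx treated as the error condition.

```python
import math

def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration (p95, nearest rank) for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    rank = math.ceil(len(durations) * 95 / 100)  # nearest-rank percentile
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_duration_ms": durations[rank - 1],
    }

# 20 requests in a 10-second window; the slowest one failed with a 500.
sample = [{"duration_ms": 10 * i, "status": 200} for i in range(1, 20)]
sample.append({"duration_ms": 200, "status": 500})
print(red_summary(sample, window_seconds=10))
```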

Code Examples

Python: Structured Logging with OpenTelemetry Context

# Input:  Application events during request handling
# Output: JSON log lines with trace_id, span_id, and structured fields

import structlog  # structlog==24.4.0
from opentelemetry import trace

# Configure structlog for JSON output with OTel context
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(20),
)

logger = structlog.get_logger()

def process_order(order_id: str, amount: float):
    """Log with automatic trace context propagation."""
    span = trace.get_current_span()
    ctx = span.get_span_context()
    # bind() returns a new logger, so keep the result;
    # every log made through it in this scope then includes the context
    log = logger.bind(
        trace_id=format(ctx.trace_id, "032x"),
        span_id=format(ctx.span_id, "016x"),
        order_id=order_id,
    )
    log.info("order_processing_started", amount=amount)
    try:
        # ... business logic ...
        log.info("order_processing_completed", amount=amount)
    except Exception as e:
        log.error("order_processing_failed",
                  error=str(e), error_type=type(e).__name__)
        raise

Node.js: Structured Logging with Pino and OpenTelemetry

// Input:  HTTP requests to an Express service
// Output: JSON log lines with trace context, request metadata

const express = require("express");
const pino = require("pino");
const { trace } = require("@opentelemetry/api");

const app = express();

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: ["req.headers.authorization", "body.password"],
});

function withTraceContext(log) {
  const span = trace.getActiveSpan();
  if (!span) return log;
  const ctx = span.spanContext();
  return log.child({
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}

// Express middleware example
app.use((req, res, next) => {
  req.log = withTraceContext(logger).child({
    method: req.method,
    path: req.url,
    request_id: req.headers["x-request-id"],
  });
  req.log.info("request_received");
  next();
});

Python: Prometheus Metrics with Labels

# Input:  HTTP request handling in a Flask/FastAPI service
# Output: Prometheus metrics exposed at /metrics endpoint

from prometheus_client import (  # prometheus-client==0.21.1
    Counter, Histogram, Gauge, start_http_server
)

# RED method metrics for services
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
# USE method metrics for resources
QUEUE_SIZE = Gauge(
    "task_queue_size",
    "Current number of tasks in the processing queue",
    ["queue_name"],
)

def track_request(method, endpoint, status, duration):
    """Record RED metrics for a completed request."""
    REQUEST_COUNT.labels(
        method=method, endpoint=endpoint, status=status
    ).inc()
    REQUEST_DURATION.labels(
        method=method, endpoint=endpoint
    ).observe(duration)

# Start the metrics server on port 8000 (9090 is the Prometheus server's own default port)
start_http_server(8000)

YAML: Complete Docker Compose Observability Stack

# Input:  Docker environment needing full observability
# Output: Running Loki + Prometheus + Tempo + Grafana + OTel Collector

version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "13133:13133" # Health check

  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  prometheus:
    image: prom/prometheus:v3.1.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    ports:
      - "9090:9090"

  tempo:
    image: grafana/tempo:2.7.0
    command: ["-config.file=/etc/tempo/config.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"

  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana

volumes:
  loki-data:
  prom-data:
  tempo-data:
  grafana-data:

Anti-Patterns

Wrong: Unstructured string concatenation in logs

# BAD — unstructured text makes automated parsing impossible
import logging
logger = logging.getLogger(__name__)

def process_payment(user_id, amount):
    logger.info("Processing payment for user " + user_id + " amount: $" + str(amount))
    # Output: "Processing payment for user u-123 amount: $49.99"
    # Cannot query by user_id or amount without regex parsing

Correct: Structured JSON logging with typed fields

# GOOD — structured fields are indexable, queryable, and alertable
import structlog
logger = structlog.get_logger()

def process_payment(user_id: str, amount: float):
    logger.info("payment_processing",
                user_id=user_id, amount=amount, currency="USD")
    # Output: {"event":"payment_processing","user_id":"u-123","amount":49.99,"currency":"USD"}
    # Every field is independently queryable
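
Where adding structlog is not an option, the same shape of output can be approximated with the standard library alone. The sketch below is one possible approach: the JsonFormatter class and the "fields" key passed through extra are hypothetical conventions, not part of the logging module.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()  # stand-in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("payment_processing",
            extra={"fields": {"user_id": "u-123", "amount": 49.99, "currency": "USD"}})
line = json.loads(stream.getvalue())
print(line["event"], line["user_id"], line["amount"])
```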

Wrong: Logging sensitive data without redaction

// BAD — PII and secrets in plain text logs
logger.info("User login", {
  email: user.email,           // PII
  password: req.body.password, // credential
  token: session.jwt,          // secret
  ssn: user.socialSecurity,    // regulated data
});

Correct: Redacting sensitive fields before logging

// GOOD — redact sensitive fields, log only safe identifiers
const pino = require("pino");
const logger = pino({
  redact: ["password", "token", "ssn", "*.authorization"],
});

// pino takes the fields object first and the message second
logger.info({
  user_id: user.id,            // safe identifier
  email: "[REDACTED]",         // or omit entirely
  ip: req.ip,                  // may need consent in EU
  login_method: "password",
}, "User login");

Wrong: Using a single high-cardinality metric label

# BAD — user_id as a Prometheus label creates millions of time series
# This will crash Prometheus or cause massive memory usage
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Requests",
    ["method", "endpoint", "user_id"],  # user_id = cardinality bomb
)

Correct: Keeping label cardinality bounded

# GOOD — use only bounded-cardinality labels; track per-user in logs/traces
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Requests",
    ["method", "endpoint", "status"],  # all bounded enums
)
# For per-user analytics, use log fields or trace attributes instead
logger.info("request_completed", user_id=user_id, duration_ms=42)
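
The effect of one unbounded label is easier to see with numbers. In the worst case, the number of time series for a metric is the product of the distinct values of each label. The cardinalities below are hypothetical, chosen only to show the order of magnitude.

```python
from math import prod

def max_series(label_cardinalities):
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_cardinalities.values())

bounded = {"method": 5, "endpoint": 40, "status": 10}
with_user_id = {"method": 5, "endpoint": 40, "user_id": 1_000_000}

print(max_series(bounded))       # 2000
print(max_series(with_user_id))  # 200000000
```

Real deployments rarely hit the full product, since not every label combination occurs, but the bound shows why a single unbounded label dominates everything else.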

Wrong: Sampling traces at 100% in production

# BAD — storing every single trace in production is extremely expensive
# A service handling 10K req/s generates ~864M traces/day (10,000 x 86,400 s), each with one or more spans
processors:
  # No sampling configured = 100% of traces stored

Correct: Tail-based sampling for intelligent trace retention

# GOOD — sample based on error status, latency, and a baseline rate
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
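
The three policies above combine as a logical OR: a trace is kept if any policy matches. The sketch below mirrors that decision in plain Python to show the logic only. It is not the Collector's implementation; keep_trace and the span fields are hypothetical, and the baseline sample is derived from a hash of the trace ID so the decision is repeatable.

```python
import hashlib

def keep_trace(trace_id, spans, latency_threshold_ms=1000, baseline_pct=5):
    """Decide whether to keep a completed trace, given all of its spans."""
    # Policy 1: always keep traces that contain an error.
    if any(span["status"] == "ERROR" for span in spans):
        return True
    # Policy 2: keep traces whose end-to-end duration exceeds the threshold.
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms > latency_threshold_ms:
        return True
    # Policy 3: keep a fixed percentage of everything else.
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < baseline_pct

failed = [{"status": "ERROR", "start_ms": 0, "end_ms": 40}]
slow = [{"status": "OK", "start_ms": 0, "end_ms": 1500}]
print(keep_trace("trace-a", failed), keep_trace("trace-b", slow))

# Volume arithmetic from the anti-pattern above: 10,000 req/s over one day.
traces_per_day = 10_000 * 86_400
print(traces_per_day, int(traces_per_day * 0.05))
```

Because the decision needs every span of a trace, tail-based sampling requires that all spans of a given trace reach the same Collector instance.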

Common Pitfalls

High-cardinality label explosion: Using unbounded values (user IDs, request IDs, URLs with query params) as Prometheus labels causes memory exhaustion. Fix: keep label values to bounded enums; use metric_relabel_configs to drop high-cardinality labels at scrape time, or bucket them with the hashmod action. [src2]
No log-level gating in production: Emitting DEBUG logs for all services in production generates 5-10x more data, spiking storage costs. Fix: set default to WARN or INFO per service; use dynamic log-level adjustment via feature flags or config reload. [src7]
Missing correlation IDs across signals: Logs, metrics, and traces without shared identifiers (trace_id) are impossible to correlate during incidents. Fix: propagate trace_id via OpenTelemetry context; inject into all structured log fields automatically. [src1]
Single-node Elasticsearch with no ILM: Running Elasticsearch on a single node with default settings leads to disk exhaustion within weeks. Fix: configure index lifecycle management (ILM) with hot/warm/cold/delete phases from day one. [src3]
Alerting on raw metrics instead of SLOs: Alerts on CPU > 80% or error count > 100 produce constant false positives. Fix: define SLI/SLO-based alerts (e.g., error budget burn rate > 2x) that reflect actual user impact. [src5]
No buffer between producers and consumers: A direct Filebeat-to-Elasticsearch pipeline means Elasticsearch downtime causes log loss. Fix: insert Kafka or Redis Streams as a buffer to decouple ingestion from storage. [src3]
Ignoring log rotation and retention: Unbounded log storage fills disks, causes outages, and violates compliance policies. Fix: configure logrotate locally, ILM in Elasticsearch, and retention in Loki (for example retention_period: 720h under limits_config, with retention enabled on the compactor). [src4]
Alert fatigue from noisy thresholds: Too many low-priority alerts desensitize the on-call team. Fix: classify alerts by severity (critical/warning/info), route only critical to PagerDuty, aggregate warnings into daily digests. [src7]
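
The burn-rate alert mentioned in the SLO pitfall above comes down to one division: the observed error ratio over the error budget (1 minus the SLO target). The sketch below shows the arithmetic with hypothetical numbers; burn_rate is an illustrative helper, and in production the same ratio would normally be computed in a Prometheus alerting rule.

```python
def burn_rate(errors, total, slo_target):
    """Return how fast the error budget is being spent.

    1.0 means the budget lasts exactly the SLO period; 2.0 means it is
    consumed twice as fast.
    """
    if total == 0:
        return 0.0
    error_ratio = errors / total
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget

# 99.9% availability SLO; 30 failures out of 10,000 requests in the window.
rate = burn_rate(errors=30, total=10_000, slo_target=0.999)
print(round(rate, 2), rate > 2)  # page when the budget burns more than 2x too fast
```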

Diagnostic Commands

# Check OpenTelemetry Collector health
curl -s http://localhost:13133/ | jq .

# Verify Prometheus targets are being scraped
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check Elasticsearch cluster health
curl -s http://localhost:9200/_cluster/health | jq '{status, number_of_nodes, active_shards}'

# Verify Loki is receiving logs
curl -s http://localhost:3100/ready

# Query recent logs from Loki via LogQL
curl -G -s http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={service_name="payment-service"} |= "error"' \
  --data-urlencode 'limit=10' | jq .

# Check Grafana datasource connectivity
curl -u admin:changeme -s http://localhost:3000/api/datasources | jq '.[].name'

# Verify Tempo is receiving traces
curl -s http://localhost:3200/ready

# Check Prometheus storage TSDB stats
curl -s http://localhost:9090/api/v1/status/tsdb | jq '{headChunks: .data.headStats.chunkCount, seriesCount: .data.headStats.numSeries}'

# Monitor OTel Collector pipeline metrics
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted

Version History & Compatibility

Component	Current Version	Status	Key Change
OpenTelemetry Collector	0.96.x (2024)	Stable (logs GA since 2024-04)	Logs signal reached GA; unified pipeline for all three pillars
Prometheus	3.x (2024+)	Current	Native OTLP ingestion, UTF-8 metric names, new UI
Elasticsearch	8.x (2022+)	Current	Dual SSPL/Elastic License (in place since 7.11), with AGPLv3 added as an option in 2024; Lucene 9; vector search
Grafana Loki	3.x (2024+)	Current	OTLP native ingestion; structured metadata; bloom filters
Grafana Tempo	2.x (2023+)	Current	TraceQL query language; vParquet4 format
Grafana	11.x (2024+)	Current	Unified alerting; Explore Logs/Traces/Metrics

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Building microservices with >3 services that interact	Single monolith with <1K req/day	Simple file-based logging with logrotate
Need to correlate events across services during incidents	Static websites or JAMstack with no backend	CDN analytics (Cloudflare, Vercel)
Compliance requires audit logs with retention guarantees	Prototyping or hackathon with no uptime SLA	Console.log with local files
Operating Kubernetes clusters in production	Running a single Docker container locally	Docker logs command
Need SLO-based alerting with error budget tracking	Team has no on-call rotation or incident process	Simple uptime monitoring (Uptime Robot, Pingdom)

Important Caveats

OpenTelemetry SDK auto-instrumentation adds 1-5% latency overhead per service; benchmark before enabling in latency-sensitive hot paths
Elasticsearch 8.x is not Apache 2.0: it is dual-licensed under SSPL and the Elastic License, with AGPLv3 added as a further option in 2024; verify legal compliance for your use case. OpenSearch is the Apache 2.0 fork
Prometheus is pull-based by default — for short-lived jobs (serverless, batch), use the Pushgateway or switch to OTel push-based collection
Loki does not index log content (only labels) — complex full-text searches across log bodies are significantly slower than Elasticsearch
Managed observability services (Datadog, New Relic, Splunk) cost 3-10x more than self-hosted at scale but eliminate operational overhead — break-even is typically at 50-200 GB/day depending on team size
Trace sampling decisions must be made carefully: 100% sampling is prohibitively expensive, but too-aggressive sampling will miss rare errors