Logging and Monitoring Infrastructure Design

Type: Software Reference | Confidence: 0.93 | Sources: 7 | Verified: 2026-02-23 | Freshness: 2026-02-23

TL;DR

Constraints

Quick Reference

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Log Collection Agent | Ships logs from hosts/containers to aggregator | Fluent Bit, Filebeat, Vector, OTel Collector | DaemonSet per node (K8s) or sidecar |
| Log Aggregation | Centralizes, parses, enriches log streams | Logstash, Fluentd, OTel Collector | Horizontal replicas behind buffer (Kafka) |
| Log Storage & Search | Indexes and queries log data | Elasticsearch, Grafana Loki, ClickHouse | Elasticsearch: shard-per-index; Loki: label-based partitioning |
| Metrics Collection | Scrapes/receives numeric time-series data | Prometheus, OTel Collector, Telegraf | Federation or Thanos/Mimir for multi-cluster |
| Metrics Storage | Long-term time-series persistence | Prometheus TSDB, Thanos, Mimir, VictoriaMetrics | Remote-write to durable store; compaction + downsampling |
| Trace Collection | Captures distributed request spans | OTel SDK + Collector, Jaeger Agent | Tail-based sampling at Collector tier |
| Trace Storage | Stores and indexes span data | Jaeger, Grafana Tempo, Zipkin | Tempo: object storage (S3/GCS); Jaeger: Elasticsearch/Cassandra |
| Visualization | Dashboards, exploration, correlation | Grafana, Kibana, Datadog UI | Read replicas; CDN for static assets |
| Alerting | Evaluates rules, routes notifications | Alertmanager, Grafana Alerting, PagerDuty | HA pairs with deduplication |
| Buffer/Queue | Decouples producers from consumers | Apache Kafka, Redis Streams, Amazon Kinesis | Kafka: partition-per-topic scaling |
| Service Mesh Telemetry | Auto-instruments inter-service traffic | Istio, Linkerd, Envoy (built-in) | Sidecar proxy per pod |
| Pipeline Orchestrator | Unified telemetry routing and processing | OpenTelemetry Collector, Vector | Gateway mode for centralized processing |

Decision Tree

START: Choose your logging backend
├── Budget < $500/month AND log volume < 50 GB/day?
│   ├── YES → Grafana Loki + Promtail (low resource, label-indexed)
│   └── NO ↓
├── Need full-text search across all log fields?
│   ├── YES → Elasticsearch (ELK Stack) — powerful query language (KQL/Lucene)
│   └── NO ↓
├── Want zero operational overhead?
│   ├── YES → Managed SaaS (Datadog, Splunk Cloud, AWS CloudWatch)
│   └── NO ↓
├── Running on Kubernetes?
│   ├── YES → Loki + OTel Collector DaemonSet (native K8s metadata enrichment)
│   └── NO ↓
├── Log volume > 1 TB/day AND need SQL analytics?
│   ├── YES → ClickHouse + Vector (columnar storage, fast aggregations)
│   └── NO ↓
└── DEFAULT → ELK Stack (most documentation, largest community)

METRICS BACKEND:
├── Already using Grafana?
│   ├── YES → Prometheus + Grafana (native integration)
│   └── NO ↓
├── Multi-cluster / global federation needed?
│   ├── YES → Thanos or Grafana Mimir (long-term, multi-cluster Prometheus)
│   └── NO ↓
└── DEFAULT → Prometheus (pull-based, CNCF graduated, industry standard)

TRACES BACKEND:
├── Want minimal infrastructure?
│   ├── YES → Grafana Tempo (object storage, no indexing required)
│   └── NO ↓
├── Need deep trace analytics and search?
│   ├── YES → Jaeger with Elasticsearch backend
│   └── NO ↓
└── DEFAULT → OTel Collector → Tempo (simplest path)

Step-by-Step Guide

1. Define your telemetry signals and data model

Establish which of the three pillars — logs, metrics, traces — you need from day one. Most production systems need all three. Define a consistent naming convention for metrics (service_name_operation_unit), log fields (timestamp, level, service, trace_id, message), and trace attributes (service.name, deployment.environment). [src1]

# OpenTelemetry resource attributes (define once per service)
resource:
  attributes:
    service.name: "payment-service"
    service.version: "2.1.0"
    deployment.environment: "production"
    service.namespace: "checkout"

Verify: All services emit service.name and deployment.environment in every telemetry signal.
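These conventions can be machine-checked in CI. The snippet below is purely illustrative (the regex and required-field set are assumptions derived from the naming scheme above, not part of any OTel tooling):

```python
import re

# Illustrative sketch: snake_case metric names per the
# service_name_operation_unit convention described above.
METRIC_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")

# The conventional log fields listed above.
REQUIRED_LOG_FIELDS = {"timestamp", "level", "service", "trace_id", "message"}

def check_metric_name(name: str) -> bool:
    """Return True if a metric name follows the snake_case convention."""
    return bool(METRIC_NAME_RE.match(name))

def missing_log_fields(record: dict) -> set:
    """Return the conventional log fields absent from a log record."""
    return REQUIRED_LOG_FIELDS - record.keys()
```

For example, `check_metric_name("payment_service_charge_duration_seconds")` passes, while a camelCase or dotted name fails the check.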

2. Instrument applications with OpenTelemetry SDKs

Add the OpenTelemetry SDK to each service. Use auto-instrumentation for common frameworks (Express, Flask, Spring Boot) and add manual spans for business-critical paths. [src1]

# Python: install OpenTelemetry packages
pip install opentelemetry-api==1.29.0 \
            opentelemetry-sdk==1.29.0 \
            opentelemetry-exporter-otlp==1.29.0 \
            opentelemetry-instrumentation-flask==0.50b0 \
            opentelemetry-instrumentation-requests==0.50b0

# Node.js: install OpenTelemetry packages
npm install @opentelemetry/api \
            @opentelemetry/sdk-node \
            @opentelemetry/auto-instrumentations-node \
            @opentelemetry/exporter-trace-otlp-http
# (version pins omitted; pin to the latest mutually compatible releases)

Verify: curl -s -o /dev/null -w '%{http_code}' -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:4318/v1/traces prints 200 from the local OTel Collector (a plain GET returns 405 Method Not Allowed).

3. Deploy the OpenTelemetry Collector as the central pipeline

The OTel Collector acts as a vendor-neutral proxy that receives telemetry from all services, processes it (batching, sampling, enrichment), and exports to your chosen backends. Deploy it as a DaemonSet in Kubernetes, or as a per-host agent or central gateway on VMs. [src6]

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

extensions:
  health_check:
    endpoint: "0.0.0.0:13133"

service:
  extensions: [health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]

Verify: curl -s http://localhost:13133/ returns {"status":"Server available"} from the Collector health check.

4. Set up log storage backend

Choose between Elasticsearch (full-text search), Loki (label-indexed, low cost), or a managed service. Configure index lifecycle management (ILM) to auto-rotate and delete old indices. [src3] [src4]

// Elasticsearch ILM policy example
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm":   { "min_age": "7d",  "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
      "cold":   { "min_age": "30d", "actions": { "searchable_snapshot": { "snapshot_repository": "s3-repo" } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

Verify: curl -s http://localhost:9200/_ilm/policy/logs-policy | jq . shows the policy is active.
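If you opt for Loki instead, retention is enforced by the compactor rather than by ILM. A minimal sketch (values, paths, and store type are illustrative; Loki 3.x schema assumed):

```yaml
# Loki retention sketch (illustrative values)
limits_config:
  retention_period: 90d        # analogous to the 90d delete phase in the ILM policy
compactor:
  retention_enabled: true
  delete_request_store: filesystem
  working_directory: /loki/retention
```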

5. Configure metrics collection with Prometheus

Deploy Prometheus with service discovery for your environment (Kubernetes annotations, Consul, or file-based targets). Define recording rules for pre-aggregation and alerting rules for SLO-based alerts. [src2]

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Verify: curl http://localhost:9090/api/v1/targets shows all expected targets as "health": "up".
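The two rule files referenced in prometheus.yml might look like the sketch below; the metric names assume the http_requests_total conventions used elsewhere in this guide, and the 5% threshold is an illustrative SLO value:

```yaml
# recording_rules.yml: pre-aggregate RED metrics
groups:
  - name: red-method
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

# alerting_rules.yml: alert on the pre-aggregated error ratio
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio_rate5m > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error ratio above 5% for {{ $labels.job }}"
```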

6. Build Grafana dashboards for visualization and alerting

Create dashboards following the RED method (Rate, Errors, Duration) for services and USE method (Utilization, Saturation, Errors) for infrastructure. Set up alert rules with meaningful thresholds and route to appropriate channels. [src2] [src5]

# Deploy Grafana with provisioned datasources
docker run -d --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=changeme \
  -v ./grafana/provisioning:/etc/grafana/provisioning \
  grafana/grafana:11.4.0

# Verify datasource connectivity (quote the URL so the shell does not glob "?")
curl -u admin:changeme \
  "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"

Verify: Grafana UI at http://localhost:3000 shows data sources as "connected" (green).
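The provisioning directory mounted above can declare data sources declaratively. A sketch assuming the service names and ports used by the Docker Compose stack in this guide:

```yaml
# grafana/provisioning/datasources/datasources.yaml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    url: http://tempo:3200
```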

Code Examples

Python: Structured Logging with OpenTelemetry Context

# Input:  Application events during request handling
# Output: JSON log lines with trace_id, span_id, and structured fields

import structlog  # structlog==24.4.0
from opentelemetry import trace

# Configure structlog for JSON output with OTel context
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(20),  # 20 == logging.INFO
)

logger = structlog.get_logger()

def process_order(order_id: str, amount: float):
    """Log with automatic trace context propagation."""
    span = trace.get_current_span()
    ctx = span.get_span_context()
    # bind() returns a new logger; capture it so every log in this scope carries the context
    log = logger.bind(
        trace_id=format(ctx.trace_id, "032x"),
        span_id=format(ctx.span_id, "016x"),
        order_id=order_id,
    )
    log.info("order_processing_started", amount=amount)
    try:
        # ... business logic ...
        log.info("order_processing_completed", amount=amount)
    except Exception as e:
        log.error("order_processing_failed",
                  error=str(e), error_type=type(e).__name__)
        raise

Node.js: Structured Logging with Pino and OpenTelemetry

// Input:  HTTP requests to an Express service
// Output: JSON log lines with trace context, request metadata

const pino = require("pino");                    // structured JSON logger
const { trace } = require("@opentelemetry/api"); // OTel API for active-span lookup

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: ["req.headers.authorization", "body.password"],
});

function withTraceContext(log) {
  const span = trace.getActiveSpan();
  if (!span) return log;
  const ctx = span.spanContext();
  return log.child({
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}

// Express middleware example
app.use((req, res, next) => {
  req.log = withTraceContext(logger).child({
    method: req.method,
    path: req.url,
    request_id: req.headers["x-request-id"],
  });
  req.log.info("request_received");
  next();
});

Python: Prometheus Metrics with Labels

# Input:  HTTP request handling in a Flask/FastAPI service
# Output: Prometheus metrics exposed at /metrics endpoint

from prometheus_client import (  # prometheus-client==0.21.1
    Counter, Histogram, Gauge, start_http_server
)

# RED method metrics for services
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
# USE method metrics for resources
QUEUE_SIZE = Gauge(
    "task_queue_size",
    "Current number of tasks in the processing queue",
    ["queue_name"],
)

def track_request(method, endpoint, status, duration):
    """Record RED metrics for a completed request."""
    REQUEST_COUNT.labels(
        method=method, endpoint=endpoint, status=status
    ).inc()
    REQUEST_DURATION.labels(
        method=method, endpoint=endpoint
    ).observe(duration)

# Start the metrics server (port 8000 avoids clashing with Prometheus's own 9090)
start_http_server(8000)

YAML: Complete Docker Compose Observability Stack

# Input:  Docker environment needing full observability
# Output: Running Loki + Prometheus + Tempo + Grafana + OTel Collector

version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "13133:13133" # Health check

  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  prometheus:
    image: prom/prometheus:v3.1.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    ports:
      - "9090:9090"

  tempo:
    image: grafana/tempo:2.7.0
    command: ["-config.file=/etc/tempo/config.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"

  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana

volumes:
  loki-data:
  prom-data:
  tempo-data:
  grafana-data:

Anti-Patterns

Wrong: Unstructured string concatenation in logs

# BAD — unstructured text makes automated parsing impossible
import logging
logger = logging.getLogger(__name__)

def process_payment(user_id, amount):
    logger.info("Processing payment for user " + user_id + " amount: $" + str(amount))
    # Output: "Processing payment for user u-123 amount: $49.99"
    # Cannot query by user_id or amount without regex parsing

Correct: Structured JSON logging with typed fields

# GOOD — structured fields are indexable, queryable, and alertable
import structlog
logger = structlog.get_logger()

def process_payment(user_id: str, amount: float):
    logger.info("payment_processing",
                user_id=user_id, amount=amount, currency="USD")
    # Output: {"event":"payment_processing","user_id":"u-123","amount":49.99,"currency":"USD"}
    # Every field is independently queryable

Wrong: Logging sensitive data without redaction

// BAD — PII and secrets in plain text logs
logger.info({
  email: user.email,           // PII
  password: req.body.password, // credential
  token: session.jwt,          // secret
  ssn: user.socialSecurity,    // regulated data
}, "User login");

Correct: Redacting sensitive fields before logging

// GOOD — redact sensitive fields, log only safe identifiers
const pino = require("pino");
const logger = pino({
  redact: ["password", "token", "ssn", "*.authorization"],
});

logger.info({
  user_id: user.id,            // safe identifier
  email: "[REDACTED]",         // or omit entirely
  ip: req.ip,                  // may need consent in EU
  login_method: "password",
}, "User login");

Wrong: Using a single high-cardinality metric label

# BAD — user_id as a Prometheus label creates millions of time series
# This will crash Prometheus or cause massive memory usage
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Requests",
    ["method", "endpoint", "user_id"],  # user_id = cardinality bomb
)

Correct: Keeping label cardinality bounded

# GOOD — use only bounded-cardinality labels; track per-user in logs/traces
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Requests",
    ["method", "endpoint", "status"],  # all bounded enums
)
# For per-user analytics, use log fields or trace attributes instead
logger.info("request_completed", user_id=user_id, duration_ms=42)

Wrong: Sampling traces at 100% in production

# BAD — storing every single trace in production is extremely expensive
# A service handling 10K req/s receives ~864M requests/day, each producing multiple spans
processors:
  # No sampling configured = 100% of traces stored

Correct: Tail-based sampling for intelligent trace retention

# GOOD — sample based on error status, latency, and a baseline rate
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

Common Pitfalls

Diagnostic Commands

# Check OpenTelemetry Collector health
curl -s http://localhost:13133/ | jq .

# Verify Prometheus targets are being scraped
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check Elasticsearch cluster health
curl -s http://localhost:9200/_cluster/health | jq '{status, number_of_nodes, active_shards}'

# Verify Loki is receiving logs
curl -s http://localhost:3100/ready

# Query recent logs from Loki via LogQL
curl -G -s http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={service_name="payment-service"} |= "error"' \
  --data-urlencode 'limit=10' | jq .

# Check Grafana datasource connectivity
curl -u admin:changeme -s http://localhost:3000/api/datasources | jq '.[].name'

# Verify Tempo is receiving traces
curl -s http://localhost:3200/ready

# Check Prometheus storage TSDB stats
curl -s http://localhost:9090/api/v1/status/tsdb | jq '{chunkCount: .data.headStats.chunkCount, seriesCount: .data.headStats.numSeries}'

# Monitor OTel Collector pipeline metrics
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted

Version History & Compatibility

| Component | Current Version | Status | Key Change |
|---|---|---|---|
| OpenTelemetry Collector | 0.96.x (2026) | Stable (logs GA since 2024-04) | Logs signal reached GA; unified pipeline for all three pillars |
| Prometheus | 3.x (2024+) | Current | Native OTLP ingestion, UTF-8 metric names, new UI |
| Elasticsearch | 8.x (2022+) | Current | License changed to SSPL+Elastic (since 7.11); Lucene 9; vector search |
| Grafana Loki | 3.x (2024+) | Current | OTLP native ingestion; structured metadata; bloom filters |
| Grafana Tempo | 2.x (2023+) | Current | TraceQL query language; vParquet4 format |
| Grafana | 11.x (2024+) | Current | Unified alerting; Explore Logs/Traces/Metrics |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Building microservices with >3 services that interact | Single monolith with <1K req/day | Simple file-based logging with logrotate |
| Need to correlate events across services during incidents | Static websites or JAMstack with no backend | CDN analytics (Cloudflare, Vercel) |
| Compliance requires audit logs with retention guarantees | Prototyping or hackathon with no uptime SLA | console.log with local files |
| Operating Kubernetes clusters in production | Running a single Docker container locally | The docker logs command |
| Need SLO-based alerting with error budget tracking | Team has no on-call rotation or incident process | Simple uptime monitoring (Uptime Robot, Pingdom) |

Important Caveats

Related Units