Logging and Monitoring Infrastructure Design
How do I design a logging and monitoring infrastructure?
TL;DR
- Bottom line: A production-grade observability stack requires three pillars — structured logs, time-series metrics, and distributed traces — unified through OpenTelemetry and stored in purpose-built backends (ELK/Loki for logs, Prometheus/Mimir for metrics, Jaeger/Tempo for traces).
- Key tool/command: otel-collector (OpenTelemetry Collector) as the universal telemetry pipeline: it receives, processes, and exports all three signal types.
- Watch out for: Logging everything at DEBUG level in production. It will overwhelm storage, spike costs, and bury real issues in noise.
- Works with: Any language with OpenTelemetry SDK support (Go, Java, Python, Node.js, .NET, Rust, C++, Ruby, PHP, Swift, Erlang); Kubernetes and VM deployments; all major cloud providers.
Constraints
- Never log secrets, tokens, passwords, PII, or credit card numbers — sanitize all sensitive fields before emission
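- Retention policies must comply with the regulations and contracts that apply to you. GDPR sets no fixed retention period; it requires that personal data be kept no longer than necessary for its purpose. SOC 2 likewise prescribes no specific duration, although many organizations keep about one year of audit logs to cover the audit window. Confirm exact periods with your compliance team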
- Always use structured (JSON) logging in production — unstructured plaintext breaks automated parsing, indexing, and alerting
- Instrument with OpenTelemetry SDK when possible — it is vendor-neutral and prevents backend lock-in
- Set per-service log-level controls — global DEBUG across all services will cost 5-10x more in storage and processing
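As an illustration of the "sanitize before emission" constraint, here is a minimal sketch that redacts sensitive keys inside a standard-library logging formatter. The names SENSITIVE_KEYS, redact, and JsonRedactingFormatter are hypothetical, and the deny-list is an assumption you would replace with your own schema.

```python
import json
import logging

# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn", "card_number"}


def redact(value):
    """Return a copy of value with sensitive dict keys masked, recursively."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


class JsonRedactingFormatter(logging.Formatter):
    """Render each record as one JSON line with its `fields` dict redacted."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            **redact(getattr(record, "fields", {})),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonRedactingFormatter())
logger = logging.getLogger("redaction_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Structured fields travel in `extra`; the formatter masks them before output.
logger.info("user_login", extra={"fields": {"user_id": "u-123", "password": "hunter2"}})
# {"level": "INFO", "message": "user_login", "user_id": "u-123", "password": "[REDACTED]"}
```

Redacting in the formatter means a forgotten call site cannot leak a field that is on the deny-list, but it does not catch secrets embedded in free-text messages.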
Quick Reference
| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Log Collection Agent | Ships logs from hosts/containers to aggregator | Fluent Bit, Filebeat, Vector, OTel Collector | DaemonSet per node (K8s) or sidecar |
| Log Aggregation | Centralizes, parses, enriches log streams | Logstash, Fluentd, OTel Collector | Horizontal replicas behind buffer (Kafka) |
| Log Storage & Search | Indexes and queries log data | Elasticsearch, Grafana Loki, ClickHouse | Elasticsearch: shard-per-index; Loki: label-based partitioning |
| Metrics Collection | Scrapes/receives numeric time-series data | Prometheus, OTel Collector, Telegraf | Federation or Thanos/Mimir for multi-cluster |
| Metrics Storage | Long-term time-series persistence | Prometheus TSDB, Thanos, Mimir, VictoriaMetrics | Remote-write to durable store; compaction + downsampling |
| Trace Collection | Captures distributed request spans | OTel SDK + Collector, Jaeger Agent | Tail-based sampling at Collector tier |
| Trace Storage | Stores and indexes span data | Jaeger, Grafana Tempo, Zipkin | Tempo: object storage (S3/GCS); Jaeger: Elasticsearch/Cassandra |
| Visualization | Dashboards, exploration, correlation | Grafana, Kibana, Datadog UI | Read replicas; CDN for static assets |
| Alerting | Evaluates rules, routes notifications | Alertmanager, Grafana Alerting, PagerDuty | HA pairs with deduplication |
| Buffer/Queue | Decouples producers from consumers | Apache Kafka, Redis Streams, Amazon Kinesis | Kafka: partition-per-topic scaling |
| Service Mesh Telemetry | Auto-instruments inter-service traffic | Istio, Linkerd, Envoy (built-in) | Sidecar proxy per pod |
| Pipeline Orchestrator | Unified telemetry routing and processing | OpenTelemetry Collector, Vector | Gateway mode for centralized processing |
Decision Tree
START: Choose your logging backend
├── Budget < $500/month AND log volume < 50 GB/day?
│ ├── YES → Grafana Loki + Promtail (low resource, label-indexed)
│ └── NO ↓
├── Need full-text search across all log fields?
│ ├── YES → Elasticsearch (ELK Stack) — powerful query language (KQL/Lucene)
│ └── NO ↓
├── Want zero operational overhead?
│ ├── YES → Managed SaaS (Datadog, Splunk Cloud, AWS CloudWatch)
│ └── NO ↓
├── Running on Kubernetes?
│ ├── YES → Loki + OTel Collector DaemonSet (native K8s metadata enrichment)
│ └── NO ↓
├── Log volume > 1 TB/day AND need SQL analytics?
│ ├── YES → ClickHouse + Vector (columnar storage, fast aggregations)
│ └── NO ↓
└── DEFAULT → ELK Stack (most documentation, largest community)
METRICS BACKEND:
├── Already using Grafana?
│ ├── YES → Prometheus + Grafana (native integration)
│ └── NO ↓
├── Multi-cluster / global federation needed?
│ ├── YES → Thanos or Grafana Mimir (long-term, multi-cluster Prometheus)
│ └── NO ↓
└── DEFAULT → Prometheus (pull-based, CNCF graduated, industry standard)
TRACES BACKEND:
├── Want minimal infrastructure?
│ ├── YES → Grafana Tempo (object storage, no indexing required)
│ └── NO ↓
├── Need deep trace analytics and search?
│ ├── YES → Jaeger with Elasticsearch backend
│ └── NO ↓
└── DEFAULT → OTel Collector → Tempo (simplest path)
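The logging-backend tree is evaluated top to bottom, and the first matching branch wins. A small sketch makes that ordering explicit; the function name and parameters are hypothetical, and the thresholds are the ones given in the tree.

```python
def choose_log_backend(
    monthly_budget_usd: float,
    volume_gb_per_day: float,
    needs_full_text_search: bool = False,
    wants_zero_ops: bool = False,
    on_kubernetes: bool = False,
    needs_sql_analytics: bool = False,
) -> str:
    """Walk the logging-backend branches in order; the first match wins."""
    if monthly_budget_usd < 500 and volume_gb_per_day < 50:
        return "Grafana Loki + Promtail"
    if needs_full_text_search:
        return "Elasticsearch (ELK Stack)"
    if wants_zero_ops:
        return "Managed SaaS"
    if on_kubernetes:
        return "Loki + OTel Collector DaemonSet"
    if volume_gb_per_day > 1000 and needs_sql_analytics:  # 1 TB/day = 1000 GB/day
        return "ClickHouse + Vector"
    return "ELK Stack"


print(choose_log_backend(300, 20))                                # Grafana Loki + Promtail
print(choose_log_backend(5000, 200, on_kubernetes=True))          # Loki + OTel Collector DaemonSet
print(choose_log_backend(20000, 2000, needs_sql_analytics=True))  # ClickHouse + Vector
```

Because order matters, a small team under the budget threshold lands on Loki even if it would also like full-text search; reorder the checks if a requirement is non-negotiable for you.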
Step-by-Step Guide
1. Define your telemetry signals and data model
Establish which of the three pillars — logs, metrics, traces — you need from day one. Most production systems need all three. Define a consistent naming convention for metrics (service_name_operation_unit), log fields (timestamp, level, service, trace_id, message), and trace attributes (service.name, deployment.environment). [src1]
# OpenTelemetry resource attributes (define once per service)
resource:
  attributes:
    service.name: "payment-service"
    service.version: "2.1.0"
    deployment.environment: "production"
    service.namespace: "checkout"
Verify: All services emit service.name and deployment.environment in every telemetry signal.
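A convention only helps if it is checked. The sketch below shows one possible lint for the log fields and metric naming pattern described in this step; the helper names are hypothetical, and the regular expression is an assumption (lowercase snake_case with at least three segments, matching service_name_operation_unit).

```python
import re

# Log fields listed in this step; every structured log line should carry them.
REQUIRED_LOG_FIELDS = ("timestamp", "level", "service", "trace_id", "message")

# Assumption: lowercase snake_case with at least three underscore-separated segments.
METRIC_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+){2,}$")


def missing_log_fields(record: dict) -> list:
    """Return the required field names that are absent from a log record."""
    return [field for field in REQUIRED_LOG_FIELDS if field not in record]


def is_valid_metric_name(name: str) -> bool:
    """Check a metric name against the assumed naming pattern."""
    return METRIC_NAME.fullmatch(name) is not None


print(missing_log_fields({"timestamp": "2025-01-01T00:00:00Z", "level": "info", "message": "ok"}))
# ['service', 'trace_id']
print(is_valid_metric_name("payment_charge_duration_seconds"))  # True
print(is_valid_metric_name("PaymentLatency"))                   # False
```

A check like this could run in CI against sample log output or a metrics registry dump, so drift is caught before it reaches dashboards.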
2. Instrument applications with OpenTelemetry SDKs
Add the OpenTelemetry SDK to each service. Use auto-instrumentation for common frameworks (Express, Flask, Spring Boot) and add manual spans for business-critical paths. [src1]
# Python: install OpenTelemetry packages
pip install opentelemetry-api==1.29.0 \
opentelemetry-sdk==1.29.0 \
opentelemetry-exporter-otlp==1.29.0 \
opentelemetry-instrumentation-flask==0.50b0 \
opentelemetry-instrumentation-requests==0.50b0
# Node.js: install OpenTelemetry packages
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/auto-instrumentations-node
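Verify: curl -i -X POST -H "Content-Type: application/json" -d '{"resourceSpans":[]}' http://localhost:4318/v1/traces returns HTTP 200 from the local OTel Collector. The OTLP/HTTP endpoint accepts POST only, so a plain GET is rejected.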
3. Deploy the OpenTelemetry Collector as the central pipeline
The OTel Collector acts as a vendor-neutral proxy that receives telemetry from all services, processes it (batching, sampling, enrichment), and exports to your chosen backends. Deploy as a DaemonSet in Kubernetes or a sidecar/gateway in VM environments. [src6]
# otel-collector-config.yaml
extensions:
  health_check:
    endpoint: "0.0.0.0:13133"   # serves the health endpoint used in the Verify step
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
exporters:
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
service:
  extensions: [health_check]
  pipelines:
    # memory_limiter runs first so it can refuse data before other processors allocate
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlphttp/loki]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/tempo]
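Verify: curl -s http://localhost:13133/ returns a JSON body containing "status":"Server available" from the Collector health check (this requires the health_check extension to be enabled in the Collector config).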
4. Set up log storage backend
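Choose between Elasticsearch (full-text search), Loki (label-indexed, low cost), or a managed service. If you choose Elasticsearch, configure index lifecycle management (ILM) to roll over and delete old indices automatically; in Loki, the equivalent is a retention period enforced by the compactor. [src3] [src4]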
// Elasticsearch ILM policy example
PUT _ilm/policy/logs-policy
{
"policy": {
"phases": {
"hot": { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
"warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
"cold": { "min_age": "30d", "actions": { "searchable_snapshot": { "snapshot_repository": "s3-repo" } } },
"delete": { "min_age": "90d", "actions": { "delete": {} } }
}
}
}
Verify: curl -s http://localhost:9200/_ilm/policy/logs-policy | jq . returns the stored policy definition with all four phases.
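To reason about what the policy does to a given index, it can help to map index age to phase. The sketch below hard-codes the min_age values from logs-policy and assumes ages are counted from rollover, which is how ILM measures min_age for rolled-over indices. The function names are hypothetical, and the storage estimate is deliberately rough (primary data only, no replicas or compression).

```python
# Phase boundaries copied from logs-policy, in days since rollover (newest rule last).
PHASE_MIN_AGE_DAYS = [("delete", 90), ("cold", 30), ("warm", 7), ("hot", 0)]


def ilm_phase(days_since_rollover: float) -> str:
    """Return the phase an index of the given age should be in."""
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if days_since_rollover >= min_age:
            return phase
    raise ValueError("age must be non-negative")


def retained_gb(daily_ingest_gb: float, delete_after_days: int = 90) -> float:
    """Rough upper bound on primary data held before the delete phase starts."""
    return daily_ingest_gb * delete_after_days


for age in (0, 7, 45, 90):
    print(age, ilm_phase(age))
# 0 hot
# 7 warm
# 45 cold
# 90 delete
print(retained_gb(50))  # 4500
```

At 50 GB/day the policy keeps roughly 4.5 TB of primary data on hand, which is a useful sanity check before sizing hot and warm nodes.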
5. Configure metrics collection with Prometheus
Deploy Prometheus with service discovery for your environment (Kubernetes annotations, Consul, or file-based targets). Define recording rules for pre-aggregation and alerting rules for SLO-based alerts. [src2]
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
Verify: curl http://localhost:9090/api/v1/targets shows all expected targets as "health": "up".
6. Build Grafana dashboards for visualization and alerting
Create dashboards following the RED method (Rate, Errors, Duration) for services and USE method (Utilization, Saturation, Errors) for infrastructure. Set up alert rules with meaningful thresholds and route to appropriate channels. [src2] [src5]
# Deploy Grafana with provisioned datasources
# (absolute host path, since older Docker versions reject relative bind mounts)
docker run -d --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=changeme \
  -v "$(pwd)/grafana/provisioning:/etc/grafana/provisioning" \
  grafana/grafana:11.4.0
# Verify datasource connectivity (URL quoted so the shell does not expand "?")
curl -u admin:changeme \
  "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"
Verify: Grafana UI at http://localhost:3000 shows data sources as "connected" (green).
Code Examples
Python: Structured Logging with OpenTelemetry Context
# Input: Application events during request handling
# Output: JSON log lines with trace_id, span_id, and structured fields
import structlog  # structlog==24.4.0
from opentelemetry import trace

# Configure structlog for JSON output with OTel context
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(20),  # 20 == logging.INFO
)
logger = structlog.get_logger()


def process_order(order_id: str, amount: float):
    """Log with trace context bound to a request-scoped logger."""
    span = trace.get_current_span()
    ctx = span.get_span_context()
    # bind() returns a new logger; keep the result so every log below includes these fields
    log = logger.bind(
        trace_id=format(ctx.trace_id, "032x"),
        span_id=format(ctx.span_id, "016x"),
        order_id=order_id,
    )
    log.info("order_processing_started", amount=amount)
    try:
        # ... business logic ...
        log.info("order_processing_completed", amount=amount)
    except Exception as e:
        log.error("order_processing_failed",
                  error=str(e), error_type=type(e).__name__)
        raise
Node.js: Structured Logging with Pino and OpenTelemetry
// Input: HTTP requests to an Express service
// Output: JSON log lines with trace context, request metadata
const express = require("express");
const pino = require("pino");
const { trace } = require("@opentelemetry/api");

const app = express();
const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: ["req.headers.authorization", "body.password"],
});

// Return a child logger carrying the active span's IDs, if a span is active
function withTraceContext(log) {
  const span = trace.getActiveSpan();
  if (!span) return log;
  const ctx = span.spanContext();
  return log.child({
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}

// Express middleware example
app.use((req, res, next) => {
  req.log = withTraceContext(logger).child({
    method: req.method,
    path: req.url,
    request_id: req.headers["x-request-id"],
  });
  req.log.info("request_received");
  next();
});
Python: Prometheus Metrics with Labels
# Input: HTTP request handling in a Flask/FastAPI service
# Output: Prometheus metrics exposed at /metrics endpoint
from prometheus_client import ( # prometheus-client==0.21.1
Counter, Histogram, Gauge, start_http_server
)
# RED method metrics for services
REQUEST_COUNT = Counter(
"http_requests_total",
"Total HTTP requests",
["method", "endpoint", "status"],
)
REQUEST_DURATION = Histogram(
"http_request_duration_seconds",
"HTTP request duration in seconds",
["method", "endpoint"],
buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
# USE method metrics for resources
QUEUE_SIZE = Gauge(
"task_queue_size",
"Current number of tasks in the processing queue",
["queue_name"],
)
def track_request(method, endpoint, status, duration):
    """Record RED metrics for a completed request."""
    REQUEST_COUNT.labels(
        method=method, endpoint=endpoint, status=status
    ).inc()
    REQUEST_DURATION.labels(
        method=method, endpoint=endpoint
    ).observe(duration)

# Expose /metrics on port 8000 (9090 is the Prometheus server's own default port)
start_http_server(8000)
YAML: Complete Docker Compose Observability Stack
# Input: Docker environment needing full observability
# Output: Running Loki + Prometheus + Tempo + Grafana + OTel Collector
version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "13133:13133" # Health check
  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
  prometheus:
    image: prom/prometheus:v3.1.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    ports:
      - "9090:9090"
  tempo:
    image: grafana/tempo:2.7.0
    command: ["-config.file=/etc/tempo/config.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"
  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
volumes:
  loki-data:
  prom-data:
  tempo-data:
  grafana-data:
Anti-Patterns
Wrong: Unstructured string concatenation in logs
# BAD — unstructured text makes automated parsing impossible
import logging
logger = logging.getLogger(__name__)
def process_payment(user_id, amount):
    logger.info("Processing payment for user " + user_id + " amount: $" + str(amount))
    # Output: "Processing payment for user u-123 amount: $49.99"
    # Cannot query by user_id or amount without regex parsing
Correct: Structured JSON logging with typed fields
# GOOD — structured fields are indexable, queryable, and alertable
import structlog
logger = structlog.get_logger()
def process_payment(user_id: str, amount: float):
    logger.info("payment_processing",
                user_id=user_id, amount=amount, currency="USD")
    # Output: {"event":"payment_processing","user_id":"u-123","amount":49.99,"currency":"USD"}
    # Every field is independently queryable
Wrong: Logging sensitive data without redaction
// BAD — PII and secrets in plain text logs
logger.info("User login", {
email: user.email, // PII
password: req.body.password, // credential
token: session.jwt, // secret
ssn: user.socialSecurity, // regulated data
});
Correct: Redacting sensitive fields before logging
// GOOD — redact sensitive fields, log only safe identifiers
const pino = require("pino");
const logger = pino({
redact: ["password", "token", "ssn", "*.authorization"],
});
// pino takes the merge object first and the message second
logger.info({
  user_id: user.id, // safe identifier
  email: "[REDACTED]", // or omit entirely
  ip: req.ip, // may need consent in EU
  login_method: "password",
}, "User login");
Wrong: Using a single high-cardinality metric label
# BAD — user_id as a Prometheus label creates millions of time series
# This will crash Prometheus or cause massive memory usage
REQUEST_COUNT = Counter(
"http_requests_total",
"Requests",
["method", "endpoint", "user_id"], # user_id = cardinality bomb
)
Correct: Keeping label cardinality bounded
# GOOD — use only bounded-cardinality labels; track per-user in logs/traces
REQUEST_COUNT = Counter(
"http_requests_total",
"Requests",
["method", "endpoint", "status"], # all bounded enums
)
# For per-user analytics, use log fields or trace attributes instead
logger.info("request_completed", user_id=user_id, duration_ms=42)
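The reason one unbounded label is so damaging is that the worst-case number of time series is the product of the distinct values of every label. A quick back-of-the-envelope sketch follows; the cardinalities are hypothetical illustration values, not measurements.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Worst-case time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for illustration only.
bounded = {"method": 5, "endpoint": 50, "status": 6}
unbounded = {"method": 5, "endpoint": 50, "user_id": 1_000_000}

print(series_count(bounded))    # 1500
print(series_count(unbounded))  # 250000000
```

Real series counts are usually below this bound because not every combination occurs, but the bound shows how a single label moves the estimate from thousands to hundreds of millions.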
Wrong: Sampling traces at 100% in production
# BAD — storing every single trace in production is extremely expensive
# A service handling 10K req/s generates ~864M traces/day (10,000 x 86,400 s),
# and each trace usually contains several spans
processors:
  # No sampling configured = 100% of traces stored
Correct: Tail-based sampling for intelligent trace retention
# GOOD — sample based on error status, latency, and a baseline rate
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
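The three policies combine with OR semantics: a trace is kept if any policy matches. The sketch below mirrors that decision in plain Python to make the logic concrete. It is not the Collector's implementation (the real probabilistic policy decides from the trace ID rather than a random draw), and TraceSummary and keep_trace are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """What is known once all spans of a trace have arrived."""
    has_error: bool
    duration_ms: float


def keep_trace(trace, threshold_ms=1000, baseline_pct=5.0, rng=random.random):
    """Keep a trace if it has an error, is slow, or wins the baseline draw."""
    if trace.has_error:                    # errors-always
        return True
    if trace.duration_ms >= threshold_ms:  # slow-requests
        return True
    return rng() * 100 < baseline_pct      # baseline-sample


traces = [
    TraceSummary(has_error=True, duration_ms=20),
    TraceSummary(has_error=False, duration_ms=2500),
    TraceSummary(has_error=False, duration_ms=20),
]
for t in traces:
    print(t, keep_trace(t))
```

The first two traces are always kept; the third is kept about 5% of the time, which is why its printed result varies between runs.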
Common Pitfalls
- High-cardinality label explosion: Using unbounded values (user IDs, request IDs, URLs with query params) as Prometheus labels causes memory exhaustion. Fix: keep label values to bounded enums; use metric_relabel_configs to drop or hash high-cardinality labels. [src2]
- No log-level gating in production: Emitting DEBUG logs for all services in production generates 5-10x more data, spiking storage costs. Fix: set the default to WARN or INFO per service; use dynamic log-level adjustment via feature flags or config reload. [src7]
- Missing correlation IDs across signals: Logs, metrics, and traces without shared identifiers (trace_id) are very hard to correlate during incidents. Fix: propagate trace_id via OpenTelemetry context; inject it into all structured log fields automatically. [src1]
- Single-node Elasticsearch with no ILM: Running Elasticsearch on a single node with default settings leads to disk exhaustion within weeks. Fix: configure index lifecycle management (ILM) with hot/warm/cold/delete phases from day one. [src3]
- Alerting on raw metrics instead of SLOs: Alerts on CPU > 80% or error count > 100 produce constant false positives. Fix: define SLI/SLO-based alerts (e.g., error budget burn rate > 2x) that reflect actual user impact. [src5]
- No buffer between producers and consumers: A direct Filebeat-to-Elasticsearch pipeline means a long Elasticsearch outage can cause log loss. Fix: insert Kafka or Redis Streams as a buffer to decouple ingestion from storage. [src3]
- Ignoring log rotation and retention: Unbounded log storage fills disks, causes outages, and violates compliance policies. Fix: configure logrotate locally, ILM in Elasticsearch, and retention rules in Loki (retention_period: 720h). [src4]
- Alert fatigue from noisy thresholds: Too many low-priority alerts desensitize the on-call team. Fix: classify alerts by severity (critical/warning/info), route only critical to PagerDuty, aggregate warnings into daily digests. [src7]
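The burn-rate alert mentioned in the SLO pitfall divides the observed error ratio by the error budget (1 minus the SLO target). A minimal sketch of that arithmetic follows; burn_rate is a hypothetical helper, and the request counts are made-up illustration values.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if total == 0:
        return 0.0  # no traffic, so no budget is being spent
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


# Hypothetical window: 10,000 requests with 50 failures against a 99.9% SLO.
rate = burn_rate(errors=50, total=10_000, slo_target=0.999)
print(round(rate, 2))  # 5.0
print(rate > 2)        # True
```

A burn rate of 1 spends the budget exactly over the SLO period; here the budget is being consumed about five times too fast, which is above the 2x threshold used as the example in the pitfall.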
Diagnostic Commands
# Check OpenTelemetry Collector health
curl -s http://localhost:13133/ | jq .
# Verify Prometheus targets are being scraped
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Check Elasticsearch cluster health
curl -s http://localhost:9200/_cluster/health | jq '{status, number_of_nodes, active_shards}'
# Verify Loki is receiving logs
curl -s http://localhost:3100/ready
# Query recent logs from Loki via LogQL
curl -G -s http://localhost:3100/loki/api/v1/query_range \
--data-urlencode 'query={service_name="payment-service"} |= "error"' \
--data-urlencode 'limit=10' | jq .
# Check Grafana datasource connectivity
curl -u admin:changeme -s http://localhost:3000/api/datasources | jq '.[].name'
# Verify Tempo is receiving traces
curl -s http://localhost:3200/ready
# Check Prometheus storage TSDB stats
curl -s http://localhost:9090/api/v1/status/tsdb | jq '{headChunks: .data.headStats.numChunks, seriesCount: .data.headStats.numSeries}'
# Monitor OTel Collector pipeline metrics
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted
Version History & Compatibility
| Component | Current Version | Status | Key Change |
|---|---|---|---|
| OpenTelemetry Collector | 0.96.x (2024) | Stable (logs GA since 2024-04) | Logs signal reached GA; unified pipeline for all three pillars |
| Prometheus | 3.x (2025+) | Current | Native OTLP ingestion, UTF-8 metric names, new UI |
| Elasticsearch | 8.x (2022+) | Current | Lucene 9; vector search; dual SSPL/Elastic License since 7.11, AGPLv3 option added in 2024 |
| Grafana Loki | 3.x (2024+) | Current | OTLP native ingestion; structured metadata; bloom filters |
| Grafana Tempo | 2.x (2023+) | Current | TraceQL query language; vParquet4 format |
| Grafana | 11.x (2024+) | Current | Unified alerting; Explore Logs/Traces/Metrics |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Building microservices with >3 services that interact | Single monolith with <1K req/day | Simple file-based logging with logrotate |
| Need to correlate events across services during incidents | Static websites or JAMstack with no backend | CDN analytics (Cloudflare, Vercel) |
| Compliance requires audit logs with retention guarantees | Prototyping or hackathon with no uptime SLA | console.log with local files |
| Operating Kubernetes clusters in production | Running a single Docker container locally | docker logs command |
| Need SLO-based alerting with error budget tracking | Team has no on-call rotation or incident process | Simple uptime monitoring (Uptime Robot, Pingdom) |
Important Caveats
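- OpenTelemetry SDK auto-instrumentation adds measurable overhead (figures of 1-5% added latency per service are commonly cited, but the real cost depends on language, framework, and span volume); benchmark before enabling it in latency-sensitive hot paths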
- Elasticsearch 8.x is not Apache 2.0: it is dual-licensed under SSPL and Elastic License 2.0, with AGPLv3 added as an option in 2024. Verify legal compliance for your use case; OpenSearch is the Apache 2.0 fork
- Prometheus is pull-based by default — for short-lived jobs (serverless, batch), use the Pushgateway or switch to OTel push-based collection
- Loki does not index log content (only labels) — complex full-text searches across log bodies are significantly slower than Elasticsearch
- Managed observability services (Datadog, New Relic, Splunk) cost 3-10x more than self-hosted at scale but eliminate operational overhead — break-even is typically at 50-200 GB/day depending on team size
- Trace sampling decisions must be made carefully: 100% sampling is prohibitively expensive, but too-aggressive sampling will miss rare errors