The OpenTelemetry Collector (otel-collector) serves as the universal telemetry pipeline — it receives, processes, and exports all three signal types.

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Log Collection Agent | Ships logs from hosts/containers to aggregator | Fluent Bit, Filebeat, Vector, OTel Collector | DaemonSet per node (K8s) or sidecar |
| Log Aggregation | Centralizes, parses, enriches log streams | Logstash, Fluentd, OTel Collector | Horizontal replicas behind buffer (Kafka) |
| Log Storage & Search | Indexes and queries log data | Elasticsearch, Grafana Loki, ClickHouse | Elasticsearch: shard-per-index; Loki: label-based partitioning |
| Metrics Collection | Scrapes/receives numeric time-series data | Prometheus, OTel Collector, Telegraf | Federation or Thanos/Mimir for multi-cluster |
| Metrics Storage | Long-term time-series persistence | Prometheus TSDB, Thanos, Mimir, VictoriaMetrics | Remote-write to durable store; compaction + downsampling |
| Trace Collection | Captures distributed request spans | OTel SDK + Collector, Jaeger Agent | Tail-based sampling at Collector tier |
| Trace Storage | Stores and indexes span data | Jaeger, Grafana Tempo, Zipkin | Tempo: object storage (S3/GCS); Jaeger: Elasticsearch/Cassandra |
| Visualization | Dashboards, exploration, correlation | Grafana, Kibana, Datadog UI | Read replicas; CDN for static assets |
| Alerting | Evaluates rules, routes notifications | Alertmanager, Grafana Alerting, PagerDuty | HA pairs with deduplication |
| Buffer/Queue | Decouples producers from consumers | Apache Kafka, Redis Streams, Amazon Kinesis | Kafka: partition-per-topic scaling |
| Service Mesh Telemetry | Auto-instruments inter-service traffic | Istio, Linkerd, Envoy (built-in) | Sidecar proxy per pod |
| Pipeline Orchestrator | Unified telemetry routing and processing | OpenTelemetry Collector, Vector | Gateway mode for centralized processing |
START: Choose your logging backend
├── Budget < $500/month AND log volume < 50 GB/day?
│ ├── YES → Grafana Loki + Promtail (low resource, label-indexed)
│ └── NO ↓
├── Need full-text search across all log fields?
│ ├── YES → Elasticsearch (ELK Stack) — powerful query language (KQL/Lucene)
│ └── NO ↓
├── Want zero operational overhead?
│ ├── YES → Managed SaaS (Datadog, Splunk Cloud, AWS CloudWatch)
│ └── NO ↓
├── Running on Kubernetes?
│ ├── YES → Loki + OTel Collector DaemonSet (native K8s metadata enrichment)
│ └── NO ↓
├── Log volume > 1 TB/day AND need SQL analytics?
│ ├── YES → ClickHouse + Vector (columnar storage, fast aggregations)
│ └── NO ↓
└── DEFAULT → ELK Stack (most documentation, largest community)
METRICS BACKEND:
├── Already using Grafana?
│ ├── YES → Prometheus + Grafana (native integration)
│ └── NO ↓
├── Multi-cluster / global federation needed?
│ ├── YES → Thanos or Grafana Mimir (long-term, multi-cluster Prometheus)
│ └── NO ↓
└── DEFAULT → Prometheus (pull-based, CNCF graduated, industry standard)
TRACES BACKEND:
├── Want minimal infrastructure?
│ ├── YES → Grafana Tempo (object storage, no indexing required)
│ └── NO ↓
├── Need deep trace analytics and search?
│ ├── YES → Jaeger with Elasticsearch backend
│ └── NO ↓
└── DEFAULT → OTel Collector → Tempo (simplest path)
Establish which of the three pillars — logs, metrics, traces — you need from day one. Most production systems need all three. Define a consistent naming convention for metrics (service_name_operation_unit), log fields (timestamp, level, service, trace_id, message), and trace attributes (service.name, deployment.environment). [src1]
# OpenTelemetry resource attributes (define once per service)
resource:
  attributes:
    service.name: "payment-service"
    service.version: "2.1.0"
    deployment.environment: "production"
    service.namespace: "checkout"
Verify: All services emit service.name and deployment.environment in every telemetry signal.
Add the OpenTelemetry SDK to each service. Use auto-instrumentation for common frameworks (Express, Flask, Spring Boot) and add manual spans for business-critical paths. [src1]
# Python: install OpenTelemetry packages
pip install opentelemetry-api==1.29.0 \
opentelemetry-sdk==1.29.0 \
opentelemetry-exporter-otlp==1.29.0 \
opentelemetry-instrumentation-flask==0.50b0 \
opentelemetry-instrumentation-requests==0.50b0
# Node.js: install OpenTelemetry packages
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http
Verify: curl -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:4318/v1/traces returns 200 from the local OTel Collector (a plain GET returns 405, since the endpoint only accepts POST).
The OTel Collector acts as a vendor-neutral proxy that receives telemetry from all services, processes it (batching, sampling, enrichment), and exports to your chosen backends. Deploy as a DaemonSet in Kubernetes or a sidecar/gateway in VM environments. [src6]
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
exporters:
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
extensions:
  # health_check backs the :13133 readiness probe used in the Verify step
  health_check:
    endpoint: "0.0.0.0:13133"
service:
  extensions: [health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlphttp/loki]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/tempo]
Verify: curl -s http://localhost:13133/ returns {"status":"Server available"} from the Collector health check.
Choose between Elasticsearch (full-text search), Loki (label-indexed, low cost), or a managed service. Configure index lifecycle management (ILM) to auto-rotate and delete old indices. [src3] [src4]
// Elasticsearch ILM policy example
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
      "cold": { "min_age": "30d", "actions": { "searchable_snapshot": { "snapshot_repository": "s3-repo" } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}
Verify: curl -s http://localhost:9200/_ilm/policy/logs-policy | jq . shows the policy is active.
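If you chose Loki instead of Elasticsearch, the analogous retention control lives in the compactor rather than ILM. A sketch for Loki 3.x follows — key names and the filesystem delete store are assumptions worth verifying against your Loki version and storage backend:

```yaml
# Loki retention sketch (Loki 3.x, single-binary / filesystem storage)
limits_config:
  retention_period: 720h   # 30 days; 0 disables retention
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem   # required when retention_enabled is true
```

Unlike Elasticsearch's per-phase lifecycle, Loki applies a single retention window (optionally overridable per tenant or per stream selector).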
Deploy Prometheus with service discovery for your environment (Kubernetes annotations, Consul, or file-based targets). Define recording rules for pre-aggregation and alerting rules for SLO-based alerts. [src2]
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to use the port from the pod annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
Verify: curl http://localhost:9090/api/v1/targets shows all expected targets as "health": "up".
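The recording_rules.yml and alerting_rules.yml files referenced in the config can look like the sketch below. The rule names and the 1% error threshold are illustrative, and the expressions assume the http_requests_total counter with a status label used elsewhere in this guide:

```yaml
# recording_rules.yml — pre-aggregate RED metrics per job
groups:
  - name: service_red
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

# alerting_rules.yml — SLO-style alert on the recorded error ratio
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_request_errors:ratio_rate5m > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error ratio above 1% for {{ $labels.job }}"
```

Recording rules keep dashboard queries cheap; alerting on the recorded series also guarantees the alert and the dashboard agree on the same expression.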
Create dashboards following the RED method (Rate, Errors, Duration) for services and USE method (Utilization, Saturation, Errors) for infrastructure. Set up alert rules with meaningful thresholds and route to appropriate channels. [src2] [src5]
# Deploy Grafana with provisioned datasources
docker run -d --name grafana \
-p 3000:3000 \
-e GF_SECURITY_ADMIN_PASSWORD=changeme \
-v ./grafana/provisioning:/etc/grafana/provisioning \
grafana/grafana:11.4.0
# Verify datasource connectivity (quote the URL so the shell doesn't glob "?")
curl -u admin:changeme \
  "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up"
Verify: Grafana UI at http://localhost:3000 shows data sources as "connected" (green).
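The mounted ./grafana/provisioning directory is what makes the data sources appear pre-connected on first boot. A plausible provisioning file follows — the URLs assume the docker-compose service names used in this guide (prometheus, loki, tempo):

```yaml
# grafana/provisioning/datasources/datasources.yaml — sketch
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```

Provisioned data sources are read-only in the UI, which keeps environments reproducible across restarts.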
# Input: Application events during request handling
# Output: JSON log lines with trace_id, span_id, and structured fields
import logging

import structlog  # structlog==24.4.0
from opentelemetry import trace

# Configure structlog for JSON output with OTel context
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)
logger = structlog.get_logger()

def process_order(order_id: str, amount: float):
    """Log with automatic trace context propagation."""
    span = trace.get_current_span()
    ctx = span.get_span_context()
    # bind() returns a new logger — keep the bound instance and log through it
    log = logger.bind(
        trace_id=format(ctx.trace_id, "032x"),
        span_id=format(ctx.span_id, "016x"),
        order_id=order_id,
    )
    log.info("order_processing_started", amount=amount)
    try:
        # ... business logic ...
        log.info("order_processing_completed", amount=amount)
    except Exception as e:
        log.error("order_processing_failed",
                  error=str(e), error_type=type(e).__name__)
        raise
// Input: HTTP requests to an Express service
// Output: JSON log lines with trace context, request metadata
const pino = require("pino");
const { trace } = require("@opentelemetry/api");

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: ["req.headers.authorization", "body.password"],
});

function withTraceContext(log) {
  const span = trace.getActiveSpan();
  if (!span) return log;
  const ctx = span.spanContext();
  return log.child({
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}

// Express middleware example
app.use((req, res, next) => {
  req.log = withTraceContext(logger).child({
    method: req.method,
    path: req.url,
    request_id: req.headers["x-request-id"],
  });
  req.log.info("request_received");
  next();
});
# Input: HTTP request handling in a Flask/FastAPI service
# Output: Prometheus metrics exposed at /metrics endpoint
from prometheus_client import (  # prometheus-client==0.21.1
    Counter, Histogram, Gauge, start_http_server
)

# RED method metrics for services
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

# USE method metrics for resources
QUEUE_SIZE = Gauge(
    "task_queue_size",
    "Current number of tasks in the processing queue",
    ["queue_name"],
)

def track_request(method, endpoint, status, duration):
    """Record RED metrics for a completed request."""
    REQUEST_COUNT.labels(
        method=method, endpoint=endpoint, status=status
    ).inc()
    REQUEST_DURATION.labels(
        method=method, endpoint=endpoint
    ).observe(duration)

# Start metrics server on port 9090
start_http_server(9090)
# Input: Docker environment needing full observability
# Output: Running Loki + Prometheus + Tempo + Grafana + OTel Collector
version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "13133:13133" # Health check
  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
  prometheus:
    image: prom/prometheus:v3.1.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    ports:
      - "9090:9090"
  tempo:
    image: grafana/tempo:2.7.0
    command: ["-config.file=/etc/tempo/config.yaml"]
    volumes:
      - ./tempo-config.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"
  grafana:
    image: grafana/grafana:11.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
volumes:
  loki-data:
  prom-data:
  tempo-data:
  grafana-data:
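The compose file mounts a tempo-config.yaml that is not shown above. A minimal single-binary sketch with local storage follows — field names follow Tempo 2.x conventions and are worth double-checking against the version you deploy:

```yaml
# tempo-config.yaml — minimal single-binary sketch (local storage)
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
```

For production, swap backend: local for s3 or gcs, which is where Tempo's no-index design pays off.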
# BAD — unstructured text makes automated parsing impossible
import logging
logger = logging.getLogger(__name__)

def process_payment(user_id, amount):
    logger.info("Processing payment for user " + user_id + " amount: $" + str(amount))
    # Output: "Processing payment for user u-123 amount: $49.99"
    # Cannot query by user_id or amount without regex parsing

# GOOD — structured fields are indexable, queryable, and alertable
import structlog
logger = structlog.get_logger()

def process_payment(user_id: str, amount: float):
    logger.info("payment_processing",
                user_id=user_id, amount=amount, currency="USD")
    # Output: {"event":"payment_processing","user_id":"u-123","amount":49.99,"currency":"USD"}
    # Every field is independently queryable
// BAD — PII and secrets in plain text logs
logger.info("User login", {
  email: user.email,            // PII
  password: req.body.password,  // credential
  token: session.jwt,           // secret
  ssn: user.socialSecurity,     // regulated data
});

// GOOD — redact sensitive fields, log only safe identifiers
const pino = require("pino");
const logger = pino({
  redact: ["password", "token", "ssn", "*.authorization"],
});
// pino takes the merge object first, then the message
logger.info({
  user_id: user.id,        // safe identifier
  email: "[REDACTED]",     // or omit entirely
  ip: req.ip,              // may need consent in EU
  login_method: "password",
}, "User login");
# BAD — user_id as a Prometheus label creates millions of time series
# This will crash Prometheus or cause massive memory usage
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Requests",
    ["method", "endpoint", "user_id"],  # user_id = cardinality bomb
)

# GOOD — use only bounded-cardinality labels; track per-user in logs/traces
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Requests",
    ["method", "endpoint", "status"],  # all bounded enums
)
# For per-user analytics, use log fields or trace attributes instead
logger.info("request_completed", user_id=user_id, duration_ms=42)
# BAD — storing every single trace in production is extremely expensive
# A service handling 10K req/s generates ~864M spans/day
processors:
  # No sampling configured = 100% of traces stored

# GOOD — sample based on error status, latency, and a baseline rate
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
- Use Prometheus relabel_configs to drop or hash high-cardinality labels. [src2]
- Default to WARN or INFO per service; use dynamic log-level adjustment via feature flags or config reload. [src7]
- Propagate trace_id via OpenTelemetry context; inject it into all structured log fields automatically. [src1]
- Configure log retention in Loki (retention_period: 720h). [src4]
# Check OpenTelemetry Collector health
curl -s http://localhost:13133/ | jq .
# Verify Prometheus targets are being scraped
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Check Elasticsearch cluster health
curl -s http://localhost:9200/_cluster/health | jq '{status, number_of_nodes, active_shards}'
# Verify Loki is receiving logs
curl -s http://localhost:3100/ready
# Query recent logs from Loki via LogQL
curl -G -s http://localhost:3100/loki/api/v1/query_range \
--data-urlencode 'query={service_name="payment-service"} |= "error"' \
--data-urlencode 'limit=10' | jq .
# Check Grafana datasource connectivity
curl -u admin:changeme -s http://localhost:3000/api/datasources | jq '.[].name'
# Verify Tempo is receiving traces
curl -s http://localhost:3200/ready
# Check Prometheus storage TSDB stats
curl -s http://localhost:9090/api/v1/status/tsdb | jq '{headChunks: .data.headStats.numChunks, seriesCount: .data.headStats.numSeries}'
# Monitor OTel Collector pipeline metrics
curl -s http://localhost:8888/metrics | grep otelcol_receiver_accepted
| Component | Current Version | Status | Key Change |
|---|---|---|---|
| OpenTelemetry Collector | 0.96.x (2024) | Stable (logs GA since 2024-04) | Logs signal reached GA; unified pipeline for all three pillars |
| Prometheus | 3.x (2025+) | Current | Native OTLP ingestion, UTF-8 metric names, new UI |
| Elasticsearch | 8.x (2022+) | Current | License changed to SSPL+Elastic; Lucene 9; vector search |
| Grafana Loki | 3.x (2024+) | Current | OTLP native ingestion; structured metadata; bloom filters |
| Grafana Tempo | 2.x (2023+) | Current | TraceQL query language; vParquet4 format |
| Grafana | 11.x (2024+) | Current | Unified alerting; Explore Logs/Traces/Metrics |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Building microservices with >3 services that interact | Single monolith with <1K req/day | Simple file-based logging with logrotate |
| Need to correlate events across services during incidents | Static websites or JAMstack with no backend | CDN analytics (Cloudflare, Vercel) |
| Compliance requires audit logs with retention guarantees | Prototyping or hackathon with no uptime SLA | Console.log with local files |
| Operating Kubernetes clusters in production | Running a single Docker container locally | Docker logs command |
| Need SLO-based alerting with error budget tracking | Team has no on-call rotation or incident process | Simple uptime monitoring (Uptime Robot, Pingdom) |