Bring the stack up with `docker compose up -d`, using pinned image versions and named volumes for all stateful services.

| Service | Image | Ports | Volumes | Key Config |
|---|---|---|---|---|
| Prometheus | prom/prometheus:v3.10.0 | 9090:9090 | prometheus_data:/prometheus, ./prometheus.yml:/etc/prometheus/prometheus.yml | --storage.tsdb.retention.time=30d |
| Grafana | grafana/grafana:12.4.0 | 3000:3000 | grafana_data:/var/lib/grafana, ./grafana/provisioning:/etc/grafana/provisioning | GF_SECURITY_ADMIN_PASSWORD |
| Node Exporter | prom/node-exporter:v1.9.0 | 9100:9100 | /:/host:ro,rslave | --path.rootfs=/host, PID host |
| cAdvisor | gcr.io/cadvisor/cadvisor:v0.49.1 | 8080:8080 | /var/run:/var/run:ro, /sys:/sys:ro, /var/lib/docker:/var/lib/docker:ro | Privileged mounts required |
| Alertmanager | prom/alertmanager:v0.28.1 | 9093:9093 | alertmanager_data:/alertmanager, ./alertmanager.yml:/etc/alertmanager/alertmanager.yml | Route + receiver config |

| Endpoint | URL | Purpose |
|---|---|---|
| Prometheus UI | http://localhost:9090 | Query, targets, rules, TSDB status |
| Prometheus Targets | http://localhost:9090/targets | Scrape target health check |
| Grafana UI | http://localhost:3000 | Dashboards (default: admin / `GRAFANA_ADMIN_PASSWORD`, falls back to `changeme`) |
| Alertmanager UI | http://localhost:9093 | Alert status, silences |
| Node Exporter Metrics | http://localhost:9100/metrics | Raw host metrics |
| cAdvisor UI | http://localhost:8080 | Container metrics explorer |
```text
START: What monitoring do you need?
├── Host metrics only (CPU, RAM, disk, network)?
│   ├── YES → Prometheus + Node Exporter + Grafana (skip cAdvisor)
│   └── NO ↓
├── Docker container metrics only?
│   ├── YES → Prometheus + cAdvisor + Grafana (skip Node Exporter)
│   └── NO ↓
├── Both host + container metrics?
│   ├── YES → Full stack: Prometheus + Node Exporter + cAdvisor + Grafana
│   └── NO ↓
├── Need alerting (email, Slack, PagerDuty)?
│   ├── YES → Add Alertmanager service + alert rules
│   └── NO → Skip Alertmanager
├── Running on Kubernetes?
│   ├── YES → Use kube-prometheus-stack Helm chart instead
│   └── NO ↓
└── DEFAULT → Full stack with all 5 services
```
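If you only need a subset of the stack, Compose profiles let you toggle optional services from the same file. A sketch, assuming the full docker-compose.yml below; the profile names are illustrative:

```yaml
# docker-compose.yml (excerpt) -- assign optional services to profiles
services:
  cadvisor:
    profiles: ["containers"]   # started only with --profile containers
  alertmanager:
    profiles: ["alerting"]     # started only with --profile alerting

# Host-only stack:           docker compose up -d
# With container metrics:    docker compose --profile containers up -d
# Everything:                docker compose --profile containers --profile alerting up -d
```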
Organize configuration files into service-specific directories for clarity. [src5]
```bash
mkdir -p monitoring/{prometheus,grafana/provisioning/datasources,grafana/provisioning/dashboards,alertmanager}
cd monitoring
```

Verify: `find . -type d` → should list all subdirectories.
Define all services with pinned versions, named volumes, health checks, and a shared network. [src1]
```yaml
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v3.10.0
    container_name: prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
      - "--storage.tsdb.wal-compression"
    volumes:
      - prometheus_data:/prometheus
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules.yml:/etc/prometheus/rules.yml:ro
    ports:
      - "9090:9090"
    restart: unless-stopped
    networks:
      - monitoring
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3

  grafana:
    image: grafana/grafana:12.4.0
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-changeme}
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "3000:3000"
    restart: unless-stopped
    networks:
      - monitoring
    depends_on:
      prometheus:
        condition: service_healthy

  node-exporter:
    image: prom/node-exporter:v1.9.0
    container_name: node-exporter
    command:
      - "--path.rootfs=/host"
    volumes:
      - "/:/host:ro,rslave"
    ports:
      - "9100:9100"
    pid: host
    restart: unless-stopped
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    devices:
      - /dev/kmsg
    privileged: true
    restart: unless-stopped
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.28.1
    container_name: alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    volumes:
      - alertmanager_data:/alertmanager
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
    restart: unless-stopped
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
```
Verify: `docker compose config` → should print the resolved YAML with no errors.
Define scrape jobs for all exporters. Use Docker DNS names (service names resolve automatically within the compose network). [src1]
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "cadvisor"
    scrape_interval: 10s
    static_configs:
      - targets: ["cadvisor:8080"]
```
Verify: After starting, visit http://localhost:9090/targets → all targets should show UP.
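The targets check can also be scripted against the `/api/v1/targets` API. A small sketch that flags unhealthy scrape targets; the sample payload below is illustrative (fetch the real one with `curl -s http://localhost:9090/api/v1/targets`):

```python
def unhealthy_targets(payload: dict) -> list:
    """Return scrapeUrls whose health is not 'up' from a /api/v1/targets response."""
    return [
        t["scrapeUrl"]
        for t in payload.get("data", {}).get("activeTargets", [])
        if t.get("health") != "up"
    ]

# Illustrative response shape -- the real API returns more fields per target
sample = {
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://node-exporter:9100/metrics", "health": "up"},
            {"scrapeUrl": "http://cadvisor:8080/metrics", "health": "down"},
        ]
    }
}
print(unhealthy_targets(sample))  # → ['http://cadvisor:8080/metrics']
```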
Define alert conditions for common failure scenarios. [src5]
```yaml
# prometheus/rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
        for: 10m
        labels:
          severity: critical
  - name: container_alerts
    rules:
      - alert: ContainerHighCPU
        expr: rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
```
Verify: http://localhost:9090/rules → rules should appear as loaded.
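Alert expressions can also be unit-tested offline with `promtool test rules`. A minimal sketch for `HighMemoryUsage`; the series values and instance label are illustrative:

```yaml
# prometheus/rules_test.yml -- run with: promtool test rules rules_test.yml
rule_files:
  - rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 10 GB available out of 100 GB total -> 90% used, above the 85% threshold
      - series: 'node_memory_MemAvailable_bytes{instance="host1"}'
        values: "10000000000x15"
      - series: 'node_memory_MemTotal_bytes{instance="host1"}'
        values: "100000000000x15"
    alert_rule_test:
      - eval_time: 10m
        alertname: HighMemoryUsage
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: host1
```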
Set up routing and receivers for notifications. [src5]
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"

receivers:
  - name: "default"
    webhook_configs:
      - url: "http://example.com/webhook"
        send_resolved: true
```
Verify: http://localhost:9093/#/status → config should be loaded.
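To route critical alerts to Slack, add a receiver and a matching child route. A sketch; the webhook URL and channel name are placeholders:

```yaml
# alertmanager/alertmanager.yml (excerpt)
route:
  receiver: "default"
  routes:
    - matchers:
        - severity="critical"
      receiver: "slack-critical"

receivers:
  - name: "slack-critical"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
        channel: "#alerts"
        send_resolved: true
```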
Use Grafana provisioning to auto-configure Prometheus as a datasource. [src3]
```yaml
# grafana/provisioning/datasources/datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

```yaml
# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false
```
Verify: Log into Grafana at http://localhost:3000 → Configuration > Data Sources → Prometheus should appear.
Launch all services and confirm everything is healthy. [src5]
```bash
# Start all services
docker compose up -d

# Check that all containers are running
docker compose ps

# View logs for any errors
docker compose logs --tail=50
```

Verify: `docker compose ps` → all 5 services should show running.
Save a starter dashboard into the provisioned directory, e.g. `grafana/provisioning/dashboards/node-overview.json`. File-provisioned dashboards use the dashboard model at the top level, without the `{"dashboard": ...}` wrapper the HTTP API expects:

```json
{
  "title": "Node Exporter - Host Overview",
  "uid": "node-exporter-host",
  "panels": [
    {
      "title": "CPU Usage %",
      "type": "timeseries",
      "targets": [
        {
          "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "CPU %"
        }
      ],
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
    },
    {
      "title": "Memory Usage %",
      "type": "timeseries",
      "targets": [
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
          "legendFormat": "Memory %"
        }
      ],
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
    }
  ],
  "time": {"from": "now-1h", "to": "now"},
  "refresh": "30s"
}
```
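Grafana lays panels on a 24-column grid. When generating dashboard JSON programmatically, a small helper can compute `gridPos` values like the ones above (a sketch, not a Grafana API):

```python
def grid_positions(n_panels: int, width: int = 12, height: int = 8) -> list:
    """Assign gridPos for n panels, filling left to right on Grafana's 24-column grid."""
    per_row = 24 // width  # panels that fit side by side
    return [
        {"h": height, "w": width, "x": (i % per_row) * width, "y": (i // per_row) * height}
        for i in range(n_panels)
    ]

print(grid_positions(3))
# → [{'h': 8, 'w': 12, 'x': 0, 'y': 0}, {'h': 8, 'w': 12, 'x': 12, 'y': 0},
#    {'h': 8, 'w': 12, 'x': 0, 'y': 8}]
```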
Precompute frequently queried expressions with recording rules. Remember to list the file under `rule_files` in prometheus.yml and mount it alongside rules.yml:

```yaml
# prometheus/recording-rules.yml
groups:
  - name: node_recording_rules
    interval: 15s
    rules:
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:node_memory_utilization:ratio
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      - record: instance:node_disk_utilization:ratio
        expr: 1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
```
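The recorded series can then shorten alert expressions; for example, the `HighCPUUsage` rule above could be rewritten against the precomputed ratio (a sketch):

```yaml
# prometheus/rules.yml (excerpt) -- same threshold, simpler expression
- alert: HighCPUUsage
  expr: instance:node_cpu_utilization:ratio * 100 > 80
  for: 5m
  labels:
    severity: warning
```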
```yaml
# BAD -- unpinned versions cause silent breaking changes
services:
  prometheus:
    image: prom/prometheus:latest
  grafana:
    image: grafana/grafana:latest
  node-exporter:
    image: prom/node-exporter:latest

# GOOD -- predictable, reproducible deployments
services:
  prometheus:
    image: prom/prometheus:v3.10.0
  grafana:
    image: grafana/grafana:12.4.0
  node-exporter:
    image: prom/node-exporter:v1.9.0
```

```yaml
# BAD -- all metrics data lost on container restart
services:
  prometheus:
    image: prom/prometheus:v3.10.0
    # No volumes defined

# GOOD -- data survives container restarts and upgrades
services:
  prometheus:
    image: prom/prometheus:v3.10.0
    volumes:
      - prometheus_data:/prometheus
volumes:
  prometheus_data:
```

```yaml
# BAD -- anyone can query/delete metrics
services:
  prometheus:
    ports:
      - "0.0.0.0:9090:9090"

# GOOD -- only accessible locally
services:
  prometheus:
    ports:
      - "127.0.0.1:9090:9090"
```

```yaml
# BAD -- incomplete host metrics
services:
  node-exporter:
    image: prom/node-exporter:v1.9.0
    # Missing pid: host and /:/host volume

# GOOD -- accurate host metrics with rootfs remapping
services:
  node-exporter:
    image: prom/node-exporter:v1.9.0
    command:
      - "--path.rootfs=/host"
    volumes:
      - "/:/host:ro,rslave"
    pid: host
```

```yaml
# BAD -- password committed to version control
services:
  grafana:
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=mysecretpassword

# GOOD -- password loaded from .env file (excluded from git)
services:
  grafana:
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
```
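Docker Compose reads the variable from a `.env` file next to docker-compose.yml; keep that file out of version control. A sketch (the value is a placeholder):

```bash
# .env -- add this file to .gitignore
GRAFANA_ADMIN_PASSWORD=use-a-long-random-password
```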
- Node Exporter target DOWN: the scrape target must be `node-exporter:9100`, not `node_exporter:9100` or `localhost:9100`. Fix: check `docker compose ps` for exact service names. [src5]
- Grafana cannot reach Prometheus: the datasource URL must be `http://prometheus:9090`, not `http://localhost:9090`. Fix: use the service name in the datasource URL. [src3]
- cAdvisor fails with "Failed to start container manager" on Linux 6.x+ kernels. Fix: add the `--docker_only=true` flag or update to cAdvisor v0.49+. [src2]
- Prometheus storage growing too fast: use `metric_relabel_configs` to drop high-cardinality labels, or set `--storage.tsdb.retention.size=5GB`. [src1]
- Alert rules not loading: the `rule_files` path in prometheus.yml must match the mounted path inside the container. Fix: verify mount paths match `rule_files` paths. [src5]
- Provisioned datasource cannot be edited: set `editable: true` in the datasource config, or export and re-provision. [src3]
- Node Exporter reports container metrics instead of host metrics: missing `--path.rootfs=/host` flag or volume mount. Fix: ensure both the volume mount and the flag are set. [src6]
- Prometheus cannot reach Alertmanager: target `alertmanager:9093`, not `localhost:9093`. Fix: use Docker service names. [src5]

```bash
# Check all service health
docker compose ps

# View Prometheus targets status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {scrapeUrl, health, lastError}'

# Test Prometheus config validity
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Validate alert rules
docker compose exec prometheus promtool check rules /etc/prometheus/rules.yml

# Check Grafana datasource connectivity
curl -s -u admin:changeme http://localhost:3000/api/datasources | jq '.[].name'

# Test Alertmanager config
docker compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

# Query a specific metric directly
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result'

# Reload Prometheus config without restart (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Check container resource usage
docker stats --no-stream
```
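To size `--storage.tsdb.retention.time` against available disk, a rough capacity estimate is ingested samples times bytes per sample. A sketch; the 1-2 bytes/sample figure is Prometheus's usual rule of thumb, and the series counts below are illustrative:

```python
def estimated_disk_bytes(active_series: int, scrape_interval_s: float,
                         retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB disk estimate: samples ingested per second x bytes per sample x retention."""
    samples_per_sec = active_series / scrape_interval_s
    return samples_per_sec * bytes_per_sample * retention_days * 86400

# e.g. 50k active series scraped every 15s, kept for 30 days
print(round(estimated_disk_bytes(50_000, 15, 30) / 1e9, 1), "GB")  # → 17.3 GB
```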
| Component | Version | Status | Breaking Changes | Notes |
|---|---|---|---|---|
| Prometheus | 3.10.0 | Current | 3.0 removed deprecated flags, UTF-8 metric names | LTS: 3.5.1 |
| Prometheus | 2.54.x | LTS until 2025-07 | -- | Last 2.x LTS |
| Grafana | 12.4.0 | Current | 12.0 changed auth defaults | Unified alerting is default |
| Node Exporter | 1.9.0 | Current | None | -- |
| cAdvisor | 0.49.1 | Current | 0.47+ requires Linux 5.4+ | Google-maintained |
| Alertmanager | 0.28.1 | Current | 0.27 removed v1 API | v2 API only |
| Docker Compose | v2.x | Current | v1 syntax deprecated | Built into Docker CLI |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Self-hosted monitoring on VMs or bare metal with Docker | Running on Kubernetes | kube-prometheus-stack Helm chart |
| Need full control over retention, scraping, alerting | Want managed monitoring with zero ops | Grafana Cloud, Datadog, or AWS CloudWatch |
| Dev/staging environment monitoring | Monitoring 1000+ nodes | Thanos or Cortex for horizontal scaling |
| Docker Compose is already your deployment tool | Need log aggregation (not metrics) | Loki + Grafana or ELK stack |
| Budget-conscious -- all components free and open-source | Need APM/tracing | OpenTelemetry + Jaeger or commercial APM |
cAdvisor's `privileged: true` setting and host volume mounts give it broad access to the host; in multi-tenant environments, evaluate the security risk before deploying it.