Grafana + Prometheus — Full Observability
The complete metrics stack for when you outgrow lightweight tools. Custom dashboards, ad-hoc queries, and fleet-wide correlation.
Prerequisites: Dedicated RamNode VPS, Docker & Docker Compose
Time: 40–50 minutes
RAM: 400–600 MB for the full stack on a dedicated monitoring server
The tools from Parts 1–3 serve most VPS operators well for a long time. But there is a category of problem that lightweight monitors cannot solve: you need to correlate metrics across dimensions you did not anticipate when you set up monitoring.
This is the heavy end of the series. The full stack uses roughly 400–600 MB of RAM on a dedicated monitoring server. Do not run it on the same VPS as your production workloads.
Stack Components
| Component | Role | RAM usage |
|---|---|---|
| Prometheus | Metric scraping and storage | 200–400 MB |
| Grafana | Dashboards and visualization | 100–150 MB |
| Node Exporter | Host metrics (CPU, memory, disk, network) | <10 MB per server |
| cAdvisor | Docker container metrics | ~50 MB per server |
| Alertmanager | Alert routing and deduplication | ~20 MB |
Architecture
Monitoring Server (dedicated RamNode VPS)
- Prometheus (scrapes all exporters)
- Grafana (reads from Prometheus)
- Alertmanager (receives alerts from Prometheus, routes to channels)
Each Production VPS
- Node Exporter :9100 (host metrics)
- cAdvisor :8080 (container metrics)

Prometheus uses a pull model — it reaches out to each exporter on a schedule and scrapes its metrics endpoint. Each exporter must be network-reachable from the Prometheus server.
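A scrape is just an HTTP GET that returns plain text in the Prometheus exposition format. As a rough illustration of what Prometheus does with the response, here is a minimal Python sketch that parses a sample payload (the sample values are invented, and real payloads carry more metadata):

```python
# Illustrative only: a tiny parser for the Prometheus text exposition
# format, the payload an exporter serves and Prometheus scrapes.
# The SAMPLE payload below is made up for this example.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_memory_MemAvailable_bytes 2.1e+09
"""

def parse_metrics(payload: str) -> dict[str, float]:
    """Map each sample line (metric name plus labels) to its float value."""
    samples: dict[str, float] = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE metadata and blank lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

print(parse_metrics(SAMPLE)["node_memory_MemAvailable_bytes"])  # 2100000000.0
```

Every metric you query later in Grafana arrives this way: a flat list of name/label/value samples, collected on every scrape interval.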
Deploy Exporters on Each Production Server
Step 1 — Exporter Compose file
Save this as /opt/exporters/docker-compose.yml on each production server:

```yaml
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
    network_mode: host
    pid: host
    volumes:
      - '/:/host:ro,rslave'

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
```

Bring the exporters up:

```bash
cd /opt/exporters && docker compose up -d
```

Step 2 — Restrict exporter access by IP
```bash
# Allow only the monitoring server IP
ufw allow from MONITORING_SERVER_IP to any port 9100
ufw allow from MONITORING_SERVER_IP to any port 8080

# Deny everyone else
ufw deny 9100
ufw deny 8080
```

Verify Node Exporter is running:

```bash
curl http://localhost:9100/metrics | head -20
```

Deploy the Monitoring Stack
Step 1 — Directory structure
```bash
mkdir -p /opt/monitoring/{prometheus,grafana,alertmanager}
cd /opt/monitoring
```

Step 2 — Prometheus configuration
Save this as prometheus/prometheus.yml (the Compose file in Step 5 mounts ./prometheus into the container):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "PRODUCTION_VPS_1_IP:9100"
          - "PRODUCTION_VPS_2_IP:9100"
          - "PRODUCTION_VPS_3_IP:9100"
        labels:
          environment: "production"

  - job_name: "cadvisor"
    static_configs:
      - targets:
          - "PRODUCTION_VPS_1_IP:8080"
          - "PRODUCTION_VPS_2_IP:8080"
          - "PRODUCTION_VPS_3_IP:8080"
        labels:
          environment: "production"
```

Step 3 — Alert rules
Save this as prometheus/alert-rules.yml (the path is relative to the Prometheus config file):

```yaml
groups:
  - name: host-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceCritical
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"

      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node exporter unreachable: {{ $labels.instance }}"
```

Step 4 — Alertmanager configuration
Save this as alertmanager/alertmanager.yml. Note that native Discord support requires Alertmanager v0.25 or newer.

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: "discord-notifications"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "discord-critical"
      repeat_interval: 1h

receivers:
  - name: "discord-notifications"
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_URL"
  - name: "discord-critical"
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/YOUR_CRITICAL_WEBHOOK_URL"

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["instance"]
```

Step 5 — Full stack Compose file
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_DOMAIN: "grafana.yourdomain.com"
      GF_SERVER_ROOT_URL: "https://grafana.yourdomain.com"
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    volumes:
      - ./alertmanager:/etc/alertmanager
      - alertmanager-data:/alertmanager
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:

networks:
  monitoring:
    driver: bridge
```

All services are bound to 127.0.0.1 only — access goes through your reverse proxy.
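If nginx is your reverse proxy, a server block along these lines forwards HTTPS traffic to the loopback-bound Grafana port. This is a sketch: the domain and certificate paths are placeholders for your own setup.

```nginx
# Hypothetical reverse-proxy block for Grafana on 127.0.0.1:3000.
server {
    listen 443 ssl;
    server_name grafana.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/grafana.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grafana.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Grafana Live features use WebSockets
    location /api/live/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```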
Generate a Grafana admin password and start the stack:

```bash
echo "GRAFANA_PASSWORD=$(openssl rand -base64 24)" > /opt/monitoring/.env
cd /opt/monitoring && docker compose up -d
```

Grafana Setup
Add Prometheus as a data source
1. Log into Grafana (password from your .env file)
2. Go to Connections > Data Sources > Add data source
3. Select Prometheus
4. URL: http://prometheus:9090
5. Click Save & test
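As an alternative to clicking through the UI, Grafana can provision the data source declaratively. A sketch: save something like this on the monitoring server and mount it into the Grafana container at /etc/grafana/provisioning/datasources/ (an extra volume line in the Compose file), then restart Grafana.

```yaml
# Hypothetical provisioning file, e.g. grafana/provisioning/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Provisioned data sources survive container rebuilds without manual reconfiguration.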
Import community dashboards
- Node Exporter Full (ID: 1860) — standard host metrics dashboard with CPU, memory, disk, network
- Docker and cAdvisor (ID: 893) — per-container CPU, memory, and network usage
- Node Exporter for Prometheus (ID: 11074) — cleaner, more modern layout
Import each via Dashboards > Import: enter the dashboard ID and select your Prometheus data source.
Build a custom fleet overview dashboard
Create a new dashboard and add panels with these PromQL queries:
Memory usage (%):

```
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
```

CPU usage (%):

```
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Root filesystem usage (%):

```
100 * (1 - (node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}))
```

PromQL Basics
A few patterns cover most use cases:
Rate of increase over time

```
rate(metric_name[5m])
```

Aggregation across instances

```
avg by(instance) (metric)
sum by(instance) (metric)
max by(instance) (metric)
```

Filter by label

```
metric{label="value"}
metric{label!="value"}
metric{label=~"regex"}
```

Combine metrics

```
# Total network I/O across all interfaces, all instances
sum by(instance) (rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m]))
```

Retention & Storage Planning
Rough estimate: each Node Exporter target at 15-second scrape intervals uses approximately 50–100 MB of Prometheus storage per month.
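That estimate generalizes to a quick back-of-envelope calculator. A sketch, where the default of 75 MB per target per month is an assumed midpoint of the 50–100 MB range; measure your actual usage before relying on it:

```python
# Back-of-envelope Prometheus disk estimator.
# mb_per_target_month=75.0 is an assumed midpoint of the 50-100 MB
# range quoted above for 15s scrape intervals.
def estimate_storage_mb(targets: int, retention_days: int,
                        mb_per_target_month: float = 75.0) -> float:
    """Rough total TSDB size in MB for N targets over the retention window."""
    return targets * mb_per_target_month * (retention_days / 30)

print(estimate_storage_mb(5, 30))  # 375.0
print(estimate_storage_mb(5, 90))  # 1125.0
```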
```
5 servers × ~75 MB/month = ~375 MB
```

Adjust retention via the --storage.tsdb.retention.time flag. Common values: 15d, 30d, 90d, 1y.

Check actual usage on disk:

```bash
du -sh /var/lib/docker/volumes/monitoring_prometheus-data
```

When to Use This vs. Lighter Tools
Stick with Beszel + Uptime Kuma/Gatus
- Running 1–5 servers with standard workloads
- Need to know when something is broken, not why
- Total monitoring budget is a single low-cost VPS
- Don't need custom dashboards or ad-hoc queries
Add Prometheus + Grafana
- Debugging performance problems requiring metric correlation
- Need application-level metrics from your own code
- Team needs shared dashboards for postmortems
- Running 10+ servers and need trend analysis
There is no reason you cannot run both. Many operators run Beszel permanently and spin up the full Prometheus stack only when debugging specific issues.
What's Next
You now have four monitoring tools configured. All of them can send alerts, but alerts are only useful if they reach the right people at the right time.
Part 5 covers:
- PagerDuty escalation policies for on-call routing
- Alertmanager grouping and inhibition rules
- Alert fatigue prevention strategies
- Runbooks and lightweight incident response process
