Grafana + Prometheus — Full Observability
The complete metrics stack for when you outgrow lightweight tools. Custom dashboards, ad-hoc queries, and fleet-wide correlation.
Prerequisites: Dedicated RamNode VPS, Docker & Docker Compose
Time: 40–50 minutes
RAM: 400–600 MB for the full stack on a dedicated monitoring server
The tools from Parts 1–3 serve most VPS operators well for a long time. But there is a category of problem that lightweight monitors cannot solve: you need to correlate metrics across dimensions you did not anticipate when you set up monitoring.
This is the heavy end of the series. The full stack uses roughly 400–600 MB of RAM on a dedicated monitoring server. Do not run it on the same VPS as your production workloads.
Stack Components
| Component | Role | RAM usage |
|---|---|---|
| Prometheus | Metric scraping and storage | 200–400 MB |
| Grafana | Dashboards and visualization | 100–150 MB |
| Node Exporter | Host metrics (CPU, memory, disk, network) | <10 MB per server |
| cAdvisor | Docker container metrics | ~50 MB per server |
| Alertmanager | Alert routing and deduplication | ~20 MB |
Architecture
Monitoring Server (dedicated RamNode VPS)
- Prometheus (scrapes all exporters)
- Grafana (reads from Prometheus)
- Alertmanager (receives alerts from Prometheus, routes to channels)
Each Production VPS
- Node Exporter :9100 (host metrics)
- cAdvisor :8080 (container metrics)

Prometheus uses a pull model — it reaches out to each exporter on a schedule and scrapes its metrics endpoint. Each exporter must be network-reachable from the Prometheus server.
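A scrape is just an HTTP GET that returns plain text in the Prometheus exposition format. As a rough illustration of what Prometheus does with the response, here is a minimal Python sketch that parses a sample payload (the sample values are invented, and real payloads carry more metadata):

```python
# Illustrative only: a tiny parser for the Prometheus text exposition
# format, the payload an exporter serves and Prometheus scrapes.
# The SAMPLE payload below is made up for this example.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_memory_MemAvailable_bytes 2.1e+09
"""

def parse_metrics(payload: str) -> dict[str, float]:
    """Map each sample line (metric name plus labels) to its float value."""
    samples: dict[str, float] = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE metadata and blank lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

print(parse_metrics(SAMPLE)["node_memory_MemAvailable_bytes"])  # 2100000000.0
```

Every metric you query later in Grafana arrives this way: a flat list of name/label/value samples, collected on every scrape interval.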
Deploy Exporters on Each Production Server
Step 1 — Exporter Compose file
Save this as /opt/exporters/docker-compose.yml on each production server:

```yaml
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
    network_mode: host
    pid: host
    volumes:
      - '/:/host:ro,rslave'

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
```

Bring the exporters up:

```bash
cd /opt/exporters && docker compose up -d
```

Step 2 — Restrict exporter access by IP
```bash
# Allow only the monitoring server IP
ufw allow from MONITORING_SERVER_IP to any port 9100
ufw allow from MONITORING_SERVER_IP to any port 8080

# Deny everyone else
ufw deny 9100
ufw deny 8080
```

Verify Node Exporter is running:

```bash
curl http://localhost:9100/metrics | head -20
```

Deploy the Monitoring Stack
Step 1 — Directory structure
```bash
mkdir -p /opt/monitoring/{prometheus,grafana,alertmanager}
cd /opt/monitoring
```

Step 2 — Prometheus configuration
Save this as prometheus/prometheus.yml (the Compose file in Step 5 mounts ./prometheus into the container):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "PRODUCTION_VPS_1_IP:9100"
          - "PRODUCTION_VPS_2_IP:9100"
          - "PRODUCTION_VPS_3_IP:9100"
        labels:
          environment: "production"

  - job_name: "cadvisor"
    static_configs:
      - targets:
          - "PRODUCTION_VPS_1_IP:8080"
          - "PRODUCTION_VPS_2_IP:8080"
          - "PRODUCTION_VPS_3_IP:8080"
        labels:
          environment: "production"
```

Step 3 — Alert rules
Save this as prometheus/alert-rules.yml (the path is relative to the Prometheus config file):

```yaml
groups:
  - name: host-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceCritical
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"

      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node exporter unreachable: {{ $labels.instance }}"
```

Step 4 — Alertmanager configuration
Save this as alertmanager/alertmanager.yml. Note that native Discord support requires Alertmanager v0.25 or newer.

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: "discord-notifications"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "discord-critical"
      repeat_interval: 1h

receivers:
  - name: "discord-notifications"
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_URL"
  - name: "discord-critical"
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/YOUR_CRITICAL_WEBHOOK_URL"

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["instance"]
```

Step 5 — Full stack Compose file
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_DOMAIN: "grafana.yourdomain.com"
      GF_SERVER_ROOT_URL: "https://grafana.yourdomain.com"
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    volumes:
      - ./alertmanager:/etc/alertmanager
      - alertmanager-data:/alertmanager
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:

networks:
  monitoring:
    driver: bridge
```

All services are bound to 127.0.0.1 only — access goes through your reverse proxy.
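If nginx is your reverse proxy, a server block along these lines forwards HTTPS traffic to the loopback-bound Grafana port. This is a sketch: the domain and certificate paths are placeholders for your own setup.

```nginx
# Hypothetical reverse-proxy block for Grafana on 127.0.0.1:3000.
server {
    listen 443 ssl;
    server_name grafana.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/grafana.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grafana.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Grafana Live features use WebSockets
    location /api/live/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```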
Generate a Grafana admin password and start the stack:

```bash
echo "GRAFANA_PASSWORD=$(openssl rand -base64 24)" > /opt/monitoring/.env
cd /opt/monitoring && docker compose up -d
```

Grafana Setup
Add Prometheus as a data source
1. Log into Grafana (password from your .env file)
2. Go to Connections > Data Sources > Add data source
3. Select Prometheus
4. URL: http://prometheus:9090
5. Click Save & test
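As an alternative to clicking through the UI, Grafana can provision the data source declaratively. A sketch: save something like this on the monitoring server and mount it into the Grafana container at /etc/grafana/provisioning/datasources/ (an extra volume line in the Compose file), then restart Grafana.

```yaml
# Hypothetical provisioning file, e.g. grafana/provisioning/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Provisioned data sources survive container rebuilds without manual reconfiguration.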
Import community dashboards
- Node Exporter Full (ID: 1860) — standard host metrics dashboard with CPU, memory, disk, network
- Docker and cAdvisor (ID: 893) — per-container CPU, memory, and network usage
- Node Exporter for Prometheus (ID: 11074) — cleaner, more modern layout
Import each via Dashboards > Import: enter the dashboard ID and select your Prometheus data source.
Build a custom fleet overview dashboard
Create a new dashboard and add panels with these PromQL queries:
Memory usage (%):

```
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
```

CPU usage (%):

```
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Root filesystem usage (%):

```
100 * (1 - (node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}))
```

PromQL Basics
A few patterns cover most use cases:
Rate of increase over time

```
rate(metric_name[5m])
```

Aggregation across instances

```
avg by(instance) (metric)
sum by(instance) (metric)
max by(instance) (metric)
```

Filter by label

```
metric{label="value"}
metric{label!="value"}
metric{label=~"regex"}
```

Combine metrics

```
# Total network I/O across all interfaces, all instances
sum by(instance) (rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m]))
```

Retention & Storage Planning
Rough estimate: each Node Exporter target at 15-second scrape intervals uses approximately 50–100 MB of Prometheus storage per month.
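That estimate generalizes to a quick back-of-envelope calculator. A sketch, where the default of 75 MB per target per month is an assumed midpoint of the 50–100 MB range; measure your actual usage before relying on it:

```python
# Back-of-envelope Prometheus disk estimator.
# mb_per_target_month=75.0 is an assumed midpoint of the 50-100 MB
# range quoted above for 15s scrape intervals.
def estimate_storage_mb(targets: int, retention_days: int,
                        mb_per_target_month: float = 75.0) -> float:
    """Rough total TSDB size in MB for N targets over the retention window."""
    return targets * mb_per_target_month * (retention_days / 30)

print(estimate_storage_mb(5, 30))  # 375.0
print(estimate_storage_mb(5, 90))  # 1125.0
```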
```
5 servers × ~75 MB/month = ~375 MB
```

Adjust retention via the --storage.tsdb.retention.time flag. Common values: 15d, 30d, 90d, 1y.

Check actual usage on disk:

```bash
du -sh /var/lib/docker/volumes/monitoring_prometheus-data
```

When to Use This vs. Lighter Tools
Stick with Beszel + Uptime Kuma/Gatus
- Running 1–5 servers with standard workloads
- Need to know when something is broken, not why
- Total monitoring budget is a single low-cost VPS
- Don't need custom dashboards or ad-hoc queries
Add Prometheus + Grafana
- Debugging performance problems requiring metric correlation
- Need application-level metrics from your own code
- Team needs shared dashboards for postmortems
- Running 10+ servers and need trend analysis
There is no reason you cannot run both. Many operators run Beszel permanently and spin up the full Prometheus stack only when debugging specific issues.
What's Next
You now have four monitoring tools configured. All of them can send alerts, but alerts are only useful if they reach the right people at the right time.
Part 5 covers:
- PagerDuty escalation policies for on-call routing
- Alertmanager grouping and inhibition rules
- Alert fatigue prevention strategies
- Runbooks and lightweight incident response process
