VPS Monitoring & Observability Series
    Part 4 of 5

    Grafana + Prometheus — Full Observability

    The complete metrics stack for when you outgrow lightweight tools. Custom dashboards, ad-hoc queries, and fleet-wide correlation.

    Prerequisites

    Dedicated RamNode VPS, Docker & Docker Compose

    Time to Complete

    40–50 minutes

    RAM Overhead

    400–600 MB for the full stack on a dedicated monitoring server

    The tools from Parts 1–3 serve most VPS operators well for a long time. But there is a category of problem that lightweight monitors cannot solve: you need to correlate metrics across dimensions you did not anticipate when you set up monitoring.

    This is the heavy end of the series. The full stack uses roughly 400–600 MB of RAM on a dedicated monitoring server. Do not run it on the same VPS as your production workloads.

    Stack Components

    Component        Role                                         RAM usage
    Prometheus       Metric scraping and storage                  200–400 MB
    Grafana          Dashboards and visualization                 100–150 MB
    Node Exporter    Host metrics (CPU, memory, disk, network)    <10 MB per server
    cAdvisor         Docker container metrics                     ~50 MB per server
    Alertmanager     Alert routing and deduplication              ~20 MB

    Architecture

    Monitoring Server (dedicated RamNode VPS)
      - Prometheus (scrapes all exporters)
      - Grafana (reads from Prometheus)
      - Alertmanager (receives alerts from Prometheus, routes to channels)
    
    Each Production VPS
      - Node Exporter :9100 (host metrics)
      - cAdvisor :8080 (container metrics)

    Prometheus uses a pull model — it reaches out to each exporter on a schedule and scrapes the metrics endpoint. Each exporter needs to be network-reachable from the Prometheus server.
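What Prometheus pulls is plain text in the exposition format: one `metric{labels} value` line per sample. As a sketch of what a scrape response contains (the metric line below is a made-up sample, not real scrape output), the raw value can be read with standard tools:

```shell
# A single line of Prometheus exposition format (sample value is made up)
SAMPLE='node_memory_MemAvailable_bytes 536870912'

# Field 2 is the raw gauge value in bytes; convert to MiB for readability
echo "$SAMPLE" | awk '{ printf "%.0f MiB\n", $2 / 1048576 }'   # prints "512 MiB"
```

This is also why the verification step later in the guide is just curl: there is no agent protocol to debug, only an HTTP endpoint returning text.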

    Deploy Exporters on Each Production Server

    Step 1 — Exporter Compose file

    /opt/exporters/docker-compose.yml
    services:
      node-exporter:
        image: prom/node-exporter:latest
        container_name: node-exporter
        restart: unless-stopped
        command:
          - '--path.rootfs=/host'
          - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
        network_mode: host
        pid: host
        volumes:
          - '/:/host:ro,rslave'
    
      cadvisor:
        image: gcr.io/cadvisor/cadvisor:latest
        container_name: cadvisor
        restart: unless-stopped
        privileged: true
        ports:
          - "8080:8080"
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro
          - /dev/disk/:/dev/disk:ro
        devices:
          - /dev/kmsg
    Bring the exporters up:

    cd /opt/exporters && docker compose up -d

    Step 2 — Restrict exporter access by IP

    # Allow only the monitoring server IP
    ufw allow from MONITORING_SERVER_IP to any port 9100
    ufw allow from MONITORING_SERVER_IP to any port 8080
    
    # Deny everyone else (ufw evaluates rules in order, so the allow rules above must come first)
    ufw deny 9100
    ufw deny 8080

    Verify both exporters respond locally:

    curl -s http://localhost:9100/metrics | head -20
    curl -s http://localhost:8080/metrics | head -20

    Deploy the Monitoring Stack

    Step 1 — Directory structure

    mkdir -p /opt/monitoring/{prometheus,grafana,alertmanager}
    cd /opt/monitoring

    Step 2 — Prometheus configuration

    /opt/monitoring/prometheus/prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093
    
    rule_files:
      - "alert-rules.yml"
    
    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]
    
      - job_name: "node-exporter"
        static_configs:
          - targets:
              - "PRODUCTION_VPS_1_IP:9100"
              - "PRODUCTION_VPS_2_IP:9100"
              - "PRODUCTION_VPS_3_IP:9100"
            labels:
              environment: "production"
    
      - job_name: "cadvisor"
        static_configs:
          - targets:
              - "PRODUCTION_VPS_1_IP:8080"
              - "PRODUCTION_VPS_2_IP:8080"
              - "PRODUCTION_VPS_3_IP:8080"
            labels:
              environment: "production"
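If the fleet changes often, editing prometheus.yml for every new server gets tedious. Prometheus also supports file-based service discovery, where targets live in a JSON file that Prometheus re-reads automatically. A sketch, with an assumed file path that is not part of the setup above:

```yaml
# Sketch: file-based service discovery as an alternative to static_configs.
# Prometheus watches the listed files and applies changes without a restart.
# The file path and its contents are assumptions; adjust to your layout.
scrape_configs:
  - job_name: "node-exporter"
    file_sd_configs:
      - files:
          - "targets/node-exporter.json"

# targets/node-exporter.json:
# [
#   {
#     "targets": ["PRODUCTION_VPS_1_IP:9100"],
#     "labels": { "environment": "production" }
#   }
# ]
```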

    Step 3 — Alert rules

    /opt/monitoring/prometheus/alert-rules.yml
    groups:
      - name: host-alerts
        rules:
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
    
          - alert: HighMemoryUsage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
    
          - alert: DiskSpaceCritical
            expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 90
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Critical disk space on {{ $labels.instance }}"
    
          - alert: NodeDown
            expr: up{job="node-exporter"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Node exporter unreachable: {{ $labels.instance }}"
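Alert expressions are easy to get subtly wrong, and promtool (bundled with Prometheus) can unit-test them before they ever fire. A minimal sketch of a test file for the NodeDown rule, saved next to the rules (the instance address is a placeholder):

```yaml
# /opt/monitoring/prometheus/alert-tests.yml (filename is an assumption)
rule_files:
  - alert-rules.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # up == 0 for nine consecutive samples (about 2 minutes)
      - series: 'up{job="node-exporter", instance="10.0.0.1:9100"}'
        values: "0x8"
    alert_rule_test:
      - eval_time: 2m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: "10.0.0.1:9100"
            exp_annotations:
              summary: "Node exporter unreachable: 10.0.0.1:9100"
```

Run it with promtool test rules alert-tests.yml; promtool check rules alert-rules.yml is a quicker syntax-only check.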

    Step 4 — Alertmanager configuration

    /opt/monitoring/alertmanager/alertmanager.yml
    global:
      resolve_timeout: 5m
    
    route:
      receiver: "discord-notifications"
      group_by: ["alertname", "instance"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    
      routes:
        - match:  # newer Alertmanager releases prefer the equivalent matchers syntax
            severity: critical
          receiver: "discord-critical"
          repeat_interval: 1h
    
    receivers:
      - name: "discord-notifications"
        discord_configs:
          - webhook_url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_URL"
    
      - name: "discord-critical"
        discord_configs:
          - webhook_url: "https://discord.com/api/webhooks/YOUR_CRITICAL_WEBHOOK_URL"
    
    inhibit_rules:
      - source_match:
          severity: "critical"
        target_match:
          severity: "warning"
        equal: ["instance"]

    Step 5 — Full stack Compose file

    /opt/monitoring/docker-compose.yml
    services:
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        restart: unless-stopped
        ports:
          - "127.0.0.1:9090:9090"
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=30d'
          - '--web.enable-lifecycle'
        volumes:
          - ./prometheus:/etc/prometheus
          - prometheus-data:/prometheus
        networks:
          - monitoring
    
      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        restart: unless-stopped
        ports:
          - "127.0.0.1:3000:3000"
        environment:
          GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
          GF_USERS_ALLOW_SIGN_UP: "false"
          GF_SERVER_DOMAIN: "grafana.yourdomain.com"
          GF_SERVER_ROOT_URL: "https://grafana.yourdomain.com"
        volumes:
          - grafana-data:/var/lib/grafana
        networks:
          - monitoring
        depends_on:
          - prometheus
    
      alertmanager:
        image: prom/alertmanager:latest
        container_name: alertmanager
        restart: unless-stopped
        ports:
          - "127.0.0.1:9093:9093"
        command:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
          - '--storage.path=/alertmanager'
        volumes:
          - ./alertmanager:/etc/alertmanager
          - alertmanager-data:/alertmanager
        networks:
          - monitoring
    
    volumes:
      prometheus-data:
      grafana-data:
      alertmanager-data:
    
    networks:
      monitoring:
        driver: bridge

    All services are bound to 127.0.0.1 only — access goes through your reverse proxy.

    Generate a random Grafana admin password and start the stack:

    echo "GRAFANA_PASSWORD=$(openssl rand -base64 24)" > /opt/monitoring/.env
    cd /opt/monitoring && docker compose up -d

    Grafana Setup

    Add Prometheus as a data source

    1. Log into Grafana as admin (password from your .env file)
    2. Go to Connections > Data Sources > Add data source
    3. Select Prometheus
    4. URL: http://prometheus:9090
    5. Click Save & test

    Import community dashboards

    • Node Exporter Full (ID: 1860) — Standard host metrics dashboard with CPU, memory, disk, network
    • Docker and cAdvisor (ID: 893) — Per-container CPU, memory, and network usage
    • Node Exporter for Prometheus (ID: 11074) — Cleaner, more modern layout

    Import via Dashboards > Import, enter the ID, select your Prometheus data source.

    Build a custom fleet overview dashboard

    Create a new dashboard and add panels with these PromQL queries:

    Memory usage across all instances
    100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
    CPU usage across all instances
    100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    Disk usage (root filesystem)
    100 * (1 - (node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}))
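As a sanity check on what the memory panel computes, here is the same formula applied by hand to assumed sample values (2 GiB total, 512 MiB available):

```shell
# 100 * (1 - MemAvailable / MemTotal), with assumed sample values
MEM_TOTAL=2147483648   # 2 GiB
MEM_AVAIL=536870912    # 512 MiB

awk -v t="$MEM_TOTAL" -v a="$MEM_AVAIL" \
  'BEGIN { printf "%.1f%% memory used\n", 100 * (1 - a / t) }'   # prints "75.0% memory used"
```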

    PromQL Basics

    A few patterns cover most use cases:

    Rate of increase over time

    rate(metric_name[5m])

    Aggregation across instances

    avg by(instance) (metric)
    sum by(instance) (metric)
    max by(instance) (metric)

    Filter by label

    metric{label="value"}
    metric{label!="value"}
    metric{label=~"regex"}

    Combine metrics

    # Total network I/O across all interfaces, all instances
    sum by(instance) (rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m]))

    Retention & Storage Planning

    Rough estimate: each Node Exporter target at 15-second scrape intervals uses approximately 50–100 MB of Prometheus storage per month.

    5 servers × ~75 MB/month = ~375 MB

    Adjust retention via the --storage.tsdb.retention.time flag. Common values: 15d, 30d, 90d, 1y.
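The fleet estimate above generalizes into a quick shell check; every input here is an assumption (the per-target figure is the midpoint of the stated 50–100 MB range), so substitute your own numbers:

```shell
# Back-of-the-envelope Prometheus storage estimate (all inputs are assumptions)
SERVERS=5
MB_PER_TARGET_PER_MONTH=75   # midpoint of the 50-100 MB estimate
RETENTION_DAYS=90            # should match --storage.tsdb.retention.time

echo "$(( SERVERS * MB_PER_TARGET_PER_MONTH * RETENTION_DAYS / 30 )) MB"   # prints "1125 MB"
```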

    Check actual on-disk usage of the Prometheus data volume:

    du -sh /var/lib/docker/volumes/monitoring_prometheus-data

    When to Use This vs. Lighter Tools

    Stick with Beszel + Uptime Kuma/Gatus

    • Running 1–5 servers with standard workloads
    • Need to know when something is broken, not why
    • Total monitoring budget is a single low-cost VPS
    • Don't need custom dashboards or ad-hoc queries

    Add Prometheus + Grafana

    • Debugging performance problems requiring metric correlation
    • Need application-level metrics from your own code
    • Team needs shared dashboards for postmortems
    • Running 10+ servers and need trend analysis

    There is no reason you cannot run both. Many operators run Beszel permanently and spin up the full Prometheus stack only when debugging specific issues.

    What's Next

    You now have four monitoring tools configured. All of them can send alerts, but alerts are only useful if they reach the right people at the right time.

    Part 5 covers:

    • PagerDuty escalation policies for on-call routing
    • Alertmanager grouping and inhibition rules
    • Alert fatigue prevention strategies
    • Runbooks and lightweight incident response process