VPS Monitoring & Observability Series
    Part 5 of 5

    Alerting & Incident Response

    Intelligent alert routing, fatigue prevention, PagerDuty escalation, runbooks, and a lightweight incident process that actually works.

    Applies to all monitoring tools
    Prerequisites

    Any monitoring tools from Parts 1–4

    Time to Complete

    35–45 minutes

    Key Tools

    PagerDuty (free plan), Alertmanager, Slack/Discord

    By this point in the series you have monitoring. What you might not have is a coherent plan for what happens when something goes wrong: alerts fire into a Discord channel where they get lost among team chatter, three different tools send duplicate notifications about the same incident, and nobody has documented what to actually do when the database goes down.

    This final guide covers the alerting and incident response layer that sits on top of all the tools from Parts 1–4.

    The Problem With Naive Alerting

    • Every tool sends to the same Slack channel
    • When one server has a bad night, you get 40 notifications about correlated issues
    • Alerts fire without context about what to do
    • Nobody knows whose job it is to respond at 3 AM
    • Resolved notifications generate as much noise as firing ones

    Alert fatigue is real. When alerts become background noise, critical ones get missed. The goal is to make every alert meaningful and actionable.

    Alert Taxonomy

    Critical

    Requires immediate human response, any time of day or night. Examples: production server completely down, database unreachable, SSL cert expired.

    Warning

    Requires attention within a few hours during business hours. Examples: disk at 80%, memory sustained above 85%, response times degrading.

    Info

    Worth knowing but no action required immediately. Examples: unusual traffic spike, a service restarted successfully, a maintenance window started.
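
    In Prometheus-style alert rules, this taxonomy is usually carried as a severity label that the routing later in this guide keys on. A minimal sketch (the alert name and threshold are illustrative):

    groups:
      - name: severity-taxonomy-example
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 2m
            labels:
              severity: critical   # pages a human immediately
            annotations:
              summary: "{{ $labels.instance }} has been unreachable for 2 minutes"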

    Notification Channels and Their Uses

    Channel               Best for                              Avoid for
    PagerDuty / Opsgenie  Critical alerts, guaranteed delivery  Non-critical noise
    SMS                   Critical fallback                     General monitoring
    Slack / Discord       Warning/info, team visibility         Critical alerts (not reliable enough alone)
    Email                 Info-level, weekly digests            Critical alerts (too slow)
    ntfy / Pushover       Personal setups, mobile push          Team routing

    Key insight: Do not rely on Slack or Discord for critical alerts. These platforms have outages, apps get muted, and notifications get buried. Use PagerDuty for anything that needs a guaranteed response.

    PagerDuty Setup

    PagerDuty's free Developer plan covers 5 users with full on-call scheduling, escalation policies, and multi-channel notifications.

    Escalation policy

    1. Level 1 — Primary on-call: notify immediately via phone + SMS
    2. Escalate after — 15 minutes if not acknowledged
    3. Level 2 — Secondary on-call or backup contact
    4. Repeat policy — cycle through the levels 2–3 times before the incident is recorded as unanswered
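
    Once the service and escalation policy exist, trigger a deliberate test page so your first real incident is not also your first test. One way to do that is a direct call to PagerDuty's Events API v2 using your integration key:

    curl -X POST https://events.pagerduty.com/v2/enqueue \
      -H "Content-Type: application/json" \
      -d '{
            "routing_key": "YOUR_PAGERDUTY_INTEGRATION_KEY",
            "event_action": "trigger",
            "payload": {
              "summary": "Test page - please acknowledge and resolve",
              "source": "monitoring-setup-test",
              "severity": "critical"
            }
          }'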

    Connecting to each monitoring tool

    Alertmanager:

    receivers:
      - name: "pagerduty-critical"
        pagerduty_configs:
          - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
            severity: "critical"
            details:
              instance: "{{ .GroupLabels.instance }}"

    Uptime Kuma: Settings > Notifications > PagerDuty > paste integration key.
    Gatus: Add a pagerduty section under alerting in config.yaml (a minimal sketch follows below).
    Beszel: Settings > Notifications > PagerDuty > paste integration key.
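
    For Gatus, that alerting section looks roughly like this (endpoint name, URL, and thresholds are placeholders):

    alerting:
      pagerduty:
        integration-key: "YOUR_PAGERDUTY_INTEGRATION_KEY"

    endpoints:
      - name: website
        url: "https://yourdomain.com"
        interval: 60s
        alerts:
          - type: pagerduty
            failure-threshold: 3     # alert after 3 consecutive failed checks
            send-on-resolved: true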

    Slack Integration

    Create dedicated alert channels:

    • #infra-critical — Critical alerts (mirrored from PagerDuty)
    • #infra-warnings — Warning level alerts
    • #infra-resolved — Resolved notifications (separate to reduce noise)

    Alertmanager routing with Slack

    route:
      receiver: "slack-warnings"
      group_by: ["alertname", "instance"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    
      routes:
        - match:
            severity: critical
          receiver: "pagerduty-and-slack-critical"
          repeat_interval: 1h
    
        - match:
            severity: warning
          receiver: "slack-warnings"
          repeat_interval: 4h
    
    receivers:
      - name: "pagerduty-and-slack-critical"
        pagerduty_configs:
          - routing_key: "${PAGERDUTY_INTEGRATION_KEY}"
        slack_configs:
          - api_url: "https://hooks.slack.com/services/CRITICAL_WEBHOOK_URL"
            channel: "#infra-critical"
            title: "CRITICAL: {{ .GroupLabels.alertname }}"
            send_resolved: true
    
      - name: "slack-warnings"
        slack_configs:
          - channel: "#infra-warnings"
            title: "{{ .GroupLabels.alertname }}"
            send_resolved: true
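
    Before reloading Alertmanager with a new routing tree, validate the file first. Assuming amtool is installed alongside Alertmanager, and using the built-in reload endpoint:

    amtool check-config alertmanager.yml
    curl -X POST http://localhost:9093/-/reload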

    Good alert messages include

    • What is broken (service name, instance)
    • Why it triggered (the metric value that crossed the threshold)
    • A link to the relevant Grafana dashboard
    • A link to the runbook for this alert type
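
    In Prometheus rule terms, the last two items usually live in annotations. A sketch, with hypothetical Grafana and runbook-repo URLs:

    annotations:
      summary: "Disk on {{ $labels.instance }} at {{ $value | humanize }}%"
      dashboard: "https://grafana.yourdomain.com/d/node-overview?var-instance={{ $labels.instance }}"
      runbook: "https://github.com/yourorg/runbooks/blob/main/high-disk-usage.md"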

    Discord Integration

    Works similarly to Slack; Alertmanager has shipped native discord_configs support since v0.25.0. Create separate webhooks for different channels/severity levels:

    receivers:
      - name: "discord-warnings"
        discord_configs:
          - webhook_url: "${DISCORD_WEBHOOK_WARNING}"
            title: "{{ .GroupLabels.alertname }}"
    
      - name: "discord-critical"
        discord_configs:
          - webhook_url: "${DISCORD_WEBHOOK_CRITICAL}"
            title: "CRITICAL: {{ .GroupLabels.alertname }}"
            message: |
              @here
              {{ range .Alerts }}
              **{{ .Annotations.summary }}**
              {{ end }}

    Email Alerting

    Best for weekly summaries, info-level notifications, and compliance audit trails.

    global:
      smtp_smarthost: 'smtp.mailgun.org:587'
      smtp_from: 'alerts@yourdomain.com'
      smtp_auth_username: 'alerts@yourdomain.com'
      smtp_auth_password: "${SMTP_PASSWORD}"
      smtp_require_tls: true
    
    receivers:
      - name: "email-weekly-digest"
        email_configs:
          - to: "ops@yourdomain.com"
            send_resolved: true
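
    Alertmanager has no built-in digest scheduler, but a long repeat_interval on an info-level route approximates a weekly cadence. A sketch (this nests under the top-level route from the Slack section):

    routes:
      - match:
          severity: info
        receiver: "email-weekly-digest"
        group_interval: 24h     # batch info alerts into at most one email per day per group
        repeat_interval: 168h   # a still-firing info alert resends at most weekly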

    Preventing Alert Fatigue

    Deduplicate with grouping

    route:
      group_by: ["instance", "alertname"]
      group_wait: 30s       # Wait 30s to collect related alerts before sending
      group_interval: 5m    # Send updates for the same group every 5 minutes
      repeat_interval: 4h   # Resend a still-firing alert every 4 hours

    Use inhibition rules

    If an instance is down (critical), suppress all warning-level alerts for that same instance:

    inhibit_rules:
      - source_match:
          alertname: NodeDown                            # while this critical alert fires...
        target_match_re:
          alertname: "HighCPU|HighMemory|DiskSpaceLow"   # ...suppress these warnings
        equal: ["instance"]                              # but only on the same instance

    Set appropriate time windows

    A CPU spike that lasts 30 seconds is normal. Use the for clause:

    - alert: HighCPUUsage
      expr: cpu_usage > 85
      for: 5m   # Must be true for 5 minutes before alerting
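
    The cpu_usage metric above is shorthand. With node_exporter from Part 4 and default metric names, the same rule with a real expression looks like:

    - alert: HighCPUUsage
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
      for: 5m
      labels:
        severity: warning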

    Separate resolved notifications

    Route resolved notifications to a separate #infra-resolved channel to keep firing alerts readable.

    Runbooks

    A runbook answers: "this alert fired — what do I do?" They do not need to be long. They need to be fast to execute under stress.

    Example: HighDiskUsage Runbook
    # Alert: HighDiskUsage
    
    ## What this means
    Disk usage on the affected instance has exceeded 80%.
    
    ## Immediate triage (< 5 minutes)
    1. SSH into the affected server: ssh deploy@INSTANCE_IP
    2. Check which filesystem is affected: df -h
    3. Find the largest directories: du -sh /* 2>/dev/null | sort -rh | head -20
    
    ## Common causes and fixes
    
    ### Docker data consuming space
    docker system prune -f --volumes   # WARNING: --volumes also deletes unused named volumes
    journalctl --vacuum-size=500M      # shrink the systemd journal to 500 MB
    
    ### Application logs
    find /var/log -name "*.log" -size +100M   # locate oversized log files
    truncate -s 0 /var/log/problematic.log    # empty in place; safe while the file is held open
    
    ## Escalate if
    - Disk is over 95% and you cannot free space within 15 minutes
    - The filling directory is an application data directory you do not recognize

    Keep runbooks in a Git repository. Link directly from your alert messages.

    Incident Response Process

    1. Detect

    An alert fires. PagerDuty pages the on-call person.

    2. Acknowledge

    The on-call engineer acknowledges within your SLA (common: 15 minutes for critical). Acknowledging stops the escalation chain.

    3. Investigate

    Follow the runbook. Use Grafana to correlate metrics. Determine root cause vs. symptom.

    4. Resolve

    Take corrective action. Confirm in monitoring that metrics return to normal.

    5. Postmortem

    For anything with user impact: what happened, what the impact was, what you changed to prevent recurrence.

    Incident communication template

    [INVESTIGATING] We are investigating reports of [service] being unavailable.
    We will update this page every 30 minutes. Started: 14:32 UTC.
    
    [UPDATE 15:00 UTC] Root cause identified as [brief description].
    Fix in progress, estimated resolution: 15:45 UTC.
    
    [RESOLVED 15:42 UTC] Service has been restored. Root cause: [brief description].
    A full postmortem will be posted within 48 hours.

    Recommended Alert Routing Map

    Alert fires
      │
      ├─► severity: critical
      │     ├─► PagerDuty (guaranteed delivery, on-call escalation)
      │     └─► #infra-critical (Slack/Discord - team visibility)
      │
      ├─► severity: warning
      │     └─► #infra-warnings (Slack/Discord - monitored during business hours)
      │
      └─► severity: info
            └─► Email digest (weekly, low priority) or suppressed entirely
    
    Alert resolves
      └─► #infra-resolved (separate channel, all severities)

    Store all webhook URLs and API keys in environment variables. Use Alertmanager as the single routing hub rather than configuring channels separately in every monitoring tool.

    Series Summary

    • Part 1 — Beszel — Hub-and-agent host metrics for your entire fleet at ~23 MB RAM per agent
    • Part 2 — Uptime Kuma — External endpoint monitoring and public status pages
    • Part 3 — Gatus — GitOps-friendly health checks in version-controlled YAML
    • Part 4 — Grafana + Prometheus — Full observability with ad-hoc querying and custom dashboards
    • Part 5 — Alerting — Coherent routing, deduplication, runbooks, and incident process

    You do not need all of it at once. Most RamNode customers running a handful of servers will get 90% of the value from just Parts 1 and 2. The monitoring stack should cost you roughly $4–8/month in additional RamNode VPS capacity — one low-cost instance for Beszel + Uptime Kuma/Gatus, and a slightly larger instance if you add the Prometheus stack.