Alerting & Incident Response
Intelligent alert routing, fatigue prevention, PagerDuty escalation, runbooks, and a lightweight incident process that actually works.
Prerequisites: Any monitoring tools from Parts 1–4
Time: 35–45 minutes
Tools: PagerDuty (free plan), Alertmanager, Slack/Discord
By this point in the series you have monitoring. What you might not have is a coherent plan for what happens when something goes wrong. Alerts firing into a Discord channel where they get lost among team chatter, duplicate notifications from three different tools about the same incident, and no documented steps for what to actually do when the database goes down.
This final guide covers the alerting and incident response layer that sits on top of all the tools from Parts 1–4.
The Problem With Naive Alerting
- Every tool sends to the same Slack channel
- When one server has a bad night, you get 40 notifications about correlated issues
- Alerts fire without context about what to do
- Nobody knows whose job it is to respond at 3 AM
- Resolved notifications generate as much noise as firing ones
Alert fatigue is real. When alerts become background noise, critical ones get missed. The goal is to make every alert meaningful and actionable.
Alert Taxonomy
Critical
Requires immediate human response, any time of day or night. Examples: production server completely down, database unreachable, SSL cert expired.
Warning
Requires attention within a few hours during business hours. Examples: disk at 80%, memory sustained above 85%, response times degrading.
Info
Worth knowing but no action required immediately. Examples: unusual traffic spike, a service restarted successfully, a maintenance window started.
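In Prometheus-style alerting rules, this taxonomy becomes a `severity` label that the router matches on later. A sketch, with illustrative metric names and thresholds (adjust both to your exporters):

```yaml
groups:
  - name: severity-examples
    rules:
      # Critical: page a human, any hour of the day
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
      # Warning: attention during business hours
      - alert: DiskSpaceLow
        expr: 100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk above 80% on {{ $labels.instance }}"
```

Everything downstream, from routing to inhibition, keys off that one label, so agree on the taxonomy before writing rules.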
Notification Channels and Their Uses
| Channel | Best for | Avoid for |
|---|---|---|
| PagerDuty / Opsgenie | Critical alerts, guaranteed delivery | Non-critical noise |
| SMS | Critical fallback | General monitoring |
| Slack / Discord | Warning/info, team visibility | Critical alerts (not reliable enough alone) |
| Email | Info-level, weekly digests | Critical alerts (too slow) |
| ntfy / Pushover | Personal setups, mobile push | Team routing |
Key insight: Do not rely on Slack or Discord for critical alerts. These platforms have outages, apps get muted, and notifications get buried. Use PagerDuty for anything that needs a guaranteed response.
PagerDuty Setup
PagerDuty's free Developer plan covers 5 users with full on-call scheduling, escalation policies, and multi-channel notifications.
Escalation policy
- Level 1 — Primary on-call: notify immediately via phone + SMS
- Escalate after — 15 minutes if not acknowledged
- Level 2 — Secondary on-call or backup contact
- Repeat policy — cycle through the levels 2–3 times before marking the incident as unacknowledged
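Once the policy exists, you can verify the integration end to end by sending a synthetic event through PagerDuty's Events API v2. A sketch: the routing key is a placeholder, and the actual `curl` call is left commented out so you trigger a test page deliberately, not by accident:

```shell
# Placeholder -- replace with your Events API v2 integration key
ROUTING_KEY="YOUR_PAGERDUTY_INTEGRATION_KEY"

# Build an Events API v2 "trigger" payload for a synthetic test page
PAYLOAD=$(cat <<EOF
{
  "routing_key": "${ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "Synthetic test: verifying escalation policy",
    "source": "manual-test",
    "severity": "critical"
  }
}
EOF
)
echo "$PAYLOAD"

# Uncomment to actually send the test event:
# curl -s -X POST "https://events.pagerduty.com/v2/enqueue" \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```

Send one test event, let it escalate without acknowledging, and confirm that Level 2 actually gets paged at the 15-minute mark.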
Connecting to each monitoring tool
Alertmanager:
```yaml
receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
        severity: "critical"
        details:
          instance: "{{ .GroupLabels.instance }}"
```

Uptime Kuma: Settings > Notifications > PagerDuty > paste integration key.
Gatus: Add a `pagerduty` section under `alerting` in `config.yaml`.
Beszel: Settings > Notifications > PagerDuty > paste integration key.
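For Gatus specifically, the configuration sketch below follows its documented schema; the endpoint name, URL, and thresholds are illustrative:

```yaml
alerting:
  pagerduty:
    integration-key: "YOUR_PAGERDUTY_INTEGRATION_KEY"

endpoints:
  - name: example-api            # illustrative endpoint
    url: "https://example.com/health"
    interval: 60s
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: pagerduty
        failure-threshold: 3     # alert after 3 consecutive failed checks
        send-on-resolved: true
```

The `failure-threshold` is worth tuning: paging on a single failed check defeats the fatigue-prevention work later in this guide.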
Slack Integration
Create dedicated alert channels:
- #infra-critical — Critical alerts (mirrored from PagerDuty)
- #infra-warnings — Warning-level alerts
- #infra-resolved — Resolved notifications (separate to reduce noise)
Alertmanager routing with Slack
```yaml
route:
  receiver: "slack-warnings"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-and-slack-critical"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "slack-warnings"
      repeat_interval: 4h

receivers:
  - name: "pagerduty-and-slack-critical"
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_INTEGRATION_KEY}"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/CRITICAL_WEBHOOK_URL"
        channel: "#infra-critical"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        send_resolved: true
  - name: "slack-warnings"
    slack_configs:
      - channel: "#infra-warnings"
        title: "{{ .GroupLabels.alertname }}"
        send_resolved: true
```

Good alert messages include
- What is broken (service name, instance)
- Why it triggered (the metric value that crossed the threshold)
- A link to the relevant Grafana dashboard
- A link to the runbook for this alert type
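One way to get all four into a Slack message is to carry the links as alert annotations and render them in the receiver template. A sketch: `dashboard_url` and `runbook_url` are annotation names you would define yourself in your alert rules, not built-in Alertmanager fields:

```yaml
- name: "slack-warnings"
  slack_configs:
    - channel: "#infra-warnings"
      title: "{{ .GroupLabels.alertname }} on {{ .GroupLabels.instance }}"
      text: >-
        {{ range .Alerts }}
        What: {{ .Annotations.summary }}
        Why: {{ .Annotations.description }}
        Dashboard: {{ .Annotations.dashboard_url }}
        Runbook: {{ .Annotations.runbook_url }}
        {{ end }}
      send_resolved: true
```

Defining the links once as annotations means every channel (Slack, email, PagerDuty details) can reuse them.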
Discord Integration
Works similarly to Slack. Create separate webhooks for different channels/severity levels:
```yaml
receivers:
  - name: "discord-warnings"
    discord_configs:
      - webhook_url: "${DISCORD_WEBHOOK_WARNING}"
        title: "{{ .GroupLabels.alertname }}"
  - name: "discord-critical"
    discord_configs:
      - webhook_url: "${DISCORD_WEBHOOK_CRITICAL}"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        message: |
          @here
          {{ range .Alerts }}
          **{{ .Annotations.summary }}**
          {{ end }}
```

Email Alerting
Best for weekly summaries, info-level notifications, and compliance audit trails.
```yaml
global:
  smtp_smarthost: 'smtp.mailgun.org:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: "${SMTP_PASSWORD}"
  smtp_require_tls: true

receivers:
  - name: "email-weekly-digest"
    email_configs:
      - to: "ops@yourdomain.com"
        send_resolved: true
```

Preventing Alert Fatigue
Deduplicate with grouping
```yaml
route:
  group_by: ["instance", "alertname"]
  group_wait: 30s       # Wait 30s to collect related alerts before sending
  group_interval: 5m    # Send updates for the same group every 5 minutes
  repeat_interval: 4h   # Resend a still-firing alert every 4 hours
```

Use inhibition rules
If an instance is down (critical), suppress all warning-level alerts for that same instance:
```yaml
inhibit_rules:
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: "HighCPU|HighMemory|DiskSpaceLow"
    equal: ["instance"]
```

Set appropriate time windows
A CPU spike that lasts 30 seconds is normal. Use the `for` clause:
```yaml
- alert: HighCPUUsage
  expr: cpu_usage > 85
  for: 5m  # Must be true for 5 minutes before alerting
```

Separate resolved notifications
Route resolved notifications to a separate #infra-resolved channel to keep firing alerts readable.
Runbooks
A runbook answers: "this alert fired — what do I do?" They do not need to be long. They need to be fast to execute under stress.
```markdown
# Alert: HighDiskUsage

## What this means
Disk usage on the affected instance has exceeded 80%.

## Immediate triage (< 5 minutes)
1. SSH into the affected server: `ssh deploy@INSTANCE_IP`
2. Check which filesystem is affected: `df -h`
3. Find the largest directories: `du -sh /* 2>/dev/null | sort -rh | head -20`

## Common causes and fixes

### Docker logs consuming space
    docker system prune -f --volumes
    journalctl --vacuum-size=500M

### Application logs
    find /var/log -name "*.log" -size +100M
    truncate -s 0 /var/log/problematic.log

## Escalate if
- Disk is over 95% and you cannot free space within 15 minutes
- The filling directory is an application data directory you do not recognize
```

Keep runbooks in a Git repository. Link directly from your alert messages.
Incident Response Process
1. Detect
An alert fires. PagerDuty pages the on-call person.
2. Acknowledge
The on-call engineer acknowledges within your SLA (common: 15 minutes for critical), which stops further escalation.
3. Investigate
Follow the runbook. Use Grafana to correlate metrics. Determine root cause vs. symptom.
4. Resolve
Take corrective action. Confirm in monitoring that metrics return to normal.
5. Postmortem
For anything with user impact: what happened, what the impact was, what you changed to prevent recurrence.
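Step 5 is easier when the postmortem has a fixed shape. One possible minimal template (the headings and timestamps below are placeholders to fill in):

```markdown
# Postmortem: <incident title> (<date>)

## Summary
One paragraph: what broke, for how long, and who was affected.

## Timeline (UTC)
- 14:32 Alert fired
- 14:35 Acknowledged by on-call
- 15:42 Resolved

## Root cause
What actually failed, as opposed to the symptom that alerted.

## What we are changing
- [ ] Action item with an owner and a due date
```

Keep these in the same Git repository as your runbooks so fixes and lessons live next to the procedures they changed.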
Incident communication template
```
[INVESTIGATING] We are investigating reports of [service] being unavailable.
We will update this page every 30 minutes. Started: 14:32 UTC.

[UPDATE 15:00 UTC] Root cause identified as [brief description].
Fix in progress, estimated resolution: 15:45 UTC.

[RESOLVED 15:42 UTC] Service has been restored. Root cause: [brief description].
A full postmortem will be posted within 48 hours.
```

Recommended Alert Routing Map
```
Alert fires
│
├─► severity: critical
│     ├─► PagerDuty (guaranteed delivery, on-call escalation)
│     └─► #infra-critical (Slack/Discord - team visibility)
│
├─► severity: warning
│     └─► #infra-warnings (Slack/Discord - monitored during business hours)
│
└─► severity: info
      └─► Email digest (weekly, low priority) or suppressed entirely

Alert resolves
└─► #infra-resolved (separate channel, all severities)
```

Store all webhook URLs and API keys in environment variables. Use Alertmanager as the single routing hub rather than configuring channels separately in every monitoring tool.
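One caveat about environment variables: Alertmanager does not expand `${...}` references inside its config file on its own, so a common pattern is to render the real config from a template at deploy time. A sketch using `sed`; the key value and file names are placeholders:

```shell
# Hypothetical secret value -- in practice this comes from your secret store
export PAGERDUTY_INTEGRATION_KEY="abc123"

# Template with a ${...} placeholder (written inline here for the demo)
cat > alertmanager.yml.tmpl <<'EOF'
receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_INTEGRATION_KEY}"
EOF

# Render the deployable config by substituting the placeholder
sed "s/\${PAGERDUTY_INTEGRATION_KEY}/${PAGERDUTY_INTEGRATION_KEY}/" \
  alertmanager.yml.tmpl > alertmanager.yml

cat alertmanager.yml
```

Run this in your deploy script (or use `envsubst` if it is available), and keep `alertmanager.yml` out of version control so only the template with placeholders is committed.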
Series Summary
- Part 1 — Beszel — Hub-and-agent host metrics for your entire fleet at ~23 MB RAM per agent
- Part 2 — Uptime Kuma — External endpoint monitoring and public status pages
- Part 3 — Gatus — GitOps-friendly health checks in version-controlled YAML
- Part 4 — Grafana + Prometheus — Full observability with ad-hoc querying and custom dashboards
- Part 5 — Alerting — Coherent routing, deduplication, runbooks, and incident process
You do not need all of it at once. Most RamNode customers running a handful of servers will get 90% of the value from just Parts 1 and 2. The monitoring stack should cost you roughly $4–8/month in additional RamNode VPS capacity — one low-cost instance for Beszel + Uptime Kuma/Gatus, and a slightly larger instance if you add the Prometheus stack.
