Alerting & Incident Response
Intelligent alert routing, fatigue prevention, PagerDuty escalation, runbooks, and a lightweight incident process that actually works.
Prerequisites: Any monitoring tools from Parts 1–4
Time: 35–45 minutes
Tools: PagerDuty (free plan), Alertmanager, Slack/Discord
By this point in the series you have monitoring. What you might not have is a coherent plan for what happens when something goes wrong. Alerts firing into a Discord channel where they get lost among team chatter, duplicate notifications from three different tools about the same incident, and no documented steps for what to actually do when the database goes down.
This final guide covers the alerting and incident response layer that sits on top of all the tools from Parts 1–4.
The Problem With Naive Alerting
- Every tool sends to the same Slack channel
- When one server has a bad night, you get 40 notifications about correlated issues
- Alerts fire without context about what to do
- Nobody knows whose job it is to respond at 3 AM
- Resolved notifications generate as much noise as firing ones
Alert fatigue is real. When alerts become background noise, critical ones get missed. The goal is to make every alert meaningful and actionable.
Alert Taxonomy
Critical
Requires immediate human response, any time of day or night. Examples: production server completely down, database unreachable, SSL cert expired.
Warning
Requires attention within a few hours during business hours. Examples: disk at 80%, memory sustained above 85%, response times degrading.
Info
Worth knowing but no action required immediately. Examples: unusual traffic spike, a service restarted successfully, a maintenance window started.
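In Prometheus-style alerting rules, this taxonomy becomes a `severity` label that the router matches on later. A sketch, with illustrative metric names and thresholds (adjust both to your exporters):

```yaml
groups:
  - name: severity-examples
    rules:
      # Critical: page a human, any hour of the day
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
      # Warning: attention during business hours
      - alert: DiskSpaceLow
        expr: 100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk above 80% on {{ $labels.instance }}"
```

Everything downstream, from routing to inhibition, keys off that one label, so agree on the taxonomy before writing rules.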
Notification Channels and Their Uses
| Channel | Best for | Avoid for |
|---|---|---|
| PagerDuty / Opsgenie | Critical alerts, guaranteed delivery | Non-critical noise |
| SMS | Critical fallback | General monitoring |
| Slack / Discord | Warning/info, team visibility | Critical alerts (not reliable enough alone) |
| Email | Info-level, weekly digests | Critical alerts (too slow) |
| ntfy / Pushover | Personal setups, mobile push | Team routing |
Key insight: Do not rely on Slack or Discord for critical alerts. These platforms have outages, apps get muted, and notifications get buried. Use PagerDuty for anything that needs a guaranteed response.
PagerDuty Setup
PagerDuty's free Developer plan covers 5 users with full on-call scheduling, escalation policies, and multi-channel notifications.
Escalation policy
- Level 1 — Primary on-call: notify immediately via phone + SMS
- Escalate after — 15 minutes if not acknowledged
- Level 2 — Secondary on-call or backup contact
- Repeat policy — cycle through the levels 2–3 times before marking the incident as unacknowledged
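Once the policy exists, you can verify the integration end to end by sending a synthetic event through PagerDuty's Events API v2. A sketch: the routing key is a placeholder, and the actual `curl` call is left commented out so you trigger a test page deliberately, not by accident:

```shell
# Placeholder -- replace with your Events API v2 integration key
ROUTING_KEY="YOUR_PAGERDUTY_INTEGRATION_KEY"

# Build an Events API v2 "trigger" payload for a synthetic test page
PAYLOAD=$(cat <<EOF
{
  "routing_key": "${ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "Synthetic test: verifying escalation policy",
    "source": "manual-test",
    "severity": "critical"
  }
}
EOF
)
echo "$PAYLOAD"

# Uncomment to actually send the test event:
# curl -s -X POST "https://events.pagerduty.com/v2/enqueue" \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```

Send one test event, let it escalate without acknowledging, and confirm that Level 2 actually gets paged at the 15-minute mark.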
Connecting to each monitoring tool
Alertmanager:
```yaml
receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
        severity: "critical"
        details:
          instance: "{{ .GroupLabels.instance }}"
```

Uptime Kuma: Settings > Notifications > PagerDuty > paste integration key.
Gatus: Add a `pagerduty` section under `alerting` in `config.yaml`.
Beszel: Settings > Notifications > PagerDuty > paste integration key.
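For Gatus specifically, the configuration sketch below follows its documented schema; the endpoint name, URL, and thresholds are illustrative:

```yaml
alerting:
  pagerduty:
    integration-key: "YOUR_PAGERDUTY_INTEGRATION_KEY"

endpoints:
  - name: example-api            # illustrative endpoint
    url: "https://example.com/health"
    interval: 60s
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: pagerduty
        failure-threshold: 3     # alert after 3 consecutive failed checks
        send-on-resolved: true
```

The `failure-threshold` is worth tuning: paging on a single failed check defeats the fatigue-prevention work later in this guide.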
Slack Integration
Create dedicated alert channels:
- #infra-critical — Critical alerts (mirrored from PagerDuty)
- #infra-warnings — Warning-level alerts
- #infra-resolved — Resolved notifications (separate to reduce noise)
Alertmanager routing with Slack
```yaml
route:
  receiver: "slack-warnings"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-and-slack-critical"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "slack-warnings"
      repeat_interval: 4h

receivers:
  - name: "pagerduty-and-slack-critical"
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_INTEGRATION_KEY}"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/CRITICAL_WEBHOOK_URL"
        channel: "#infra-critical"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        send_resolved: true
  - name: "slack-warnings"
    slack_configs:
      - channel: "#infra-warnings"
        title: "{{ .GroupLabels.alertname }}"
        send_resolved: true
```

Good alert messages include
- What is broken (service name, instance)
- Why it triggered (the metric value that crossed the threshold)
- A link to the relevant Grafana dashboard
- A link to the runbook for this alert type
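One way to get all four into a Slack message is to carry the links as alert annotations and render them in the receiver template. A sketch: `dashboard_url` and `runbook_url` are annotation names you would define yourself in your alert rules, not built-in Alertmanager fields:

```yaml
- name: "slack-warnings"
  slack_configs:
    - channel: "#infra-warnings"
      title: "{{ .GroupLabels.alertname }} on {{ .GroupLabels.instance }}"
      text: >-
        {{ range .Alerts }}
        What: {{ .Annotations.summary }}
        Why: {{ .Annotations.description }}
        Dashboard: {{ .Annotations.dashboard_url }}
        Runbook: {{ .Annotations.runbook_url }}
        {{ end }}
      send_resolved: true
```

Defining the links once as annotations means every channel (Slack, email, PagerDuty details) can reuse them.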
Discord Integration
Works similarly to Slack. Create separate webhooks for different channels/severity levels:
```yaml
receivers:
  - name: "discord-warnings"
    discord_configs:
      - webhook_url: "${DISCORD_WEBHOOK_WARNING}"
        title: "{{ .GroupLabels.alertname }}"
  - name: "discord-critical"
    discord_configs:
      - webhook_url: "${DISCORD_WEBHOOK_CRITICAL}"
        title: "CRITICAL: {{ .GroupLabels.alertname }}"
        message: |
          @here
          {{ range .Alerts }}
          **{{ .Annotations.summary }}**
          {{ end }}
```

Email Alerting
Best for weekly summaries, info-level notifications, and compliance audit trails.
```yaml
global:
  smtp_smarthost: 'smtp.mailgun.org:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: "${SMTP_PASSWORD}"
  smtp_require_tls: true

receivers:
  - name: "email-weekly-digest"
    email_configs:
      - to: "ops@yourdomain.com"
        send_resolved: true
```

Preventing Alert Fatigue
Deduplicate with grouping
```yaml
route:
  group_by: ["instance", "alertname"]
  group_wait: 30s       # Wait 30s to collect related alerts before sending
  group_interval: 5m    # Send updates for the same group every 5 minutes
  repeat_interval: 4h   # Resend a still-firing alert every 4 hours
```

Use inhibition rules
If an instance is down (critical), suppress all warning-level alerts for that same instance:
```yaml
inhibit_rules:
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: "HighCPU|HighMemory|DiskSpaceLow"
    equal: ["instance"]
```

Set appropriate time windows
A CPU spike that lasts 30 seconds is normal. Use the `for` clause:
```yaml
- alert: HighCPUUsage
  expr: cpu_usage > 85
  for: 5m  # Must be true for 5 minutes before alerting
```

Separate resolved notifications
Route resolved notifications to a separate #infra-resolved channel to keep firing alerts readable.
Runbooks
A runbook answers: "this alert fired — what do I do?" They do not need to be long. They need to be fast to execute under stress.
```markdown
# Alert: HighDiskUsage

## What this means
Disk usage on the affected instance has exceeded 80%.

## Immediate triage (< 5 minutes)
1. SSH into the affected server: `ssh deploy@INSTANCE_IP`
2. Check which filesystem is affected: `df -h`
3. Find the largest directories: `du -sh /* 2>/dev/null | sort -rh | head -20`

## Common causes and fixes

### Docker logs consuming space
    docker system prune -f --volumes
    journalctl --vacuum-size=500M

### Application logs
    find /var/log -name "*.log" -size +100M
    truncate -s 0 /var/log/problematic.log

## Escalate if
- Disk is over 95% and you cannot free space within 15 minutes
- The filling directory is an application data directory you do not recognize
```

Keep runbooks in a Git repository. Link directly from your alert messages.
Incident Response Process
1. Detect
An alert fires. PagerDuty pages the on-call person.
2. Acknowledge
The on-call engineer acknowledges within your SLA (common: 15 minutes for critical), which stops further escalation.
3. Investigate
Follow the runbook. Use Grafana to correlate metrics. Determine root cause vs. symptom.
4. Resolve
Take corrective action. Confirm in monitoring that metrics return to normal.
5. Postmortem
For anything with user impact: what happened, what the impact was, what you changed to prevent recurrence.
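Step 5 is easier when the postmortem has a fixed shape. One possible minimal template (the headings and timestamps below are placeholders to fill in):

```markdown
# Postmortem: <incident title> (<date>)

## Summary
One paragraph: what broke, for how long, and who was affected.

## Timeline (UTC)
- 14:32 Alert fired
- 14:35 Acknowledged by on-call
- 15:42 Resolved

## Root cause
What actually failed, as opposed to the symptom that alerted.

## What we are changing
- [ ] Action item with an owner and a due date
```

Keep these in the same Git repository as your runbooks so fixes and lessons live next to the procedures they changed.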
Incident communication template
```
[INVESTIGATING] We are investigating reports of [service] being unavailable.
We will update this page every 30 minutes. Started: 14:32 UTC.

[UPDATE 15:00 UTC] Root cause identified as [brief description].
Fix in progress, estimated resolution: 15:45 UTC.

[RESOLVED 15:42 UTC] Service has been restored. Root cause: [brief description].
A full postmortem will be posted within 48 hours.
```

Recommended Alert Routing Map
```
Alert fires
│
├─► severity: critical
│     ├─► PagerDuty (guaranteed delivery, on-call escalation)
│     └─► #infra-critical (Slack/Discord - team visibility)
│
├─► severity: warning
│     └─► #infra-warnings (Slack/Discord - monitored during business hours)
│
└─► severity: info
      └─► Email digest (weekly, low priority) or suppressed entirely

Alert resolves
└─► #infra-resolved (separate channel, all severities)
```

Store all webhook URLs and API keys in environment variables. Use Alertmanager as the single routing hub rather than configuring channels separately in every monitoring tool.
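One caveat about environment variables: Alertmanager does not expand `${...}` references inside its config file on its own, so a common pattern is to render the real config from a template at deploy time. A sketch using `sed`; the key value and file names are placeholders:

```shell
# Hypothetical secret value -- in practice this comes from your secret store
export PAGERDUTY_INTEGRATION_KEY="abc123"

# Template with a ${...} placeholder (written inline here for the demo)
cat > alertmanager.yml.tmpl <<'EOF'
receivers:
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_INTEGRATION_KEY}"
EOF

# Render the deployable config by substituting the placeholder
sed "s/\${PAGERDUTY_INTEGRATION_KEY}/${PAGERDUTY_INTEGRATION_KEY}/" \
  alertmanager.yml.tmpl > alertmanager.yml

cat alertmanager.yml
```

Run this in your deploy script (or use `envsubst` if it is available), and keep `alertmanager.yml` out of version control so only the template with placeholders is committed.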
Series Summary
- Part 1 — Beszel — Hub-and-agent host metrics for your entire fleet at ~23 MB RAM per agent
- Part 2 — Uptime Kuma — External endpoint monitoring and public status pages
- Part 3 — Gatus — GitOps-friendly health checks in version-controlled YAML
- Part 4 — Grafana + Prometheus — Full observability with ad-hoc querying and custom dashboards
- Part 5 — Alerting — Coherent routing, deduplication, runbooks, and incident process
You do not need all of it at once. Most RamNode customers running a handful of servers will get 90% of the value from just Parts 1 and 2. The monitoring stack should cost you roughly $4–8/month in additional RamNode VPS capacity — one low-cost instance for Beszel + Uptime Kuma/Gatus, and a slightly larger instance if you add the Prometheus stack.
