At 2 AM when your server is misbehaving, you don't want to be frantically googling log parsing commands. You want answers: What's wrong? When did it start? What changed?
This guide uses Claude Code to generate log analysis tools, diagnostic scripts, and troubleshooting workflows. We'll build a toolkit that turns "something's broken" into actionable insights in seconds.
Prerequisites
- Claude Code installed (see Part 1)
- A server with logs to analyze
- Basic familiarity with common log formats
Universal Log Parser
Let's start with a flexible log analysis tool that auto-detects formats and provides powerful filtering:
"Create a universal log parser that auto-detects common log formats (syslog, nginx, Apache, JSON, Docker), extracts timestamps and severity levels, supports time-range filtering, provides frequency analysis, and includes presets for errors, traffic, auth, and slow requests."
#!/bin/bash
set -uo pipefail
# =============================================================================
# Universal Log Parser
# =============================================================================
VERSION="1.0.0"
SCRIPT_NAME=$(basename "$0")
# Colors
RED='\033[0;31m'
YELLOW='\033[0;33m'
GREEN='\033[0;32m'
NC='\033[0m'
# Defaults
OUTPUT_FORMAT="human"
TIME_RANGE=""
PATTERN=""
PRESET=""
FOLLOW=false
LIMIT=100
# Log Format Detection
detect_format() {
local sample="$1"
# JSON format
if echo "$sample" | head -1 | jq -e . >/dev/null 2>&1; then
echo "json"; return
fi
# Nginx access log
if echo "$sample" | grep -qE '^[0-9.]+ - .* \[.*\] "(GET|POST|PUT|DELETE|HEAD)'; then
echo "nginx_access"; return
fi
# Syslog format
if echo "$sample" | grep -qE '^[A-Z][a-z]{2} [ 0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}'; then
echo "syslog"; return
fi
# Docker logs
if echo "$sample" | grep -qE '^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z'; then
echo "docker"; return
fi
echo "unknown"
}
# Preset Patterns
apply_preset() {
local preset="$1"
case "$preset" in
errors)
PATTERN='(error|fail|fatal|critical|panic|exception|traceback)'
;;
traffic)
PATTERN='(GET|POST|PUT|DELETE|PATCH|HEAD|OPTIONS)'
;;
auth)
PATTERN='(auth|login|logout|password|session|sudo|sshd|pam_|accepted|failed)'
;;
slow)
PATTERN='([0-9]{4,}ms|slow|timeout|timed out|deadline exceeded)'
;;
esac
}
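# Time-range filtering (sketch)
# The excerpt above declares TIME_RANGE but omits the filtering logic; a minimal
# sketch, assuming GNU date and relative ranges like "30m", "1h", or "2d":
range_to_epoch() {
local range="$1" value unit
value="${range%[smhd]}"
unit="${range##*[0-9]}"
case "$unit" in
s) date -d "$value seconds ago" +%s ;;
m) date -d "$value minutes ago" +%s ;;
h) date -d "$value hours ago" +%s ;;
d) date -d "$value days ago" +%s ;;
*) date -d "$range ago" +%s ;;
esac
}
# Each per-format parser would then drop lines whose timestamp is older than
# "$(range_to_epoch "$TIME_RANGE")".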
# Colorize by severity
colorize_line() {
local line="$1"
if echo "$line" | grep -qiE '(error|fail|fatal|critical|panic)'; then
echo -e "${RED}${line}${NC}"
elif echo "$line" | grep -qiE '(warn|warning)'; then
echo -e "${YELLOW}${line}${NC}"
elif echo "$line" | grep -qiE '(info|notice)'; then
echo -e "${GREEN}${line}${NC}"
else
echo "$line"
fi
}
# Show errors from syslog
./logparse.sh -p errors /var/log/syslog
# Analyze last hour of auth failures
./logparse.sh -t 1h -g "failed" /var/log/auth.log
# Traffic analysis with stats
./logparse.sh -p traffic -s /var/log/nginx/access.log
# Follow logs in real-time
./logparse.sh -f --grep "ERROR" /var/log/app/app.log
# Pipe from journalctl
journalctl -u nginx | ./logparse.sh -p errors
System Diagnostic Script
When something's wrong, you want to gather all the relevant information quickly, ideally in under 30 seconds:
"Create a comprehensive system diagnostic script that collects system info, lists recent errors from journald, shows resource-heavy processes, checks service status, reviews network connections, identifies recent changes, and generates a single report."
#!/bin/bash
set -uo pipefail
# =============================================================================
# System Diagnostic Script
# Quick system health check for troubleshooting
# =============================================================================
section() { echo -e "\n\033[1m━━━ $1 ━━━\033[0m\n"; }
ok() { echo -e "\033[0;32m✓\033[0m $1"; }
warn() { echo -e "\033[0;33m⚠\033[0m $1"; }
error() { echo -e "\033[0;31m✗\033[0m $1"; }
echo "╔════════════════════════════════════════════════════════════╗"
echo "║ SYSTEM DIAGNOSTIC REPORT ║"
echo "║ Generated: $(date '+%Y-%m-%d %H:%M:%S') ║"
echo "╚════════════════════════════════════════════════════════════╝"
# System Overview
section "SYSTEM OVERVIEW"
echo "Uptime: $(uptime -p)"
load_1m=$(uptime | grep -oE 'load average: [0-9., ]+' | cut -d: -f2 | cut -d, -f1 | tr -d ' ')
cpu_count=$(nproc)
echo "Load Average: $load_1m (CPUs: $cpu_count)"
[[ $(echo "$load_1m > $cpu_count" | bc -l) -eq 1 ]] && warn "Load is high!" || ok "Load is normal"
# Memory
free -h | head -2
mem_percent=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')
[[ "$mem_percent" -gt 90 ]] && error "Memory critical: ${mem_percent}%" || ok "Memory: ${mem_percent}%"
# Disk Usage
section "DISK USAGE"
df -h -x tmpfs -x devtmpfs | head -10
df -h | awk 'NR>1 {gsub(/%/,"",$5); if($5>90) print " ✗ " $6 " is " $5 "% full!"}'
# Recent Errors
section "RECENT ERRORS (Last Hour)"
# grep -v '^--' drops journalctl's "-- No entries --" marker so a quiet hour counts as 0
error_count=$(journalctl --since "1 hour ago" -p err --no-pager 2>/dev/null | grep -cv '^--' || true)
[[ "$error_count" -gt 0 ]] && warn "Found $error_count error entries" || ok "No errors"
# OOM Kills
# grep -c already prints 0 when nothing matches; "|| true" just absorbs its non-zero exit
oom_kills=$(journalctl --since "24 hours ago" 2>/dev/null | grep -c "Out of memory" || true)
[[ "$oom_kills" -gt 0 ]] && error "OOM Killer invoked $oom_kills time(s)!"
# Failed Services
section "SERVICE STATUS"
failed=$(systemctl --failed --no-pager --no-legend 2>/dev/null | wc -l)
[[ "$failed" -gt 0 ]] && error "Failed services: $failed" || ok "All services running"
# Docker (if installed)
if command -v docker &>/dev/null; then
section "DOCKER"
docker ps -a --format "table {{.Names}}\t{{.Status}}" | head -10
unhealthy=$(docker ps --filter "health=unhealthy" --format "{{.Names}}")
[[ -n "$unhealthy" ]] && error "Unhealthy: $unhealthy"
fi
Application-Specific Debuggers
Generate debuggers for common applications with quick, actionable output:
#!/bin/bash
set -uo pipefail
# =============================================================================
# Nginx Debug Script
# =============================================================================
section() { echo -e "\n\033[1m=== $1 ===\033[0m\n"; }
echo "Nginx Diagnostic Report - $(date)"
# Configuration Test
section "CONFIGURATION"
nginx -t 2>&1 && echo "✓ Config syntax valid" || echo "✗ Config has errors!"
# Process Status
section "PROCESS STATUS"
[[ -n "$(pgrep nginx)" ]] && echo "✓ Nginx running" || { echo "✗ Nginx not running!"; exit 1; }
# Active Connections
section "ACTIVE CONNECTIONS"
curl -sf http://127.0.0.1/nginx_status 2>/dev/null || \
echo "Established: $(ss -tn state established '( dport = :80 or dport = :443 )' | wc -l)"
# Recent Errors
section "RECENT ERRORS"
error_log="/var/log/nginx/error.log"
[[ -f "$error_log" ]] && tail -100 "$error_log" | grep -cE '\[(error|crit)\]'
# Upstream Health
section "UPSTREAM STATUS"
grep -rhoE 'proxy_pass http://[^;]+' /etc/nginx/ 2>/dev/null | \
sed 's/proxy_pass //' | sort -u | while read -r upstream; do
host=$(echo "$upstream" | sed 's|http://||' | cut -d/ -f1)
port=${host##*:}; [[ "$port" == "$host" ]] && port=80  # default to 80 when no port is given
timeout 2 bash -c "echo >/dev/tcp/${host%%:*}/${port}" 2>/dev/null \
&& echo "✓ $host reachable" || echo "✗ $host unreachable"
done
#!/bin/bash
set -uo pipefail
# =============================================================================
# PostgreSQL Debug Script
# =============================================================================
PGUSER="${PGUSER:-postgres}"
section() { echo -e "\n\033[1m=== $1 ===\033[0m\n"; }
run_query() { sudo -u "$PGUSER" psql -c "$1" 2>/dev/null; }
echo "PostgreSQL Diagnostic Report - $(date)"
# Connection Status
section "CONNECTION STATUS"
run_query "SELECT count(*) as total, state FROM pg_stat_activity GROUP BY state;"
# Active Queries
section "ACTIVE QUERIES"
run_query "
SELECT pid, now() - query_start AS duration, state, substring(query, 1, 80)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
ORDER BY duration DESC LIMIT 10;
"
# Long Running Queries (>30s)
section "LONG RUNNING QUERIES"
run_query "
SELECT pid, now() - query_start AS duration, usename, substring(query, 1, 100)
FROM pg_stat_activity
WHERE (now() - query_start) > interval '30 seconds' AND state = 'active'
ORDER BY duration DESC LIMIT 10;
"
# Blocked Queries (Locks)
section "LOCK ANALYSIS"
run_query "
SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.pid != blocked_locks.pid
WHERE NOT blocked_locks.granted LIMIT 10;
"
# Cache Hit Ratio
section "PERFORMANCE METRICS"
run_query "
SELECT sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100 as cache_hit_ratio
FROM pg_statio_user_tables;
"Interactive Troubleshooting Helper
Build an interactive diagnostic tool that routes to appropriate diagnostics based on the problem:
"Create an interactive troubleshooting script with menu-driven navigation that handles: server is slow, service won't start, high memory/CPU, disk space issues, network problems, container issues, and database problems. Suggest fixes based on findings."
#!/bin/bash
set -uo pipefail
# =============================================================================
# Interactive Troubleshooting Helper
# =============================================================================
declare -a FINDINGS=()
declare -a SUGGESTIONS=()
add_finding() { FINDINGS+=("$1"); echo -e "\033[0;33mFinding:\033[0m $1"; }
add_suggestion() { SUGGESTIONS+=("$1"); }
show_menu() {
echo ""
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "What problem are you experiencing?"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo " 1) Server is slow / high load"
echo " 2) Service won't start"
echo " 3) High memory usage"
echo " 4) High CPU usage"
echo " 5) Disk space issues"
echo " 6) Network connectivity problems"
echo " 7) Docker/container issues"
echo " 8) Database problems"
echo " 9) Run full diagnostic"
echo " 0) Exit"
echo ""
}
diagnose_slow_server() {
echo -e "\n\033[1mDiagnosing: Server Performance\033[0m\n"
# Load average
load_1m=$(uptime | grep -oE 'load average: [0-9., ]+' | cut -d: -f2 | cut -d, -f1 | tr -d ' ')
cpus=$(nproc)
echo "Load Average: $load_1m (CPUs: $cpus)"
(( $(echo "$load_1m > $cpus * 2" | bc -l) )) && {
add_finding "Very high load ($load_1m) - more than 2x CPU count"
add_suggestion "Identify CPU-intensive processes"
}
# Memory pressure
mem_avail=$(free -m | awk '/Mem:/ {print $7}')
mem_total=$(free -m | awk '/Mem:/ {print $2}')
mem_percent=$((100 - (mem_avail * 100 / mem_total)))
echo "Memory: ${mem_percent}% used (${mem_avail}MB available)"
[[ $mem_percent -gt 90 ]] && {
add_finding "Critical memory pressure ($mem_percent%)"
add_suggestion "Identify memory-hungry processes: ps aux --sort=-%mem | head"
}
# Top processes
echo ""
echo "Top CPU processes:"
ps aux --sort=-%cpu | head -6
}
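# Only one diagnostic is shown here; the other menu items follow the same
# finding/suggestion pattern. A minimal sketch (hypothetical, not from the
# generated script) of the disk-space check and the menu dispatch loop:
diagnose_disk_space() {
echo -e "\n\033[1mDiagnosing: Disk Space\033[0m\n"
df -h -x tmpfs -x devtmpfs
# Process substitution keeps add_finding in the current shell (no pipeline subshell)
while read -r mount; do
add_finding "$mount is over 90% full"
add_suggestion "Find large files: du -xh $mount 2>/dev/null | sort -rh | head -20"
done < <(df -P -x tmpfs -x devtmpfs | awk 'NR>1 {gsub(/%/,"",$5); if ($5+0 > 90) print $6}')
}
main_loop() {
while true; do
show_menu
read -rp "Choice: " choice
case "$choice" in
1) diagnose_slow_server; show_summary ;;
5) diagnose_disk_space; show_summary ;;
0) break ;;
*) echo "Not covered in this excerpt" ;;
esac
done
}
# main_loop would be invoked at the very end of the script: main_loop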
show_summary() {
echo ""
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "SUMMARY"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
if [[ ${#FINDINGS[@]} -gt 0 ]]; then
echo "Findings:"
for finding in "${FINDINGS[@]}"; do echo " • $finding"; done
fi
if [[ ${#SUGGESTIONS[@]} -gt 0 ]]; then
echo "Suggestions:"
for suggestion in "${SUGGESTIONS[@]}"; do echo " → $suggestion"; done
fi
}
Log Aggregation Pipeline
For servers with multiple services, aggregate logs into a queryable format:
#!/bin/bash
set -uo pipefail
# =============================================================================
# Log Aggregation Tool
# Collects, normalizes, and queries logs from multiple sources
# =============================================================================
DB_FILE="${LOG_DB:-/var/log/aggregated/logs.db}"
# Initialize SQLite database
init_db() {
mkdir -p "$(dirname "$DB_FILE")"  # ensure the database directory exists
sqlite3 "$DB_FILE" << 'SQL'
CREATE TABLE IF NOT EXISTS logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp DATETIME,
source TEXT,
level TEXT,
message TEXT,
host TEXT
);
CREATE INDEX IF NOT EXISTS idx_timestamp ON logs(timestamp);
CREATE INDEX IF NOT EXISTS idx_source ON logs(source);
CREATE INDEX IF NOT EXISTS idx_level ON logs(level);
SQL
}
# Ingest from journald
ingest_journald() {
local since="${1:-1h}"
journalctl --since "$since ago" -o json | while read -r line; do
local source=$(echo "$line" | jq -r '.SYSLOG_IDENTIFIER // "unknown"')
local message=$(echo "$line" | jq -r '.MESSAGE // empty')
local level=$(echo "$line" | jq -r '.PRIORITY // "6"')
# Map priority to level and insert...
done
}
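# The mapping and insert step is elided above; a minimal sketch of helpers it
# might use (hypothetical names, single-row inserts kept simple rather than fast):
priority_to_level() {
case "$1" in
0|1|2) echo "critical" ;;
3) echo "error" ;;
4) echo "warning" ;;
5|6) echo "info" ;;
*) echo "debug" ;;
esac
}
insert_log() {
local ts="$1" source="$2" level="$3" message="$4"
# Escape single quotes so the values are safe inside SQL string literals
message=${message//"'"/"''"}
source=${source//"'"/"''"}
sqlite3 "$DB_FILE" "INSERT INTO logs (timestamp, source, level, message, host) VALUES ('$ts', '$source', '$level', '$message', '$(hostname)');"
}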
# Query logs
query_logs() {
local query="SELECT datetime(timestamp), source, level, message FROM logs"
# Apply filters (-s source, -l level, -t time, -g grep)...
sqlite3 -header -column "$DB_FILE" "$query"
}
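# The usage examples below also call a cleanup command; a minimal sketch of it,
# deleting rows older than N days (hypothetical helper):
cleanup_old() {
local days="${1:-7}"
sqlite3 "$DB_FILE" "DELETE FROM logs WHERE timestamp < datetime('now', '-${days} days'); VACUUM;"
echo "Removed entries older than ${days} days"
}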
# Show statistics
show_stats() {
echo "=== Log Statistics ==="
sqlite3 "$DB_FILE" "SELECT source, COUNT(*) FROM logs GROUP BY source ORDER BY 2 DESC LIMIT 10;"
sqlite3 "$DB_FILE" "SELECT level, COUNT(*) FROM logs WHERE timestamp > datetime('now', '-24 hours') GROUP BY level;"
}
# Initialize database
./logagg.sh init
# Ingest logs
./logagg.sh ingest-journald 24h
./logagg.sh ingest-docker 1h
# Query aggregated logs
./logagg.sh query -l error -t 1h
./logagg.sh query -s nginx -g "500"
# Show statistics
./logagg.sh stats
# Cleanup old logs
./logagg.sh cleanup 7
Tips & Quick Reference
Troubleshooting Best Practices
- Start broad, narrow down — Run full diagnostics first, then drill into specifics
- Check the obvious first — Disk space, memory, load average
- Compare to baseline — Know what "normal" looks like
- Check recent changes — What was deployed? What was updated? (see the sketch after this list)
- Correlate timestamps — When did the problem start?
- Save reports — Document for future reference
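The "check recent changes" step is easy to script. Below is a rough sketch rather than a generated tool; it assumes a Debian-style host (dpkg logs under /var/log) and systemd, so adjust paths and commands for your distribution:
#!/bin/bash
# recent-changes.sh - quick "what changed recently?" snapshot (sketch)
set -uo pipefail
echo "=== Config files modified in the last 24h ==="
find /etc -type f -mtime -1 2>/dev/null | head -20
echo "=== Packages installed or upgraded today ==="
grep -hE "^$(date +%Y-%m-%d).* (install|upgrade) " /var/log/dpkg.log 2>/dev/null | tail -20
echo "=== Most recently (re)started services ==="
systemctl list-units --type=service --state=running --plain --no-pager --no-legend | awk '{print $1}' | while read -r svc; do
echo "$(systemctl show -p ActiveEnterTimestampMonotonic --value "$svc" 2>/dev/null) $svc"
done | sort -rn | head -10 | awk '{print $2}'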
Quick Reference: Log Analysis Prompts
| Need | Prompt Pattern |
|---|---|
| Log parser | "Create log parser for [format] with [time filtering/pattern matching]" |
| System diagnostics | "Create diagnostic script checking [resources/services/errors]" |
| App debugger | "Create debug script for [nginx/postgres/docker] showing [connections/queries/health]" |
| Interactive helper | "Create troubleshooting menu for [problem types] with suggestions" |
| Log aggregation | "Create log collector for [sources] stored in [SQLite/JSON] with querying" |
