Part 9 of 10

    Log Analysis & Troubleshooting Workflows

    • Log Parsing
    • System Diagnostics
    • Interactive Debugging

    At 2 AM when your server is misbehaving, you don't want to be frantically googling log parsing commands. You want answers: What's wrong? When did it start? What changed?

    This guide uses Claude Code to generate log analysis tools, diagnostic scripts, and troubleshooting workflows. We'll build a toolkit that turns "something's broken" into actionable insights in seconds.

    1

    Prerequisites

    • Claude Code installed (see Part 1)
    • A server with logs to analyze
    • Basic familiarity with common log formats
    2

    Universal Log Parser

    Let's start with a flexible log analysis tool that auto-detects formats and provides powerful filtering:

    "Create a universal log parser that auto-detects common log formats (syslog, nginx, Apache, JSON, Docker), extracts timestamps and severity levels, supports time-range filtering, provides frequency analysis, and includes presets for errors, traffic, auth, and slow requests."

    scripts/logparse.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # Universal Log Parser
    # =============================================================================
    
    VERSION="1.0.0"
    SCRIPT_NAME=$(basename "$0")
    
    # Colors
    RED='\033[0;31m'
    YELLOW='\033[0;33m'
    GREEN='\033[0;32m'
    NC='\033[0m'
    
    # Defaults
    OUTPUT_FORMAT="human"
    TIME_RANGE=""
    PATTERN=""
    PRESET=""
    FOLLOW=false
    LIMIT=100
    
    # Log Format Detection
    detect_format() {
        local sample="$1"
        
        # JSON format
        if echo "$sample" | head -1 | jq -e . >/dev/null 2>&1; then
            echo "json"; return
        fi
        
        # Nginx access log
        if echo "$sample" | grep -qE '^[0-9.]+ - .* \[.*\] "(GET|POST|PUT|DELETE|HEAD)'; then
            echo "nginx_access"; return
        fi
        
        # Syslog format
        if echo "$sample" | grep -qE '^[A-Z][a-z]{2} [ 0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}'; then
            echo "syslog"; return
        fi
        
        # Docker logs
        if echo "$sample" | grep -qE '^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z'; then
            echo "docker"; return
        fi
        
        echo "unknown"
    }
    
    # Preset Patterns
    apply_preset() {
        local preset="$1"
        case "$preset" in
            errors)
                PATTERN='(error|fail|fatal|critical|panic|exception|traceback)'
                ;;
            traffic)
                PATTERN='(GET|POST|PUT|DELETE|PATCH|HEAD|OPTIONS)'
                ;;
            auth)
                PATTERN='(auth|login|logout|password|session|sudo|sshd|pam_|accepted|failed)'
                ;;
            slow)
                PATTERN='([0-9]{4,}ms|slow|timeout|timed out|deadline exceeded)'
                ;;
        esac
    }
    
    # Colorize by severity
    colorize_line() {
        local line="$1"
        if echo "$line" | grep -qiE '(error|fail|fatal|critical|panic)'; then
            echo -e "${RED}${line}${NC}"
        elif echo "$line" | grep -qiE '(warn|warning)'; then
            echo -e "${YELLOW}${line}${NC}"
        elif echo "$line" | grep -qiE '(info|notice)'; then
            echo -e "${GREEN}${line}${NC}"
        else
            echo "$line"
        fi
    }
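    
    The excerpt above omits the time-range handling behind the -t flag. One way to turn a relative range like "1h" into an epoch cutoff might look like this (a sketch; the helper name is illustrative and assumes date +%s is available):
    
    # Hypothetical helper: convert "15m", "1h", "2d" into an epoch-seconds cutoff.
    # Lines whose parsed timestamp is older than the cutoff would be skipped.
    parse_time_range() {
        local range="$1" amount unit
        amount="${range%[smhd]}"
        unit="${range##*[0-9]}"
        case "$unit" in
            s) echo $(( $(date +%s) - amount )) ;;
            m) echo $(( $(date +%s) - amount * 60 )) ;;
            h) echo $(( $(date +%s) - amount * 3600 )) ;;
            d) echo $(( $(date +%s) - amount * 86400 )) ;;
            *) echo "Invalid time range: $range" >&2; return 1 ;;
        esac
    }
    
    cutoff=$(parse_time_range "1h")   # e.g. what -t 1h would resolve to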
    Usage examples
    # Show errors from syslog
    ./logparse.sh -p errors /var/log/syslog
    
    # Analyze last hour of auth failures
    ./logparse.sh -t 1h -g "failed" /var/log/auth.log
    
    # Traffic analysis with stats
    ./logparse.sh -p traffic -s /var/log/nginx/access.log
    
    # Follow logs in real-time
    ./logparse.sh -f --grep "ERROR" /var/log/app/app.log
    
    # Pipe from journalctl
    journalctl -u nginx | ./logparse.sh -p errors
    3

    System Diagnostic Script

    When something's wrong, gather all relevant information quickly—in under 30 seconds:

    "Create a comprehensive system diagnostic script that collects system info, lists recent errors from journald, shows resource-heavy processes, checks service status, reviews network connections, identifies recent changes, and generates a single report."

    scripts/diagnose.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # System Diagnostic Script
    # Quick system health check for troubleshooting
    # =============================================================================
    
    section() { echo -e "\n\033[1m━━━ $1 ━━━\033[0m\n"; }
    ok() { echo -e "\033[0;32m✓\033[0m $1"; }
    warn() { echo -e "\033[0;33m⚠\033[0m $1"; }
    error() { echo -e "\033[0;31m✗\033[0m $1"; }
    
    echo "╔════════════════════════════════════════════════════════════╗"
    echo "║           SYSTEM DIAGNOSTIC REPORT                        ║"
    echo "║  Generated: $(date '+%Y-%m-%d %H:%M:%S')                          ║"
    echo "╚════════════════════════════════════════════════════════════╝"
    
    # System Overview
    section "SYSTEM OVERVIEW"
    echo "Uptime: $(uptime -p)"
    load_1m=$(uptime | grep -oE 'load average: [0-9., ]+' | cut -d: -f2 | cut -d, -f1 | tr -d ' ')
    cpu_count=$(nproc)
    echo "Load Average: $load_1m (CPUs: $cpu_count)"
    [[ $(echo "$load_1m > $cpu_count" | bc -l) -eq 1 ]] && warn "Load is high!" || ok "Load is normal"
    
    # Memory
    free -h | head -2
    mem_percent=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')
    [[ "$mem_percent" -gt 90 ]] && error "Memory critical: ${mem_percent}%" || ok "Memory: ${mem_percent}%"
    
    # Disk Usage
    section "DISK USAGE"
    df -h -x tmpfs -x devtmpfs | head -10
    df -h -x tmpfs -x devtmpfs | awk 'NR>1 {gsub(/%/,"",$5); if($5+0>90) print "  ✗ " $6 " is " $5 "% full!"}'
    
    # Recent Errors
    section "RECENT ERRORS (Last Hour)"
    error_count=$(journalctl --since "1 hour ago" -p err -q --no-pager 2>/dev/null | wc -l)
    [[ "$error_count" -gt 0 ]] && warn "Found $error_count error entries" || ok "No errors"
    
    # OOM Kills
    oom_kills=$(journalctl --since "24 hours ago" 2>/dev/null | grep -c "Out of memory" || true)
    [[ "$oom_kills" -gt 0 ]] && error "OOM Killer invoked $oom_kills time(s)!"
    
    # Failed Services
    section "SERVICE STATUS"
    failed=$(systemctl --failed --no-pager --no-legend 2>/dev/null | wc -l)
    [[ "$failed" -gt 0 ]] && error "Failed services: $failed" || ok "All services running"
    
    # Docker (if installed)
    if command -v docker &>/dev/null; then
        section "DOCKER"
        docker ps -a --format "table {{.Names}}\t{{.Status}}" | head -10
        unhealthy=$(docker ps --filter "health=unhealthy" --format "{{.Names}}")
        [[ -n "$unhealthy" ]] && error "Unhealthy: $unhealthy"
    fi
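    
    A typical invocation (run with sudo so the journald and systemctl checks can see system-wide state):
    
    # Run the full health check
    sudo ./diagnose.sh
    
    # Keep the colors intact while paging through a long report
    sudo ./diagnose.sh | less -R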
    4

    Application-Specific Debuggers

    Generate debuggers for common applications with quick, actionable output:

    scripts/debug-nginx.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # Nginx Debug Script
    # =============================================================================
    
    section() { echo -e "\n\033[1m=== $1 ===\033[0m\n"; }
    
    echo "Nginx Diagnostic Report - $(date)"
    
    # Configuration Test
    section "CONFIGURATION"
    nginx -t 2>&1 && echo "✓ Config syntax valid" || echo "✗ Config has errors!"
    
    # Process Status
    section "PROCESS STATUS"
    [[ -n "$(pgrep nginx)" ]] && echo "✓ Nginx running" || { echo "✗ Nginx not running!"; exit 1; }
    
    # Active Connections
    section "ACTIVE CONNECTIONS"
    curl -sf http://127.0.0.1/nginx_status 2>/dev/null || \
        echo "Established: $(ss -tn state established '( sport = :80 or sport = :443 )' | wc -l)"
    
    # Recent Errors
    section "RECENT ERRORS"
    error_log="/var/log/nginx/error.log"
    [[ -f "$error_log" ]] && tail -100 "$error_log" | grep -cE '\[(error|crit)\]'
    
    # Upstream Health
    section "UPSTREAM STATUS"
    grep -rhoE 'proxy_pass http://[^;]+' /etc/nginx/ 2>/dev/null | \
        sed 's/proxy_pass //' | sort -u | while read -r upstream; do
        host=$(echo "$upstream" | sed 's|http://||' | cut -d/ -f1)
        port=${host#*:}; [[ "$port" == "$host" ]] && port=80   # no explicit port: assume 80
        timeout 2 bash -c "echo >/dev/tcp/${host%%:*}/${port}" 2>/dev/null \
            && echo "✓ $host reachable" || echo "✗ $host unreachable"
    done
    scripts/debug-postgres.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # PostgreSQL Debug Script
    # =============================================================================
    
    PGUSER="${PGUSER:-postgres}"   # OS/database user used for queries (peer auth)
    section() { echo -e "\n\033[1m=== $1 ===\033[0m\n"; }
    run_query() { sudo -u "$PGUSER" psql -c "$1" 2>/dev/null; }
    
    echo "PostgreSQL Diagnostic Report - $(date)"
    
    # Connection Status
    section "CONNECTION STATUS"
    run_query "SELECT count(*) as total, state FROM pg_stat_activity GROUP BY state;"
    
    # Active Queries
    section "ACTIVE QUERIES"
    run_query "
    SELECT pid, now() - query_start AS duration, state, substring(query, 1, 80)
    FROM pg_stat_activity
    WHERE state = 'active' AND pid <> pg_backend_pid()
    ORDER BY duration DESC LIMIT 10;
    "
    
    # Long Running Queries (>30s)
    section "LONG RUNNING QUERIES"
    run_query "
    SELECT pid, now() - query_start AS duration, usename, substring(query, 1, 100)
    FROM pg_stat_activity
    WHERE (now() - query_start) > interval '30 seconds' AND state = 'active'
    ORDER BY duration DESC LIMIT 10;
    "
    
    # Blocked Queries (Locks)
    section "LOCK ANALYSIS"
    run_query "
    SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid
    FROM pg_catalog.pg_locks blocked_locks
    JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.pid != blocked_locks.pid
    WHERE NOT blocked_locks.granted LIMIT 10;
    "
    
    # Cache Hit Ratio
    section "PERFORMANCE METRICS"
    run_query "
    SELECT sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100 as cache_hit_ratio
    FROM pg_statio_user_tables;
    "
    5

    Interactive Troubleshooting Helper

    Build an interactive tool that routes you to the right diagnostics based on the problem you describe:

    "Create an interactive troubleshooting script with menu-driven navigation that handles: server is slow, service won't start, high memory/CPU, disk space issues, network problems, container issues, and database problems. Suggest fixes based on findings."

    scripts/troubleshoot.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # Interactive Troubleshooting Helper
    # =============================================================================
    
    declare -a FINDINGS=()
    declare -a SUGGESTIONS=()
    
    add_finding() { FINDINGS+=("$1"); echo -e "\033[0;33mFinding:\033[0m $1"; }
    add_suggestion() { SUGGESTIONS+=("$1"); }
    
    show_menu() {
        echo ""
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        echo "What problem are you experiencing?"
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        echo "  1) Server is slow / high load"
        echo "  2) Service won't start"
        echo "  3) High memory usage"
        echo "  4) High CPU usage"
        echo "  5) Disk space issues"
        echo "  6) Network connectivity problems"
        echo "  7) Docker/container issues"
        echo "  8) Database problems"
        echo "  9) Run full diagnostic"
        echo "  0) Exit"
        echo ""
    }
    
    diagnose_slow_server() {
        echo -e "\n\033[1mDiagnosing: Server Performance\033[0m\n"
        
        # Load average
        load_1m=$(uptime | grep -oE 'load average: [0-9., ]+' | cut -d: -f2 | cut -d, -f1 | tr -d ' ')
        cpus=$(nproc)
        echo "Load Average: $load_1m (CPUs: $cpus)"
        
        (( $(echo "$load_1m > $cpus * 2" | bc -l) )) && {
            add_finding "Very high load ($load_1m) - more than 2x CPU count"
            add_suggestion "Identify CPU-intensive processes"
        }
        
        # Memory pressure
        mem_avail=$(free -m | awk '/Mem:/ {print $7}')
        mem_total=$(free -m | awk '/Mem:/ {print $2}')
        mem_percent=$((100 - (mem_avail * 100 / mem_total)))
        echo "Memory: ${mem_percent}% used (${mem_avail}MB available)"
        
        [[ $mem_percent -gt 90 ]] && {
            add_finding "Critical memory pressure ($mem_percent%)"
            add_suggestion "Identify memory-hungry processes: ps aux --sort=-%mem | head"
        }
        
        # Top processes
        echo ""
        echo "Top CPU processes:"
        ps aux --sort=-%cpu | head -6
    }
    
    show_summary() {
        echo ""
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        echo "SUMMARY"
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        
        if [[ ${#FINDINGS[@]} -gt 0 ]]; then
            echo "Findings:"
            for finding in "${FINDINGS[@]}"; do echo "  • $finding"; done
        fi
        
        if [[ ${#SUGGESTIONS[@]} -gt 0 ]]; then
            echo "Suggestions:"
            for suggestion in "${SUGGESTIONS[@]}"; do echo "  → $suggestion"; done
        fi
    }
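    
    The excerpt omits the main loop that ties the menu to the handlers; it might look roughly like this (only diagnose_slow_server is shown above, the other diagnose_* functions would follow the same shape):
    
    # Main loop: show the menu, dispatch to a handler, then print the summary on exit
    main() {
        while true; do
            show_menu
            read -rp "Select an option: " choice
            case "$choice" in
                1) diagnose_slow_server ;;
                # 2-8 would dispatch to their own diagnose_* handlers
                9) diagnose_slow_server ;;   # a full run would call every handler in turn
                0) break ;;
                *) echo "Invalid choice" ;;
            esac
        done
        show_summary
    }
    
    main "$@"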
    6

    Log Aggregation Pipeline

    For servers with multiple services, aggregate logs into a queryable format:

    scripts/logagg.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # Log Aggregation Tool
    # Collects, normalizes, and queries logs from multiple sources
    # =============================================================================
    
    DB_FILE="${LOG_DB:-/var/log/aggregated/logs.db}"
    
    # Initialize SQLite database
    init_db() {
        mkdir -p "$(dirname "$DB_FILE")"
        sqlite3 "$DB_FILE" << 'SQL'
    CREATE TABLE IF NOT EXISTS logs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp DATETIME,
        source TEXT,
        level TEXT,
        message TEXT,
        host TEXT
    );
    CREATE INDEX IF NOT EXISTS idx_timestamp ON logs(timestamp);
    CREATE INDEX IF NOT EXISTS idx_source ON logs(source);
    CREATE INDEX IF NOT EXISTS idx_level ON logs(level);
    SQL
    }
    
    # Ingest from journald
    ingest_journald() {
        local since="${1:-1h}"
        journalctl --since "$since ago" -o json | while read -r line; do
            local source=$(echo "$line" | jq -r '.SYSLOG_IDENTIFIER // "unknown"')
            local message=$(echo "$line" | jq -r '.MESSAGE // empty')
            local level=$(echo "$line" | jq -r '.PRIORITY // "6"')
            # Map priority to level and insert...
        done
    }
    
    # Query logs
    query_logs() {
        local query="SELECT datetime(timestamp), source, level, message FROM logs"
        # Apply filters (-s source, -l level, -t time, -g grep)...
        sqlite3 -header -column "$DB_FILE" "$query"
    }
    
    # Show statistics
    show_stats() {
        echo "=== Log Statistics ==="
        sqlite3 "$DB_FILE" "SELECT source, COUNT(*) FROM logs GROUP BY source ORDER BY 2 DESC LIMIT 10;"
        sqlite3 "$DB_FILE" "SELECT level, COUNT(*) FROM logs WHERE timestamp > datetime('now', '-24 hours') GROUP BY level;"
    }
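    
    The elided mapping and insert step inside ingest_journald might look like this: translate the numeric journald priority to a level name, keep the entry's own timestamp, and escape single quotes so the message is safe to embed in the SQL statement (a sketch; in practice you would batch inserts in one transaction rather than call sqlite3 once per line):
    
        # Inside the read loop: map priority, escape the message, insert the row
        local ts_us=$(echo "$line" | jq -r '.__REALTIME_TIMESTAMP // 0')
        case "$level" in
            0|1|2) level="critical" ;;
            3)     level="error" ;;
            4)     level="warning" ;;
            5|6)   level="info" ;;
            *)     level="debug" ;;
        esac
        message=$(printf '%s' "$message" | sed "s/'/''/g")   # double up single quotes for SQLite
        sqlite3 "$DB_FILE" "INSERT INTO logs (timestamp, source, level, message, host)
            VALUES (datetime($((ts_us / 1000000)), 'unixepoch'), '$source', '$level', '$message', '$(hostname)');"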
    Usage examples
    # Initialize database
    ./logagg.sh init
    
    # Ingest logs
    ./logagg.sh ingest-journald 24h
    ./logagg.sh ingest-docker 1h
    
    # Query aggregated logs
    ./logagg.sh query -l error -t 1h
    ./logagg.sh query -s nginx -g "500"
    
    # Show statistics
    ./logagg.sh stats
    
    # Cleanup old logs
    ./logagg.sh cleanup 7
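    
    The cleanup subcommand in the last example could reduce to a dated DELETE plus a VACUUM to reclaim the space (a sketch; days are passed as the first argument):
    
    # Hypothetical cleanup handler: drop rows older than N days, then reclaim disk space
    cleanup_logs() {
        local days="${1:-7}"
        sqlite3 "$DB_FILE" "DELETE FROM logs WHERE timestamp < datetime('now', '-${days} days');"
        sqlite3 "$DB_FILE" "VACUUM;"
    }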
    7

    Tips & Quick Reference

    Troubleshooting Best Practices

    • Start broad, narrow down — Run full diagnostics first, then drill into specifics
    • Check the obvious first — Disk space, memory, load average
    • Compare to baseline — Know what "normal" looks like (see the sketch after this list)
    • Check recent changes — What was deployed? What was updated?
    • Correlate timestamps — When did the problem start?
    • Save reports — Document for future reference
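    
    A baseline only helps if you capture it while the system is healthy. One way to do that with the diagnostic script from step 3 (paths are illustrative; the sed strips color codes so diffs stay readable):
    
    # Capture a known-good report once, while everything works
    mkdir -p /var/log/diagnostics
    sudo ./diagnose.sh | sed 's/\x1b\[[0-9;]*m//g' > /var/log/diagnostics/baseline.txt
    
    # During an incident: compare current state against the baseline
    sudo ./diagnose.sh | sed 's/\x1b\[[0-9;]*m//g' > /tmp/current.txt
    diff /var/log/diagnostics/baseline.txt /tmp/current.txt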

    Quick Reference: Log Analysis Prompts

    Need                   Prompt Pattern
    Log parser             "Create log parser for [format] with [time filtering/pattern matching]"
    System diagnostics     "Create diagnostic script checking [resources/services/errors]"
    App debugger           "Create debug script for [nginx/postgres/docker] showing [connections/queries/health]"
    Interactive helper     "Create troubleshooting menu for [problem types] with suggestions"
    Log aggregation        "Create log collector for [sources] stored in [SQLite/JSON] with querying"