Part 9 of 10

    Log Analysis & Troubleshooting Workflows

    • Log Parsing
    • System Diagnostics
    • Interactive Debugging

    At 2 AM when your server is misbehaving, you don't want to be frantically googling log parsing commands. You want answers: What's wrong? When did it start? What changed?

    This guide uses Claude Code to generate log analysis tools, diagnostic scripts, and troubleshooting workflows. We'll build a toolkit that turns "something's broken" into actionable insights in seconds.

    1

    Prerequisites

    • Claude Code installed (see Part 1)
    • A server with logs to analyze
    • Basic familiarity with common log formats
    2

    Universal Log Parser

    Let's start with a flexible log analysis tool that auto-detects formats and provides powerful filtering:

    "Create a universal log parser that auto-detects common log formats (syslog, nginx, Apache, JSON, Docker), extracts timestamps and severity levels, supports time-range filtering, provides frequency analysis, and includes presets for errors, traffic, auth, and slow requests."

    scripts/logparse.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # Universal Log Parser
    # =============================================================================
    
    VERSION="1.0.0"
    SCRIPT_NAME=$(basename "$0")
    
    # Colors
    RED='\033[0;31m'
    YELLOW='\033[0;33m'
    GREEN='\033[0;32m'
    NC='\033[0m'
    
    # Defaults
    OUTPUT_FORMAT="human"
    TIME_RANGE=""
    PATTERN=""
    PRESET=""
    FOLLOW=false
    LIMIT=100
    
    # Log Format Detection
    detect_format() {
        local sample="$1"
        
        # JSON format
        if echo "$sample" | head -1 | jq -e . >/dev/null 2>&1; then
            echo "json"; return
        fi
        
        # Nginx access log
        if echo "$sample" | grep -qE '^[0-9.]+ - .* \[.*\] "(GET|POST|PUT|DELETE|HEAD)'; then
            echo "nginx_access"; return
        fi
        
        # Syslog format
        if echo "$sample" | grep -qE '^[A-Z][a-z]{2} [ 0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}'; then
            echo "syslog"; return
        fi
        
        # Docker logs
        if echo "$sample" | grep -qE '^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z'; then
            echo "docker"; return
        fi
        
        echo "unknown"
    }
    
    # Preset Patterns
    apply_preset() {
        local preset="$1"
        case "$preset" in
            errors)
                PATTERN='(error|fail|fatal|critical|panic|exception|traceback)'
                ;;
            traffic)
                PATTERN='(GET|POST|PUT|DELETE|PATCH|HEAD|OPTIONS)'
                ;;
            auth)
                PATTERN='(auth|login|logout|password|session|sudo|sshd|pam_|accepted|failed)'
                ;;
            slow)
                PATTERN='([0-9]{4,}ms|slow|timeout|timed out|deadline exceeded)'
                ;;
        esac
    }
    
    # Colorize by severity
    colorize_line() {
        local line="$1"
        if echo "$line" | grep -qiE '(error|fail|fatal|critical|panic)'; then
            echo -e "${RED}${line}${NC}"
        elif echo "$line" | grep -qiE '(warn|warning)'; then
            echo -e "${YELLOW}${line}${NC}"
        elif echo "$line" | grep -qiE '(info|notice)'; then
            echo -e "${GREEN}${line}${NC}"
        else
            echo "$line"
        fi
    }
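    
    The excerpt above omits the time-range handling behind the -t flag. One way to turn a relative range like "1h" into an epoch cutoff might look like this (a sketch; the helper name is illustrative and assumes date +%s is available):
    
    # Hypothetical helper: convert "15m", "1h", "2d" into an epoch-seconds cutoff.
    # Lines whose parsed timestamp is older than the cutoff would be skipped.
    parse_time_range() {
        local range="$1" amount unit
        amount="${range%[smhd]}"
        unit="${range##*[0-9]}"
        case "$unit" in
            s) echo $(( $(date +%s) - amount )) ;;
            m) echo $(( $(date +%s) - amount * 60 )) ;;
            h) echo $(( $(date +%s) - amount * 3600 )) ;;
            d) echo $(( $(date +%s) - amount * 86400 )) ;;
            *) echo "Invalid time range: $range" >&2; return 1 ;;
        esac
    }
    
    cutoff=$(parse_time_range "1h")   # e.g. what -t 1h would resolve to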
    Usage examples
    # Show errors from syslog
    ./logparse.sh -p errors /var/log/syslog
    
    # Analyze last hour of auth failures
    ./logparse.sh -t 1h -g "failed" /var/log/auth.log
    
    # Traffic analysis with stats
    ./logparse.sh -p traffic -s /var/log/nginx/access.log
    
    # Follow logs in real-time
    ./logparse.sh -f --grep "ERROR" /var/log/app/app.log
    
    # Pipe from journalctl
    journalctl -u nginx | ./logparse.sh -p errors
    3

    System Diagnostic Script

    When something's wrong, gather all relevant information quickly—in under 30 seconds:

    "Create a comprehensive system diagnostic script that collects system info, lists recent errors from journald, shows resource-heavy processes, checks service status, reviews network connections, identifies recent changes, and generates a single report."

    scripts/diagnose.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # System Diagnostic Script
    # Quick system health check for troubleshooting
    # =============================================================================
    
    section() { echo -e "\n\033[1m━━━ $1 ━━━\033[0m\n"; }
    ok() { echo -e "\033[0;32m✓\033[0m $1"; }
    warn() { echo -e "\033[0;33m⚠\033[0m $1"; }
    error() { echo -e "\033[0;31m✗\033[0m $1"; }
    
    echo "╔════════════════════════════════════════════════════════════╗"
    echo "║           SYSTEM DIAGNOSTIC REPORT                        ║"
    echo "║  Generated: $(date '+%Y-%m-%d %H:%M:%S')                          ║"
    echo "╚════════════════════════════════════════════════════════════╝"
    
    # System Overview
    section "SYSTEM OVERVIEW"
    echo "Uptime: $(uptime -p)"
    load_1m=$(uptime | grep -oE 'load average: [0-9., ]+' | cut -d: -f2 | cut -d, -f1 | tr -d ' ')
    cpu_count=$(nproc)
    echo "Load Average: $load_1m (CPUs: $cpu_count)"
    [[ $(echo "$load_1m > $cpu_count" | bc -l) -eq 1 ]] && warn "Load is high!" || ok "Load is normal"
    
    # Memory
    free -h | head -2
    mem_percent=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')
    [[ "$mem_percent" -gt 90 ]] && error "Memory critical: ${mem_percent}%" || ok "Memory: ${mem_percent}%"
    
    # Disk Usage
    section "DISK USAGE"
    df -h -x tmpfs -x devtmpfs | head -10
    df -h -x tmpfs -x devtmpfs | awk 'NR>1 {gsub(/%/,"",$5); if($5+0>90) print "  ✗ " $6 " is " $5 "% full!"}'
    
    # Recent Errors
    section "RECENT ERRORS (Last Hour)"
    error_count=$(journalctl --since "1 hour ago" -p err -q --no-pager 2>/dev/null | wc -l)
    [[ "$error_count" -gt 0 ]] && warn "Found $error_count error entries" || ok "No errors"
    
    # OOM Kills
    oom_kills=$(journalctl --since "24 hours ago" 2>/dev/null | grep -c "Out of memory" || true)
    [[ "$oom_kills" -gt 0 ]] && error "OOM Killer invoked $oom_kills time(s)!"
    
    # Failed Services
    section "SERVICE STATUS"
    failed=$(systemctl --failed --no-pager --no-legend 2>/dev/null | wc -l)
    [[ "$failed" -gt 0 ]] && error "Failed services: $failed" || ok "All services running"
    
    # Docker (if installed)
    if command -v docker &>/dev/null; then
        section "DOCKER"
        docker ps -a --format "table {{.Names}}\t{{.Status}}" | head -10
        unhealthy=$(docker ps --filter "health=unhealthy" --format "{{.Names}}")
        [[ -n "$unhealthy" ]] && error "Unhealthy: $unhealthy"
    fi
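    
    A typical invocation (run with sudo so the journald and systemctl checks can see system-wide state):
    
    # Run the full health check
    sudo ./diagnose.sh
    
    # Keep the colors intact while paging through a long report
    sudo ./diagnose.sh | less -R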
    4

    Application-Specific Debuggers

    Generate debuggers for common applications with quick, actionable output:

    scripts/debug-nginx.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # Nginx Debug Script
    # =============================================================================
    
    section() { echo -e "\n\033[1m=== $1 ===\033[0m\n"; }
    
    echo "Nginx Diagnostic Report - $(date)"
    
    # Configuration Test
    section "CONFIGURATION"
    nginx -t 2>&1 && echo "✓ Config syntax valid" || echo "✗ Config has errors!"
    
    # Process Status
    section "PROCESS STATUS"
    [[ -n "$(pgrep nginx)" ]] && echo "✓ Nginx running" || { echo "✗ Nginx not running!"; exit 1; }
    
    # Active Connections
    section "ACTIVE CONNECTIONS"
    curl -sf http://127.0.0.1/nginx_status 2>/dev/null || \
        echo "Established: $(ss -tn state established '( sport = :80 or sport = :443 )' | wc -l)"
    
    # Recent Errors
    section "RECENT ERRORS"
    error_log="/var/log/nginx/error.log"
    [[ -f "$error_log" ]] && tail -100 "$error_log" | grep -cE '\[(error|crit)\]'
    
    # Upstream Health
    section "UPSTREAM STATUS"
    grep -rhoE 'proxy_pass http://[^;]+' /etc/nginx/ 2>/dev/null | \
        sed 's/proxy_pass //' | sort -u | while read -r upstream; do
        host=$(echo "$upstream" | sed 's|http://||' | cut -d/ -f1)
        port=${host#*:}; [[ "$port" == "$host" ]] && port=80   # no explicit port: assume 80
        timeout 2 bash -c "echo >/dev/tcp/${host%%:*}/${port}" 2>/dev/null \
            && echo "✓ $host reachable" || echo "✗ $host unreachable"
    done
    scripts/debug-postgres.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # PostgreSQL Debug Script
    # =============================================================================
    
    PGUSER="${PGUSER:-postgres}"   # OS/database user used for queries (peer auth)
    section() { echo -e "\n\033[1m=== $1 ===\033[0m\n"; }
    run_query() { sudo -u "$PGUSER" psql -c "$1" 2>/dev/null; }
    
    echo "PostgreSQL Diagnostic Report - $(date)"
    
    # Connection Status
    section "CONNECTION STATUS"
    run_query "SELECT count(*) as total, state FROM pg_stat_activity GROUP BY state;"
    
    # Active Queries
    section "ACTIVE QUERIES"
    run_query "
    SELECT pid, now() - query_start AS duration, state, substring(query, 1, 80)
    FROM pg_stat_activity
    WHERE state = 'active' AND pid <> pg_backend_pid()
    ORDER BY duration DESC LIMIT 10;
    "
    
    # Long Running Queries (>30s)
    section "LONG RUNNING QUERIES"
    run_query "
    SELECT pid, now() - query_start AS duration, usename, substring(query, 1, 100)
    FROM pg_stat_activity
    WHERE (now() - query_start) > interval '30 seconds' AND state = 'active'
    ORDER BY duration DESC LIMIT 10;
    "
    
    # Blocked Queries (Locks)
    section "LOCK ANALYSIS"
    run_query "
    SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid
    FROM pg_catalog.pg_locks blocked_locks
    JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.pid != blocked_locks.pid
    WHERE NOT blocked_locks.granted LIMIT 10;
    "
    
    # Cache Hit Ratio
    section "PERFORMANCE METRICS"
    run_query "
    SELECT sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100 as cache_hit_ratio
    FROM pg_statio_user_tables;
    "
    5

    Interactive Troubleshooting Helper

    Build an interactive tool that routes you to the right diagnostics based on the problem you describe:

    "Create an interactive troubleshooting script with menu-driven navigation that handles: server is slow, service won't start, high memory/CPU, disk space issues, network problems, container issues, and database problems. Suggest fixes based on findings."

    scripts/troubleshoot.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # Interactive Troubleshooting Helper
    # =============================================================================
    
    declare -a FINDINGS=()
    declare -a SUGGESTIONS=()
    
    add_finding() { FINDINGS+=("$1"); echo -e "\033[0;33mFinding:\033[0m $1"; }
    add_suggestion() { SUGGESTIONS+=("$1"); }
    
    show_menu() {
        echo ""
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        echo "What problem are you experiencing?"
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        echo "  1) Server is slow / high load"
        echo "  2) Service won't start"
        echo "  3) High memory usage"
        echo "  4) High CPU usage"
        echo "  5) Disk space issues"
        echo "  6) Network connectivity problems"
        echo "  7) Docker/container issues"
        echo "  8) Database problems"
        echo "  9) Run full diagnostic"
        echo "  0) Exit"
        echo ""
    }
    
    diagnose_slow_server() {
        echo -e "\n\033[1mDiagnosing: Server Performance\033[0m\n"
        
        # Load average
        load_1m=$(uptime | grep -oE 'load average: [0-9., ]+' | cut -d: -f2 | cut -d, -f1 | tr -d ' ')
        cpus=$(nproc)
        echo "Load Average: $load_1m (CPUs: $cpus)"
        
        (( $(echo "$load_1m > $cpus * 2" | bc -l) )) && {
            add_finding "Very high load ($load_1m) - more than 2x CPU count"
            add_suggestion "Identify CPU-intensive processes"
        }
        
        # Memory pressure
        mem_avail=$(free -m | awk '/Mem:/ {print $7}')
        mem_total=$(free -m | awk '/Mem:/ {print $2}')
        mem_percent=$((100 - (mem_avail * 100 / mem_total)))
        echo "Memory: ${mem_percent}% used (${mem_avail}MB available)"
        
        [[ $mem_percent -gt 90 ]] && {
            add_finding "Critical memory pressure ($mem_percent%)"
            add_suggestion "Identify memory-hungry processes: ps aux --sort=-%mem | head"
        }
        
        # Top processes
        echo ""
        echo "Top CPU processes:"
        ps aux --sort=-%cpu | head -6
    }
    
    show_summary() {
        echo ""
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        echo "SUMMARY"
        echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
        
        if [[ ${#FINDINGS[@]} -gt 0 ]]; then
            echo "Findings:"
            for finding in "${FINDINGS[@]}"; do echo "  • $finding"; done
        fi
        
        if [[ ${#SUGGESTIONS[@]} -gt 0 ]]; then
            echo "Suggestions:"
            for suggestion in "${SUGGESTIONS[@]}"; do echo "  → $suggestion"; done
        fi
    }
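    
    The excerpt omits the main loop that ties the menu to the handlers; it might look roughly like this (only diagnose_slow_server is shown above, the other diagnose_* functions would follow the same shape):
    
    # Main loop: show the menu, dispatch to a handler, then print the summary on exit
    main() {
        while true; do
            show_menu
            read -rp "Select an option: " choice
            case "$choice" in
                1) diagnose_slow_server ;;
                # 2-8 would dispatch to their own diagnose_* handlers
                9) diagnose_slow_server ;;   # a full run would call every handler in turn
                0) break ;;
                *) echo "Invalid choice" ;;
            esac
        done
        show_summary
    }
    
    main "$@"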
    6

    Log Aggregation Pipeline

    For servers with multiple services, aggregate logs into a queryable format:

    scripts/logagg.sh (excerpt)
    #!/bin/bash
    set -uo pipefail
    
    # =============================================================================
    # Log Aggregation Tool
    # Collects, normalizes, and queries logs from multiple sources
    # =============================================================================
    
    DB_FILE="${LOG_DB:-/var/log/aggregated/logs.db}"
    
    # Initialize SQLite database
    init_db() {
        mkdir -p "$(dirname "$DB_FILE")"
        sqlite3 "$DB_FILE" << 'SQL'
    CREATE TABLE IF NOT EXISTS logs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp DATETIME,
        source TEXT,
        level TEXT,
        message TEXT,
        host TEXT
    );
    CREATE INDEX IF NOT EXISTS idx_timestamp ON logs(timestamp);
    CREATE INDEX IF NOT EXISTS idx_source ON logs(source);
    CREATE INDEX IF NOT EXISTS idx_level ON logs(level);
    SQL
    }
    
    # Ingest from journald
    ingest_journald() {
        local since="${1:-1h}"
        journalctl --since "$since ago" -o json | while read -r line; do
            local source=$(echo "$line" | jq -r '.SYSLOG_IDENTIFIER // "unknown"')
            local message=$(echo "$line" | jq -r '.MESSAGE // empty')
            local level=$(echo "$line" | jq -r '.PRIORITY // "6"')
            # Map priority to level and insert...
        done
    }
    
    # Query logs
    query_logs() {
        local query="SELECT datetime(timestamp), source, level, message FROM logs"
        # Apply filters (-s source, -l level, -t time, -g grep)...
        sqlite3 -header -column "$DB_FILE" "$query"
    }
    
    # Show statistics
    show_stats() {
        echo "=== Log Statistics ==="
        sqlite3 "$DB_FILE" "SELECT source, COUNT(*) FROM logs GROUP BY source ORDER BY 2 DESC LIMIT 10;"
        sqlite3 "$DB_FILE" "SELECT level, COUNT(*) FROM logs WHERE timestamp > datetime('now', '-24 hours') GROUP BY level;"
    }
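    
    The elided mapping and insert step inside ingest_journald might look like this: translate the numeric journald priority to a level name, keep the entry's own timestamp, and escape single quotes so the message is safe to embed in the SQL statement (a sketch; in practice you would batch inserts in one transaction rather than call sqlite3 once per line):
    
        # Inside the read loop: map priority, escape the message, insert the row
        local ts_us=$(echo "$line" | jq -r '.__REALTIME_TIMESTAMP // 0')
        case "$level" in
            0|1|2) level="critical" ;;
            3)     level="error" ;;
            4)     level="warning" ;;
            5|6)   level="info" ;;
            *)     level="debug" ;;
        esac
        message=$(printf '%s' "$message" | sed "s/'/''/g")   # double up single quotes for SQLite
        sqlite3 "$DB_FILE" "INSERT INTO logs (timestamp, source, level, message, host)
            VALUES (datetime($((ts_us / 1000000)), 'unixepoch'), '$source', '$level', '$message', '$(hostname)');"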
    Usage examples
    # Initialize database
    ./logagg.sh init
    
    # Ingest logs
    ./logagg.sh ingest-journald 24h
    ./logagg.sh ingest-docker 1h
    
    # Query aggregated logs
    ./logagg.sh query -l error -t 1h
    ./logagg.sh query -s nginx -g "500"
    
    # Show statistics
    ./logagg.sh stats
    
    # Cleanup old logs
    ./logagg.sh cleanup 7
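    
    The cleanup subcommand in the last example could reduce to a dated DELETE plus a VACUUM to reclaim the space (a sketch; days are passed as the first argument):
    
    # Hypothetical cleanup handler: drop rows older than N days, then reclaim disk space
    cleanup_logs() {
        local days="${1:-7}"
        sqlite3 "$DB_FILE" "DELETE FROM logs WHERE timestamp < datetime('now', '-${days} days');"
        sqlite3 "$DB_FILE" "VACUUM;"
    }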
    7

    Tips & Quick Reference

    Troubleshooting Best Practices

    • Start broad, narrow down — Run full diagnostics first, then drill into specifics
    • Check the obvious first — Disk space, memory, load average
    • Compare to baseline — Know what "normal" looks like (see the sketch after this list)
    • Check recent changes — What was deployed? What was updated?
    • Correlate timestamps — When did the problem start?
    • Save reports — Document for future reference
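    
    A baseline only helps if you capture it while the system is healthy. One way to do that with the diagnostic script from step 3 (paths are illustrative; the sed strips color codes so diffs stay readable):
    
    # Capture a known-good report once, while everything works
    mkdir -p /var/log/diagnostics
    sudo ./diagnose.sh | sed 's/\x1b\[[0-9;]*m//g' > /var/log/diagnostics/baseline.txt
    
    # During an incident: compare current state against the baseline
    sudo ./diagnose.sh | sed 's/\x1b\[[0-9;]*m//g' > /tmp/current.txt
    diff /var/log/diagnostics/baseline.txt /tmp/current.txt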

    Quick Reference: Log Analysis Prompts

    Need                   Prompt Pattern
    Log parser             "Create log parser for [format] with [time filtering/pattern matching]"
    System diagnostics     "Create diagnostic script checking [resources/services/errors]"
    App debugger           "Create debug script for [nginx/postgres/docker] showing [connections/queries/health]"
    Interactive helper     "Create troubleshooting menu for [problem types] with suggestions"
    Log aggregation        "Create log collector for [sources] stored in [SQLite/JSON] with querying"