Part 6 of 7

    Kubernetes in Production

    Monitoring, scaling, and reliability—the operational practices that keep Kubernetes clusters healthy.

    Prometheus
    Grafana
    Helm
    HPA
    Velero

    A running Kubernetes cluster is just the beginning. Production readiness means knowing what's happening inside your cluster, responding to problems before users notice, scaling to meet demand, and recovering when things go wrong.

    This guide covers the operational practices that keep Kubernetes clusters healthy: monitoring with Prometheus and Grafana, alerting, horizontal pod autoscaling, Helm for package management, backup strategies, and disaster recovery.

    1

    Monitoring with Prometheus and Grafana

    Kubernetes generates a wealth of metrics. Prometheus collects and stores them; Grafana visualizes them. Together, they're the standard monitoring stack for Kubernetes.

    Installing the kube-prometheus-stack

    The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, and pre-configured dashboards.

    Install Helm and add repository
    # Install Helm
    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    
    # Add Prometheus community repository
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    monitoring-values.yaml
    grafana:
      adminPassword: "your-secure-password"
      persistence:
        enabled: true
        size: 5Gi
      ingress:
        enabled: true
        annotations:
          cert-manager.io/cluster-issuer: letsencrypt-prod
        hosts:
          - grafana.yourdomain.com
        tls:
          - secretName: grafana-tls
            hosts:
              - grafana.yourdomain.com
    
    prometheus:
      prometheusSpec:
        retention: 15d
        storageSpec:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 20Gi
    
    alertmanager:
      alertmanagerSpec:
        storage:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 5Gi
    Install the monitoring stack
    kubectl create namespace monitoring
    helm install kube-prometheus prometheus-community/kube-prometheus-stack \
      -n monitoring \
      -f monitoring-values.yaml
    
    # Wait for pods
    kubectl get pods -n monitoring -w
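
    If the Grafana ingress isn't reachable yet (for example, DNS is still propagating), you can reach Grafana through a port-forward instead. The service name below assumes the release name kube-prometheus from the install command above; confirm it with kubectl get svc -n monitoring.

    Access Grafana without ingress (assumed service name)
    # Forward local port 3000 to the Grafana service
    kubectl port-forward -n monitoring svc/kube-prometheus-grafana 3000:80

    # Open http://localhost:3000 and log in as admin
    # with the adminPassword from monitoring-values.yaml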

    Key Metrics to Watch

    Cluster-level

    • node_cpu_seconds_total: CPU usage per node
    • node_memory_MemAvailable_bytes: Available memory
    • node_filesystem_avail_bytes: Disk space

    Kubernetes-level

    • kube_pod_status_phase: Pod states
    • kube_deployment_status_replicas_unavailable: Deployment replicas that are unavailable
    • container_memory_working_set_bytes: Memory actively in use per container
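
    A few example PromQL queries built from these metrics, usable in Grafana's Explore view or the Prometheus UI (a sketch; adjust label filters to your environment):

    Example PromQL queries
    # CPU busy per node, percent over the last 5 minutes
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    # Available memory per node, as a percentage of total
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

    # Pods stuck outside Running/Succeeded, per namespace
    sum by (namespace) (kube_pod_status_phase{phase=~"Pending|Failed|Unknown"})

    # Memory working set per pod
    sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})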

    Adding Custom Application Metrics

    ServiceMonitor for custom app
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-app
      namespace: monitoring
      labels:
        release: kube-prometheus  # Must match Prometheus selector
    spec:
      namespaceSelector:
        matchNames:
          - default
      selector:
        matchLabels:
          app: my-app
      endpoints:
        - port: metrics
          interval: 30s
          path: /metrics
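
    The ServiceMonitor selects Services by label and scrapes the endpoint port by name, so the application's Service needs a matching label and a port named metrics. A minimal sketch, assuming the app exposes metrics on port 8080:

    Matching Service for the ServiceMonitor (assumed port 8080)
    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
      namespace: default
      labels:
        app: my-app          # Matched by the ServiceMonitor's selector
    spec:
      selector:
        app: my-app
      ports:
        - name: metrics      # Referenced by the ServiceMonitor endpoint port
          port: 8080
          targetPort: 8080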
    2

    Alerting

    Metrics are useless if no one sees them when things go wrong. Alertmanager handles alert routing and notifications.

    Alertmanager configuration (add to monitoring-values.yaml)
    alertmanager:
      config:
        global:
          resolve_timeout: 5m
        route:
          group_by: ['alertname', 'namespace']
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 4h
          receiver: 'default'
          routes:
            - match:
                severity: critical
              receiver: 'critical'
        receivers:
          - name: 'default'
            email_configs:
              - to: 'alerts@yourdomain.com'
                from: 'alertmanager@yourdomain.com'
                smarthost: 'smtp.yourdomain.com:587'
                auth_username: 'smtp-user'
                auth_password: 'smtp-password'
          - name: 'critical'
            slack_configs:
              - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
                channel: '#alerts-critical'
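
    To confirm the configuration was picked up, port-forward to Alertmanager and check its Status page. The alertmanager-operated service is created by the Prometheus operator; verify the exact name with kubectl get svc -n monitoring.

    Check the Alertmanager UI
    kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
    # Open http://localhost:9093 and review Status > Config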

    Built-in Alerts

    The kube-prometheus-stack includes alerts for common issues:

    • KubePodCrashLooping: Pod repeatedly crashing
    • KubePodNotReady: Pod stuck in non-ready state
    • NodeNotReady: Node offline
    • NodeMemoryHighUtilization: Node running low on memory
    • NodeFilesystemSpaceFillingUp: Disk filling up

    Creating Custom Alerts

    PrometheusRule for custom alerts
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: my-app-alerts
      namespace: monitoring
      labels:
        release: kube-prometheus
    spec:
      groups:
        - name: my-app
          rules:
            - alert: HighErrorRate
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m])) 
                / sum(rate(http_requests_total[5m])) > 0.05
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High error rate detected"
                description: "Error rate is {{ $value | humanizePercentage }}"
            
            - alert: HighLatency
              expr: |
                histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High latency detected"
                description: "95th percentile latency is {{ $value }}s"
    3

    Helm for Package Management

    Helm is the package manager for Kubernetes. Instead of managing dozens of YAML files, you install applications as "charts" with configurable values.

    Helm Concepts

    • Chart: A package containing Kubernetes manifests, templates, and default values
    • Release: An installed instance of a chart
    • Repository: A collection of charts (like apt or npm registries)
    • Values: Configuration that customizes a chart for your needs
    Essential Helm commands
    # Search for charts
    helm search hub wordpress
    helm search repo prometheus
    
    # Add a repository
    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm repo update
    
    # Show chart information
    helm show chart bitnami/wordpress
    helm show values bitnami/wordpress  # See configurable values
    
    # Install a chart
    helm install my-release bitnami/wordpress -f values.yaml
    
    # List installed releases
    helm list -A
    
    # Upgrade a release
    helm upgrade my-release bitnami/wordpress -f values.yaml
    
    # Rollback to previous version
    helm rollback my-release 1
    
    # Uninstall
    helm uninstall my-release

    Example: Installing Redis with Helm

    Install Redis
    # Add Bitnami repo
    helm repo add bitnami https://charts.bitnami.com/bitnami
    
    # Create values file
    cat <<EOF > redis-values.yaml
    architecture: standalone
    auth:
      enabled: true
      password: "your-redis-password"
    master:
      persistence:
        size: 2Gi
      resources:
        limits:
          memory: 256Mi
          cpu: 250m
    EOF
    
    # Install
    helm install redis bitnami/redis -f redis-values.yaml -n default
    
    # Get connection info
    helm status redis
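
    helm status prints the chart's own connection instructions; a typical session looks like this, assuming the Bitnami chart's default naming (secret redis with key redis-password, service redis-master):

    Connect to Redis (names assume Bitnami defaults)
    # Read the password from the chart-created secret
    export REDIS_PASSWORD=$(kubectl get secret redis -o jsonpath="{.data.redis-password}" | base64 -d)

    # Run a throwaway client pod and ping the master service
    kubectl run redis-client --rm -it --restart=Never --image=bitnami/redis --command -- \
      redis-cli -h redis-master -a "$REDIS_PASSWORD" ping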

    Creating Your Own Helm Charts

    Create and install custom chart
    # Generate chart structure
    helm create my-app
    
    # Structure created:
    # my-app/
    # ├── Chart.yaml          # Chart metadata
    # ├── values.yaml         # Default configuration
    # ├── templates/
    # │   ├── deployment.yaml
    # │   ├── service.yaml
    # │   ├── ingress.yaml
    # │   └── _helpers.tpl    # Template helpers
    # └── charts/             # Dependencies
    
    # Install from local chart
    helm install my-app ./my-app -f production-values.yaml
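
    Before installing, it's worth validating and rendering the chart locally. helm lint and helm template are built in; helm diff is an optional third-party plugin (databus23/helm-diff):

    Validate and preview a chart
    # Check the chart for syntax errors and common mistakes
    helm lint ./my-app

    # Render the manifests locally without touching the cluster
    helm template my-app ./my-app -f production-values.yaml

    # Optional: preview what an upgrade would change
    helm plugin install https://github.com/databus23/helm-diff
    helm diff upgrade my-app ./my-app -f production-values.yaml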
    4

    Horizontal Pod Autoscaling

    Kubernetes can automatically scale deployments based on CPU, memory, or custom metrics. The HorizontalPodAutoscaler relies on metrics-server for CPU and memory data; k3s ships it by default, so just confirm it's running.

    Verify metrics-server is running
    kubectl get deployment metrics-server -n kube-system

    Basic CPU-Based Autoscaling

    HorizontalPodAutoscaler
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70

    Important: Your deployment must define CPU resource requests for CPU-based autoscaling to work. Without requests, the HPA can't calculate a utilization percentage.

    Deployment with resource requests
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
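
    For quick experiments, the same basic CPU autoscaler can also be created imperatively instead of from a manifest:

    Imperative alternative
    kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10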

    Multi-Metric Autoscaling with Behavior

    Advanced HPA with scaling behavior
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      minReplicas: 2
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 80
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
          policies:
            - type: Percent
              value: 10
              periodSeconds: 60
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Percent
              value: 100
              periodSeconds: 15
            - type: Pods
              value: 4
              periodSeconds: 15
          selectPolicy: Max
    Monitor autoscaling
    kubectl get hpa
    kubectl describe hpa my-app-hpa
    kubectl get hpa -w  # Watch in real-time
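
    To watch the autoscaler react, generate some artificial load against the application. This sketch assumes a Service named my-app in the same namespace:

    Generate test load
    kubectl run load-generator --rm -it --restart=Never --image=busybox -- \
      /bin/sh -c "while true; do wget -q -O- http://my-app > /dev/null; done"

    # In another terminal, watch the replica count grow
    kubectl get hpa my-app-hpa -w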
    5

    Backup Strategies

    Kubernetes state lives in etcd and persistent volumes. Back up both.

    Backing Up etcd (k3s)

    etcd snapshots with k3s
    # Create snapshot
    sudo k3s etcd-snapshot save --name manual-backup-$(date +%Y%m%d)
    
    # List snapshots
    sudo k3s etcd-snapshot ls
    
    # Snapshots stored in /var/lib/rancher/k3s/server/db/snapshots/
    Automate with cron
    # /etc/cron.d/k3s-etcd-backup
    0 */6 * * * root /usr/local/bin/k3s etcd-snapshot save --name scheduled-$(date +\%Y\%m\%d-\%H\%M)
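
    k3s can also schedule snapshots itself (embedded etcd only), which avoids the external cron entry. A sketch using the etcd-snapshot-schedule-cron and etcd-snapshot-retention server options; restart k3s after editing:

    /etc/rancher/k3s/config.yaml
    etcd-snapshot-schedule-cron: "0 */6 * * *"   # Every 6 hours
    etcd-snapshot-retention: 28                  # Keep the last 28 snapshots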

    Backing Up with Velero

    Velero backs up Kubernetes resources and persistent volumes to object storage.

    Install Velero
    # Install CLI
    wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-amd64.tar.gz
    tar -xvf velero-v1.13.0-linux-amd64.tar.gz
    sudo mv velero-v1.13.0-linux-amd64/velero /usr/local/bin/
    
    # Install in cluster (with S3-compatible storage)
    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.9.0 \
      --bucket your-backup-bucket \
      --secret-file ./credentials-velero \
      --backup-location-config region=us-east-1,s3ForcePathStyle=true,s3Url=https://your-s3-endpoint \
      --use-volume-snapshots=false \
      --use-node-agent
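
    The --secret-file flag above points at a credentials file in the AWS-style INI format the Velero AWS plugin expects:

    credentials-velero
    [default]
    aws_access_key_id = YOUR_ACCESS_KEY
    aws_secret_access_key = YOUR_SECRET_KEY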
    Create and schedule backups
    # Backup entire cluster
    velero backup create full-backup
    
    # Backup specific namespace
    velero backup create wordpress-backup --include-namespaces wordpress
    
    # Schedule automatic backups
    velero schedule create daily-backup --schedule="0 2 * * *" --ttl 168h   # Keep 7 days
    velero schedule create weekly-backup --schedule="0 3 * * 0" --ttl 720h  # Keep 30 days
    
    # List and restore
    velero backup get
    velero restore create --from-backup full-backup
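
    Verify that backups complete and that restores finish cleanly; an untested backup is not one you can rely on.

    Inspect backups and restores
    velero backup describe full-backup --details
    velero backup logs full-backup
    velero restore get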
    6

    Disaster Recovery

    When things go wrong, you need clear procedures.

    Scenario: Single Node Failure (Multi-Node Cluster)

    Kubernetes handles this automatically: the node becomes NotReady, its pods are evicted after ~5 minutes, and they are rescheduled onto healthy nodes.

    Monitor the process
    kubectl get nodes
    kubectl get pods -o wide
    kubectl get events --sort-by='.lastTimestamp'

    Scenario: etcd Failure (HA Cluster)

    With 3 server nodes, the cluster tolerates 1 failure. If you lose quorum (2+ failures):

    Reset cluster from local etcd
    # On a surviving server node
    sudo k3s server --cluster-reset
    
    # This resets to single-node using local etcd data
    # Then re-add other nodes

    Scenario: Complete Cluster Loss

    Restore from backup
    # Restore etcd snapshot on new server
    sudo k3s server \
      --cluster-init \
      --cluster-reset \
      --cluster-reset-restore-path=/path/to/snapshot
    
    # Or restore with Velero
    # 1. Install fresh cluster
    # 2. Install Velero with same configuration
    # 3. Restore from backup
    velero restore create --from-backup full-backup

    Disaster Recovery Checklist

    Before an incident:

    • etcd snapshots running on schedule
    • Velero backing up to off-site storage
    • Backup restoration tested recently
    • All manifests stored in version control
    • Runbooks documented for common scenarios

    During an incident:

    1. Assess scope (single pod, node, or cluster-wide)
    2. Check events: kubectl get events --sort-by='.lastTimestamp'
    3. Check node status: kubectl get nodes
    4. Review logs in Grafana/Loki
    5. Follow relevant runbook
    7

    Resource Optimization

    Running efficiently on a VPS means right-sizing resources.

    Analyze resource usage
    # View current usage
    kubectl top nodes
    kubectl top pods -A
    
    # Compare to requests/limits
    kubectl describe node node-name | grep -A 5 "Allocated resources"

    Vertical Pod Autoscaler (VPA)

    VPA recommends or automatically adjusts resource requests. It is not built into Kubernetes; install it separately (it is maintained in the kubernetes/autoscaler project) before applying a manifest like the one below:

    VPA for recommendations
    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: my-app-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      updatePolicy:
        updateMode: "Off"  # Just recommendations, no auto-update

    Pod Disruption Budgets

    Ensure availability during voluntary disruptions (upgrades, node maintenance):

    PodDisruptionBudget
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb
    spec:
      minAvailable: 2  # Or use maxUnavailable: 1
      selector:
        matchLabels:
          app: my-app
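
    Note that minAvailable: 2 only permits voluntary disruptions when the app runs at least 3 replicas; with exactly 2, node drains will block. Check the budget's current state with:

    Check disruption budgets
    kubectl get pdb
    kubectl describe pdb my-app-pdb   # Shows currently allowed disruptions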

    Pod Priority and Preemption

    PriorityClass for critical workloads
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority
    value: 1000000
    globalDefault: false
    description: "High priority for critical workloads"
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: critical-app
    spec:
      template:
        spec:
          priorityClassName: high-priority
          containers:
            - name: app
              image: critical-app:latest
    8

    Production Checklist

    Reliability

    • Resource requests and limits set
    • Liveness and readiness probes configured
    • Pod Disruption Budgets for critical services
    • HPA configured where appropriate
    • Multiple replicas for stateless services

    Observability

    • Prometheus + Grafana deployed
    • Alertmanager configured with notifications
    • Alerts defined for critical conditions
    • Log aggregation in place (Loki)

    Security

    • Network policies restricting pod communication
    • Secrets stored properly (or external vault)
    • RBAC configured (not using cluster-admin)
    • Pod security standards enforced

    Backup & Recovery

    • etcd snapshots scheduled
    • Velero configured for resource backups
    • Persistent volume backups in place
    • Restoration procedure tested

    What's Next

    Your Kubernetes cluster is now production-ready: monitored, alerting on problems, scaling automatically, and backed up. You understand the operational practices that keep clusters healthy.

    In Part 7, we'll put your infrastructure to work by deploying real-world applications like Nextcloud and Gitea, configuring production databases, and implementing high-availability patterns.