AI / ML Platform Guide

    Deploy Kubeflow

    Deploy a complete machine learning platform with Kubeflow Pipelines, Notebooks, KServe, and Katib on RamNode's VPS hosting — starting at just $4/month.

    K3s Kubernetes
    Kubeflow v1.9
    Istio Ingress
    Let's Encrypt SSL

    Step 1: Prerequisites

    Kubeflow is resource-intensive. The following table outlines the minimum and recommended specifications for your RamNode VPS:

    Resource  | Minimum          | Recommended      | Notes
    CPU Cores | 4 vCPUs          | 8+ vCPUs         | More cores improve pipeline parallelism
    RAM       | 16 GB            | 32 GB+           | Notebooks and training are memory-intensive
    Storage   | 80 GB NVMe       | 200 GB+ NVMe     | Datasets and model artifacts need space
    OS        | Ubuntu 24.04 LTS | Ubuntu 24.04 LTS | Fresh installation recommended

    💡 RamNode's Premium KVM plans with NVMe storage are ideal. The KVM 8GB or higher plans meet minimum requirements. Consider the 32GB plan for production workloads.

    • SSH access with a non-root sudo user configured
    • A registered domain name with DNS pointed to your VPS IP (optional but recommended for TLS)
    • A stable internet connection for downloading container images
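Before provisioning, it can help to compare a plan against the table programmatically. A quick sketch (the keys and numbers simply mirror the minimums above):

```python
MINIMUM = {"cpus": 4, "ram_gib": 16, "disk_gib": 80}  # minimums from the table above

def shortfalls(detected: dict, minimum: dict = MINIMUM) -> list:
    """Return the resource keys that fall below the minimum spec."""
    return [k for k, need in minimum.items() if detected.get(k, 0) < need]

# Example: a KVM 8GB plan has enough CPU and disk but too little RAM
print(shortfalls({"cpus": 4, "ram_gib": 8, "disk_gib": 160}))  # → ['ram_gib']
```

On the VPS itself, the detected values could come from `os.cpu_count()`, `/proc/meminfo`, and `shutil.disk_usage`.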

    Step 2: Initial Server Setup

    Update system and install base packages
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y curl wget git build-essential \
      apt-transport-https ca-certificates gnupg lsb-release \
      software-properties-common jq unzip

    Configure System Limits

    Kubernetes requires increased file descriptor and process limits:

    Add to /etc/security/limits.conf
    sudo bash -c 'cat >> /etc/security/limits.conf << EOF
    *     soft     nofile     65536
    *     hard     nofile     65536
    *     soft     nproc      65536
    *     hard     nproc      65536
    EOF'

    Disable Swap

    Disable swap for Kubernetes
    sudo swapoff -a
    sudo sed -i '/ swap / s/^/#/' /etc/fstab

    Enable Required Kernel Modules

    Load kernel modules for container networking
    sudo modprobe overlay
    sudo modprobe br_netfilter
    
    cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
    overlay
    br_netfilter
    EOF
    
    cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
    net.bridge.bridge-nf-call-iptables = 1
    net.bridge.bridge-nf-call-ip6tables = 1
    net.ipv4.ip_forward = 1
    EOF
    
    sudo sysctl --system

    Step 3: Install Container Runtime (containerd)

    Kubeflow runs on Kubernetes, which requires a container runtime. We use containerd, the industry-standard runtime. Note that K3s (installed in the next step) bundles its own embedded containerd and uses it by default; this standalone install matters only if you point K3s at it with --container-runtime-endpoint unix:///run/containerd/containerd.sock, or if you want containerd available to other tooling on the host.

    Install containerd from Docker repository
    # Add Docker's official GPG key and repository (for containerd)
    sudo install -m 0755 -d /etc/apt/keyrings
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
      sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    sudo chmod a+r /etc/apt/keyrings/docker.gpg
    
    echo "deb [arch=$(dpkg --print-architecture) \
      signed-by=/etc/apt/keyrings/docker.gpg] \
      https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    
    sudo apt update
    sudo apt install -y containerd.io

    Configure containerd for Kubernetes

    Enable SystemdCgroup
    sudo mkdir -p /etc/containerd
    containerd config default | sudo tee /etc/containerd/config.toml
    
    # Enable SystemdCgroup (required for Kubernetes)
    sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' \
      /etc/containerd/config.toml
    
    sudo systemctl restart containerd
    sudo systemctl enable containerd

    Step 4: Install Kubernetes (K3s)

    K3s provides a lightweight, production-ready Kubernetes distribution that conserves resources while maintaining full API compatibility — ideal for single-node Kubeflow deployments on RamNode.

    Install K3s
    curl -sfL https://get.k3s.io | sh -s - \
      --write-kubeconfig-mode 644 \
      --disable traefik \
      --disable servicelb \
      --kubelet-arg="max-pods=250" \
      --kube-apiserver-arg="service-node-port-range=80-32767"

    Traefik and ServiceLB are disabled because Kubeflow includes its own Istio-based ingress. The increased max-pods limit accommodates Kubeflow's numerous microservices.

    Configure kubectl

    Set up kubeconfig
    mkdir -p ~/.kube
    sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
    sudo chown $(id -u):$(id -g) ~/.kube/config
    export KUBECONFIG=~/.kube/config
    
    # Persist for future sessions
    echo 'export KUBECONFIG=~/.kube/config' >> ~/.bashrc
    Verify Kubernetes and install Helm
    kubectl get nodes
    # NAME        STATUS   ROLES                  AGE   VERSION
    # ramnode     Ready    control-plane,master   1m    v1.31.x+k3s1
    Install Helm
    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    helm version

    Step 5: Install Kubeflow

    We use the official Kubeflow manifests with kustomize for a reproducible, version-pinned installation. This process takes 15–25 minutes depending on your VPS specs.

    Install kustomize
    curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
    sudo mv kustomize /usr/local/bin/
    kustomize version
    Clone and deploy Kubeflow manifests
    cd ~
    git clone https://github.com/kubeflow/manifests.git
    cd manifests
    
    # Check out the latest stable release
    git checkout v1.9-branch
    
    # Deploy all components (retry loop handles CRD race conditions)
    while ! kustomize build example | kubectl apply -f -; do
      echo "Retrying to apply resources..."
      sleep 20
    done

    ⚠️ Warning: Always use a stable release branch for production deployments. The main branch may contain breaking changes.
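The retry loop exists because the first `kubectl apply` passes fail for resources whose CustomResourceDefinitions haven't registered with the API server yet; each pass applies a little more until everything sticks. The control flow can be sketched in Python (`fake_apply` is a stand-in for the kustomize/kubectl pipeline):

```python
import time

def apply_with_retries(apply_fn, max_attempts=30, delay_s=20):
    """Re-run apply until it succeeds, as the shell `while !` loop does."""
    for attempt in range(1, max_attempts + 1):
        if apply_fn():
            return attempt          # number of passes it took
        time.sleep(delay_s)
    raise RuntimeError("resources never applied cleanly")

# Simulate an apply that fails twice (unregistered CRDs), then succeeds
state = {"calls": 0}
def fake_apply():
    state["calls"] += 1
    return state["calls"] >= 3

print(apply_with_retries(fake_apply, delay_s=0))  # → 3
```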

    Monitor Deployment Progress

    Watch pods across Kubeflow namespaces
    kubectl get pods -n cert-manager
    kubectl get pods -n istio-system
    kubectl get pods -n auth
    kubectl get pods -n knative-eventing
    kubectl get pods -n knative-serving
    kubectl get pods -n kubeflow
    kubectl get pods -n kubeflow-user-example-com
    
    # Or watch everything at once
    kubectl get pods -A --watch

    💡 All pods should reach Running or Completed status within 10–20 minutes. If a pod is stuck in ImagePullBackOff, check your internet connectivity and DNS resolution.
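If you'd rather script the health check, the `kubectl get pods` output is easy to parse. A minimal sketch (column positions assume the default output format):

```python
def not_ready(kubectl_output: str) -> list:
    """Return (name, status) for pods that aren't Running or Completed."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:   # skip header row
        cols = line.split()
        name, status = cols[0], cols[2]                    # NAME and STATUS columns
        if status not in ("Running", "Completed"):
            bad.append((name, status))
    return bad

sample = """NAME          READY   STATUS             RESTARTS   AGE
ml-pipeline   1/1     Running            0          5m
dex-7c9f8     0/1     ImagePullBackOff   3          5m"""
print(not_ready(sample))  # → [('dex-7c9f8', 'ImagePullBackOff')]
```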

    Step 6: Access the Kubeflow Dashboard

    Quick Access via Port-Forward

    Port-forward the Istio gateway
    kubectl port-forward svc/istio-ingressgateway \
      -n istio-system 8080:80 --address 0.0.0.0 &

    Access the dashboard at http://YOUR_VPS_IP:8080. Default credentials: user@example.com / 12341234

    ⚠️ Warning: Change the default credentials immediately after your first login. See the Post-Installation Configuration section below.

    Persistent Access with systemd

    Create /etc/systemd/system/kubeflow-gateway.service
    [Unit]
    Description=Kubeflow Istio Gateway Port Forward
    After=k3s.service
    Requires=k3s.service
    
    [Service]
    Type=simple
    User=root
    ExecStart=/usr/local/bin/kubectl port-forward \
      svc/istio-ingressgateway -n istio-system \
      8080:80 --address 0.0.0.0
    Restart=always
    RestartSec=10
    
    [Install]
    WantedBy=multi-user.target
    Enable the gateway service
    sudo systemctl daemon-reload
    sudo systemctl enable --now kubeflow-gateway

    TLS with Nginx Reverse Proxy (Recommended)

    Install Nginx and Certbot
    sudo apt install -y nginx certbot python3-certbot-nginx
    Create /etc/nginx/sites-available/kubeflow
    server {
        listen 80;
        server_name kubeflow.yourdomain.com;
    
        location / {
            proxy_pass http://127.0.0.1:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_read_timeout 86400;
        }
    }
    Enable site and obtain TLS certificate
    sudo ln -sf /etc/nginx/sites-available/kubeflow /etc/nginx/sites-enabled/
    sudo nginx -t && sudo systemctl reload nginx
    
    # Obtain TLS certificate
    sudo certbot --nginx -d kubeflow.yourdomain.com

    Step 7: Post-Installation Configuration

    Change Default Credentials

    Update the Dex static password
    # Generate a new bcrypt hash for your password
    # (needs the bcrypt module: pip3 install bcrypt)
    python3 -c "import bcrypt; print(bcrypt.hashpw(b'YOUR_SECURE_PASSWORD', bcrypt.gensalt()).decode())"
    
    # Edit the Dex config
    kubectl edit configmap dex -n auth
    
    # In the staticPasswords section, replace the old hash with your new
    # bcrypt hash, and optionally update the email:
    #   email: admin@yourdomain.com
    #   hash: <your new bcrypt hash>
    
    # Restart Dex to apply changes
    kubectl rollout restart deployment dex -n auth
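A malformed hash pasted into the ConfigMap silently breaks login, so it's worth sanity-checking the string first. A small stdlib-only structural validator (an illustrative helper, not part of Dex):

```python
import re

# A standard bcrypt hash: $2a/$2b/$2y prefix, two-digit cost, 53-char payload
BCRYPT_RE = re.compile(r"\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}")

def looks_like_bcrypt(h: str) -> bool:
    """Cheap structural check before pasting a hash into the Dex ConfigMap."""
    return BCRYPT_RE.fullmatch(h) is not None

print(looks_like_bcrypt("$2b$12$" + "a" * 53))  # structurally valid → True
print(looks_like_bcrypt("hunter2"))             # plaintext slipped in → False
```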

    Configure Storage for Notebooks

    Verify default storage class
    kubectl get storageclass
    # Should show local-path (default) — works out of the box with k3s
    # For larger datasets, consider attaching additional block storage from RamNode

    Resource Quotas

    Set resource limits to prevent any single notebook or pipeline from consuming all resources:

    Apply resource quotas
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: user-quota
      namespace: kubeflow-user-example-com
    spec:
      hard:
        requests.cpu: "4"
        requests.memory: 16Gi
        limits.cpu: "6"
        limits.memory: 24Gi
        persistentvolumeclaims: "10"
    EOF

    💡 Adjust these values based on your RamNode VPS plan. For a 32GB VPS, you can safely double these limits.
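The tip generalizes: treat the manifest's values as a baseline sized for a 16 GB plan and scale linearly with plan RAM. A sketch (the baseline numbers are copied from the quota above):

```python
BASELINE = {  # quota values from the manifest above, sized for a 16 GB plan
    "requests.cpu": 4, "requests.memory.gib": 16,
    "limits.cpu": 6, "limits.memory.gib": 24,
}

def scale_quota(vps_ram_gib: int, baseline_ram_gib: int = 16) -> dict:
    """Scale every quota value by the ratio of plan RAM to the baseline."""
    factor = vps_ram_gib / baseline_ram_gib
    return {k: round(v * factor) for k, v in BASELINE.items()}

print(scale_quota(32))  # a 32 GB plan doubles every value, as the tip suggests
```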

    Step 8: Security Hardening

    Firewall Configuration

    Configure UFW
    sudo ufw default deny incoming
    sudo ufw default allow outgoing
    sudo ufw allow ssh
    sudo ufw allow 80/tcp       # HTTP (redirects to HTTPS)
    sudo ufw allow 443/tcp      # HTTPS
    sudo ufw allow 6443/tcp     # Kubernetes API (restrict to your IP)
    sudo ufw enable
    
    # Restrict Kubernetes API to your IP only
    sudo ufw delete allow 6443/tcp
    sudo ufw allow from YOUR_IP to any port 6443 proto tcp
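The source restriction can also cover a whole admin network (e.g. `sudo ufw allow from 203.0.113.0/24 ...`); the matching UFW performs is just a CIDR membership test, which the stdlib `ipaddress` module can reproduce (example addresses are from the documentation ranges):

```python
import ipaddress

def api_allowed(source_ip: str, admin_cidr: str) -> bool:
    """Mirror the UFW rule: only the admin network may reach port 6443."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(admin_cidr)

print(api_allowed("203.0.113.7", "203.0.113.0/24"))   # inside the range → True
print(api_allowed("198.51.100.9", "203.0.113.0/24"))  # outside → False
```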

    Network Policies

    Restrict inter-namespace communication
    kubectl apply -f - <<EOF
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: deny-external-egress
      namespace: kubeflow-user-example-com
    spec:
      podSelector: {}
      policyTypes:
        - Egress
      egress:
        - to:
            - namespaceSelector: {}
        - ports:
            - port: 53
              protocol: UDP
            - port: 53
              protocol: TCP
            - port: 443
              protocol: TCP
    EOF

    Enable Audit Logging

    Create audit policy and enable it in K3s
    sudo mkdir -p /var/lib/rancher/k3s/server/audit
    sudo tee /var/lib/rancher/k3s/server/audit/policy.yaml << 'EOF'
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
      - level: Metadata
        resources:
          - group: ""
            resources: ["secrets", "configmaps"]
      - level: RequestResponse
        users: ["system:anonymous"]
      - level: None
        resources:
          - group: ""
            resources: ["events"]
    EOF
    
    # Wire the policy into the K3s apiserver (the file alone does nothing)
    sudo tee -a /etc/rancher/k3s/config.yaml << 'EOF'
    kube-apiserver-arg:
      - "audit-policy-file=/var/lib/rancher/k3s/server/audit/policy.yaml"
      - "audit-log-path=/var/lib/rancher/k3s/server/audit/audit.log"
    EOF
    sudo systemctl restart k3s
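Kubernetes writes audit events as JSON lines, so once logging is wired into the apiserver, filtering the log is straightforward. A minimal sketch that pulls out Secret accesses (field names follow the Kubernetes audit event schema):

```python
import json

def secret_accesses(audit_lines):
    """Return (username, verb) for every audit event touching Secrets."""
    hits = []
    for line in audit_lines:
        ev = json.loads(line)
        if ev.get("objectRef", {}).get("resource") == "secrets":
            hits.append((ev.get("user", {}).get("username"), ev.get("verb")))
    return hits

sample = [
    json.dumps({"objectRef": {"resource": "secrets"},
                "user": {"username": "system:anonymous"}, "verb": "get"}),
    json.dumps({"objectRef": {"resource": "events"}, "verb": "list"}),
]
print(secret_accesses(sample))  # → [('system:anonymous', 'get')]
```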

    Step 9: Your First ML Pipeline

    Verify your deployment by creating a notebook server and running a sample ML pipeline.

    Create a Notebook Server

    1. Log in to the Kubeflow Dashboard
    2. Navigate to Notebooks in the left sidebar
    3. Click New Notebook — configure with name test-notebook, image kubeflownotebookswg/jupyter-scipy:v1.9.0, 1 CPU, 2Gi memory, 10Gi storage
    4. Click Launch and wait for the notebook to start, then click Connect

    Run a Sample Pipeline

    Sample KFP pipeline in JupyterLab
    # Install the KFP SDK
    !pip install kfp==2.7.0
    
    import kfp
    from kfp import dsl
    
    @dsl.component(base_image='python:3.11-slim')
    def train_model(data_size: int) -> str:
        import random
        accuracy = 0.7 + random.random() * 0.25
        print(f'Trained model with accuracy: {accuracy:.4f}')
        return f'Model accuracy: {accuracy:.4f}'
    
    @dsl.component(base_image='python:3.11-slim')
    def evaluate_model(result: str) -> str:
        print(f'Evaluation complete: {result}')
        return f'Evaluation: PASSED - {result}'
    
    @dsl.pipeline(name='ramnode-test-pipeline')
    def test_pipeline():
        train = train_model(data_size=1000)
        evaluate_model(result=train.output)
    
    # Compile and submit
    kfp.compiler.Compiler().compile(test_pipeline, 'pipeline.yaml')
    client = kfp.Client()
    run = client.create_run_from_pipeline_package(
        'pipeline.yaml',
        arguments={},
        run_name='first-run'
    )

    Monitor Your Pipeline

    1. Navigate to Runs in the Kubeflow Dashboard
    2. Click on your pipeline run to view the DAG visualization
    3. Click individual steps to view logs, inputs, and outputs
    4. Verify both steps complete with green checkmarks

    Step 10: Performance Tuning for RamNode VPS

    Storage I/O Optimization

    RamNode's NVMe storage delivers excellent baseline performance. Optimize further with kernel parameters:

    NVMe tuning parameters
    sudo tee -a /etc/sysctl.d/99-kubeflow-tuning.conf << 'EOF'
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 5
    vm.swappiness = 0
    vm.vfs_cache_pressure = 50
    EOF
    
    sudo sysctl --system
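To see what those ratios mean concretely: `vm.dirty_background_ratio` and `vm.dirty_ratio` are percentages of RAM at which background and blocking writeback begin, respectively. A quick calculation using the values above:

```python
def writeback_thresholds(ram_gib: float, background_ratio=5, dirty_ratio=10):
    """GiB of dirty page cache at which writeback starts (ratios from above)."""
    return {
        "background_gib": ram_gib * background_ratio / 100,
        "blocking_gib": ram_gib * dirty_ratio / 100,
    }

print(writeback_thresholds(32))  # → {'background_gib': 1.6, 'blocking_gib': 3.2}
```

Lower ratios keep writeback bursts small, which suits latency-sensitive NVMe workloads.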

    Container Image Caching

    Pre-pull popular ML images
    sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-scipy:v1.9.0
    sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-pytorch-full:v1.9.0
    sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-tensorflow-full:v1.9.0

    Memory Management

    Configure kubelet memory reservation
    # Reserve CPU/memory for the OS and K3s via the K3s config file
    # (K3s reads /etc/rancher/k3s/config.yaml on startup)
    sudo tee -a /etc/rancher/k3s/config.yaml << 'EOF'
    kubelet-arg:
      - "system-reserved=cpu=500m,memory=1Gi"
      - "kube-reserved=cpu=500m,memory=1Gi"
    EOF
    
    sudo systemctl restart k3s
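These reservations come straight off what the scheduler can hand to pods: allocatable = capacity − system-reserved − kube-reserved − eviction threshold. A quick check of what a plan leaves for workloads (0.1 GiB reflects the kubelet's default 100Mi hard-eviction threshold):

```python
def allocatable_gib(capacity_gib, system_reserved_gib=1.0,
                    kube_reserved_gib=1.0, eviction_gib=0.1):
    """Memory the scheduler can actually allocate to pods on the node."""
    return capacity_gib - system_reserved_gib - kube_reserved_gib - eviction_gib

print(allocatable_gib(32))  # → 29.9
print(allocatable_gib(16))  # → 13.9
```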

    Step 11: Maintenance & Operations

    Backup Strategy

    Backup Kubeflow resources
    # Backup all Kubeflow resources
    mkdir -p ~/kubeflow-backups/$(date +%Y%m%d)
    
    for ns in kubeflow kubeflow-user-example-com istio-system \
      cert-manager auth knative-serving; do
      kubectl get all -n $ns -o yaml > \
        ~/kubeflow-backups/$(date +%Y%m%d)/$ns-resources.yaml
    done
    
    # Backup PersistentVolumeClaims (user data)
    kubectl get pvc -A -o yaml > \
      ~/kubeflow-backups/$(date +%Y%m%d)/all-pvcs.yaml
    
    # Schedule daily backups via cron (append rather than replace the crontab);
    # backup-kubeflow.sh wraps the commands above
    ( crontab -l 2>/dev/null; echo "0 2 * * * $HOME/scripts/backup-kubeflow.sh" ) | crontab -
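Dated backup directories accumulate, so a pruning step pairs naturally with the backup script. A sketch that lists which `YYYYMMDD` directories fall outside a retention window (a companion to the backup-kubeflow.sh wrapper):

```python
from datetime import date, timedelta

def dirs_to_prune(dirnames, keep_days=14, today=None):
    """Return YYYYMMDD-named backup dirs older than the retention window."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in dirnames:
        try:
            d = date(int(name[:4]), int(name[4:6]), int(name[6:8]))
        except ValueError:
            continue                      # skip anything not date-named
        if d < cutoff:
            old.append(name)
    return sorted(old)

print(dirs_to_prune(["20250101", "20250601", "notes.txt"],
                    keep_days=14, today=date(2025, 6, 10)))  # → ['20250101']
```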

    Monitoring with kubectl

    Cluster and component health checks
    # Check cluster health
    kubectl get nodes
    kubectl top nodes
    kubectl top pods -A --sort-by=memory | head -20
    
    # Check Kubeflow component health
    kubectl get pods -n kubeflow | grep -v Running
    
    # View recent events for troubleshooting
    kubectl get events -n kubeflow --sort-by=.metadata.creationTimestamp | tail -20

    Upgrading Kubeflow

    1. Back up your current deployment (see above)
    2. Review the Kubeflow release notes for breaking changes
    3. Update the manifests repository: git fetch && git checkout v{NEW_VERSION}-branch
    4. Apply the updated manifests with the same kustomize command from the install step
    5. Monitor pod status and verify all components restart successfully

    Step 12: Troubleshooting

    Pods stuck in Pending

    Insufficient resources. Check node resources with kubectl describe node. Consider upgrading your RamNode VPS plan.

    ImagePullBackOff

    Network or registry issue. Verify DNS resolution with nslookup registry-1.docker.io. Check /etc/resolv.conf points to a working nameserver.

    Dashboard unreachable

    Port-forward or Istio issue. Restart the gateway service: sudo systemctl restart kubeflow-gateway. Check Istio pods: kubectl get pods -n istio-system.

    Notebook won't start

    PVC or resource quota issue. Check events: kubectl describe pod -n kubeflow-user-example-com. Verify storage is available.

    Pipeline runs fail

    Service account permissions issue. Check pipeline runner SA: kubectl describe sa -n kubeflow-user-example-com. Review pod logs for auth errors.

    OOMKilled pods

    Insufficient memory. Increase VPS RAM or reduce resource requests. Check which pods consume the most: kubectl top pods -A --sort-by=memory.

    Useful Diagnostic Commands

    Cluster diagnostics
    # Full cluster diagnostics
    kubectl cluster-info dump > cluster-dump.txt
    
    # Pod logs for a specific component
    kubectl logs -n kubeflow deployment/ml-pipeline -c ml-pipeline-api-server
    
    # Describe a failing pod
    kubectl describe pod POD_NAME -n kubeflow
    
    # Check disk usage
    df -h
    sudo k3s ctr images list | wc -l
    
    # Clean unused images to free space
    sudo k3s crictl rmi --prune

    Kubeflow ML Platform Deployed Successfully!

    Your Kubeflow machine learning platform is now running on a RamNode VPS with K3s, Istio ingress, Kubeflow Pipelines, Notebooks, and optional Let's Encrypt TLS via the Nginx proxy. RamNode's NVMe-backed VPS infrastructure provides the high-speed I/O and generous RAM that ML workloads demand — starting at just $4/month.