AI / ML Platform Guide

    Deploy Kubeflow

    Deploy a complete machine learning platform with Kubeflow Pipelines, Notebooks, KServe, and Katib on RamNode's VPS hosting — starting at just $4/month.

    K3s Kubernetes
    Kubeflow v1.9
    Istio Ingress
    Let's Encrypt SSL

    Step 1: Prerequisites

    Kubeflow is resource-intensive. The following table outlines the minimum and recommended specifications for your RamNode VPS:

    Resource  | Minimum          | Recommended      | Notes
    CPU Cores | 4 vCPUs          | 8+ vCPUs         | More cores improve pipeline parallelism
    RAM       | 16 GB            | 32 GB+           | Notebooks and training are memory-intensive
    Storage   | 80 GB NVMe       | 200 GB+ NVMe     | Datasets and model artifacts need space
    OS        | Ubuntu 24.04 LTS | Ubuntu 24.04 LTS | Fresh installation recommended

    💡 RamNode's Premium KVM plans with NVMe storage are ideal. The KVM 8GB or higher plans meet minimum requirements. Consider the 32GB plan for production workloads.

    • SSH access with a non-root sudo user configured
    • A registered domain name with DNS pointed to your VPS IP (optional but recommended for TLS)
    • A stable internet connection for downloading container images
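Before provisioning, it can help to compare a plan against the table programmatically. A quick sketch (the keys and numbers simply mirror the minimums above):

```python
MINIMUM = {"cpus": 4, "ram_gib": 16, "disk_gib": 80}  # minimums from the table above

def shortfalls(detected: dict, minimum: dict = MINIMUM) -> list:
    """Return the resource keys that fall below the minimum spec."""
    return [k for k, need in minimum.items() if detected.get(k, 0) < need]

# Example: a KVM 8GB plan has enough CPU and disk but too little RAM
print(shortfalls({"cpus": 4, "ram_gib": 8, "disk_gib": 160}))  # → ['ram_gib']
```

On the VPS itself, the detected values could come from `os.cpu_count()`, `/proc/meminfo`, and `shutil.disk_usage`.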

    Step 2: Initial Server Setup

    Update system and install base packages
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y curl wget git build-essential \
      apt-transport-https ca-certificates gnupg lsb-release \
      software-properties-common jq unzip

    Configure System Limits

    Kubernetes requires increased file descriptor and process limits:

    Add to /etc/security/limits.conf
    sudo bash -c 'cat >> /etc/security/limits.conf << EOF
    *     soft     nofile     65536
    *     hard     nofile     65536
    *     soft     nproc      65536
    *     hard     nproc      65536
    EOF'

    Disable Swap

    Disable swap for Kubernetes
    sudo swapoff -a
    sudo sed -i '/ swap / s/^/#/' /etc/fstab

    Enable Required Kernel Modules

    Load kernel modules for container networking
    sudo modprobe overlay
    sudo modprobe br_netfilter
    
    cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
    overlay
    br_netfilter
    EOF
    
    cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
    net.bridge.bridge-nf-call-iptables = 1
    net.bridge.bridge-nf-call-ip6tables = 1
    net.ipv4.ip_forward = 1
    EOF
    
    sudo sysctl --system

    Step 3: Install Container Runtime (containerd)

    Kubeflow runs on Kubernetes, which requires a container runtime. We use containerd, the industry-standard runtime. Note that K3s (installed in the next step) bundles its own embedded containerd and uses it by default; this standalone install matters only if you point K3s at it with --container-runtime-endpoint unix:///run/containerd/containerd.sock, or if you want containerd available to other tooling on the host.

    Install containerd from Docker repository
    # Add Docker's official GPG key and repository (for containerd)
    sudo install -m 0755 -d /etc/apt/keyrings
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
      sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    sudo chmod a+r /etc/apt/keyrings/docker.gpg
    
    echo "deb [arch=$(dpkg --print-architecture) \
      signed-by=/etc/apt/keyrings/docker.gpg] \
      https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    
    sudo apt update
    sudo apt install -y containerd.io

    Configure containerd for Kubernetes

    Enable SystemdCgroup
    sudo mkdir -p /etc/containerd
    containerd config default | sudo tee /etc/containerd/config.toml
    
    # Enable SystemdCgroup (required for Kubernetes)
    sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' \
      /etc/containerd/config.toml
    
    sudo systemctl restart containerd
    sudo systemctl enable containerd

    Step 4: Install Kubernetes (K3s)

    K3s provides a lightweight, production-ready Kubernetes distribution that conserves resources while maintaining full API compatibility — ideal for single-node Kubeflow deployments on RamNode.

    Install K3s
    curl -sfL https://get.k3s.io | sh -s - \
      --write-kubeconfig-mode 644 \
      --disable traefik \
      --disable servicelb \
      --kubelet-arg="max-pods=250" \
      --kube-apiserver-arg="service-node-port-range=80-32767"

    Traefik and ServiceLB are disabled because Kubeflow includes its own Istio-based ingress. The increased max-pods limit accommodates Kubeflow's numerous microservices.

    Configure kubectl

    Set up kubeconfig
    mkdir -p ~/.kube
    sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
    sudo chown $(id -u):$(id -g) ~/.kube/config
    export KUBECONFIG=~/.kube/config
    
    # Persist for future sessions
    echo 'export KUBECONFIG=~/.kube/config' >> ~/.bashrc
    Verify Kubernetes and install Helm
    kubectl get nodes
    # NAME        STATUS   ROLES                  AGE   VERSION
    # ramnode     Ready    control-plane,master   1m    v1.31.x+k3s1
    Install Helm
    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    helm version

    Step 5: Install Kubeflow

    We use the official Kubeflow manifests with kustomize for a reproducible, version-pinned installation. This process takes 15–25 minutes depending on your VPS specs.

    Install kustomize
    curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
    sudo mv kustomize /usr/local/bin/
    kustomize version
    Clone and deploy Kubeflow manifests
    cd ~
    git clone https://github.com/kubeflow/manifests.git
    cd manifests
    
    # Check out the latest stable release
    git checkout v1.9-branch
    
    # Deploy all components (retry loop handles CRD race conditions)
    while ! kustomize build example | kubectl apply -f -; do
      echo "Retrying to apply resources..."
      sleep 20
    done

    ⚠️ Warning: Always use a stable release branch for production deployments. The main branch may contain breaking changes.
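The retry loop exists because the first `kubectl apply` passes fail for resources whose CustomResourceDefinitions haven't registered with the API server yet; each pass applies a little more until everything sticks. The control flow can be sketched in Python (`fake_apply` is a stand-in for the kustomize/kubectl pipeline):

```python
import time

def apply_with_retries(apply_fn, max_attempts=30, delay_s=20):
    """Re-run apply until it succeeds, as the shell `while !` loop does."""
    for attempt in range(1, max_attempts + 1):
        if apply_fn():
            return attempt          # number of passes it took
        time.sleep(delay_s)
    raise RuntimeError("resources never applied cleanly")

# Simulate an apply that fails twice (unregistered CRDs), then succeeds
state = {"calls": 0}
def fake_apply():
    state["calls"] += 1
    return state["calls"] >= 3

print(apply_with_retries(fake_apply, delay_s=0))  # → 3
```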

    Monitor Deployment Progress

    Watch pods across Kubeflow namespaces
    kubectl get pods -n cert-manager
    kubectl get pods -n istio-system
    kubectl get pods -n auth
    kubectl get pods -n knative-eventing
    kubectl get pods -n knative-serving
    kubectl get pods -n kubeflow
    kubectl get pods -n kubeflow-user-example-com
    
    # Or watch everything at once
    kubectl get pods -A --watch

    💡 All pods should reach Running or Completed status within 10–20 minutes. If a pod is stuck in ImagePullBackOff, check your internet connectivity and DNS resolution.
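If you'd rather script the health check, the `kubectl get pods` output is easy to parse. A minimal sketch (column positions assume the default output format):

```python
def not_ready(kubectl_output: str) -> list:
    """Return (name, status) for pods that aren't Running or Completed."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:   # skip header row
        cols = line.split()
        name, status = cols[0], cols[2]                    # NAME and STATUS columns
        if status not in ("Running", "Completed"):
            bad.append((name, status))
    return bad

sample = """NAME          READY   STATUS             RESTARTS   AGE
ml-pipeline   1/1     Running            0          5m
dex-7c9f8     0/1     ImagePullBackOff   3          5m"""
print(not_ready(sample))  # → [('dex-7c9f8', 'ImagePullBackOff')]
```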

    Step 6: Access the Kubeflow Dashboard

    Quick Access via Port-Forward

    Port-forward the Istio gateway
    kubectl port-forward svc/istio-ingressgateway \
      -n istio-system 8080:80 --address 0.0.0.0 &

    Access the dashboard at http://YOUR_VPS_IP:8080. Default credentials: user@example.com / 12341234

    ⚠️ Warning: Change the default credentials immediately after your first login. See the Post-Installation Configuration section below.

    Persistent Access with systemd

    Create /etc/systemd/system/kubeflow-gateway.service
    [Unit]
    Description=Kubeflow Istio Gateway Port Forward
    After=k3s.service
    Requires=k3s.service
    
    [Service]
    Type=simple
    User=root
    ExecStart=/usr/local/bin/kubectl port-forward \
      svc/istio-ingressgateway -n istio-system \
      8080:80 --address 0.0.0.0
    Restart=always
    RestartSec=10
    
    [Install]
    WantedBy=multi-user.target
    Enable the gateway service
    sudo systemctl daemon-reload
    sudo systemctl enable --now kubeflow-gateway

    TLS with Nginx Reverse Proxy (Recommended)

    Install Nginx and Certbot
    sudo apt install -y nginx certbot python3-certbot-nginx
    Create /etc/nginx/sites-available/kubeflow
    server {
        listen 80;
        server_name kubeflow.yourdomain.com;
    
        location / {
            proxy_pass http://127.0.0.1:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_read_timeout 86400;
        }
    }
    Enable site and obtain TLS certificate
    sudo ln -sf /etc/nginx/sites-available/kubeflow /etc/nginx/sites-enabled/
    sudo nginx -t && sudo systemctl reload nginx
    
    # Obtain TLS certificate
    sudo certbot --nginx -d kubeflow.yourdomain.com

    Step 7: Post-Installation Configuration

    Change Default Credentials

    Update the Dex static password
    # Generate a new bcrypt hash for your password
    # (needs the bcrypt module: pip3 install bcrypt)
    python3 -c "import bcrypt; print(bcrypt.hashpw(b'YOUR_SECURE_PASSWORD', bcrypt.gensalt()).decode())"
    
    # Edit the Dex config
    kubectl edit configmap dex -n auth
    
    # In the staticPasswords section, replace the old hash with your new
    # bcrypt hash, and optionally update the email:
    #   email: admin@yourdomain.com
    #   hash: <your new bcrypt hash>
    
    # Restart Dex to apply changes
    kubectl rollout restart deployment dex -n auth
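A malformed hash pasted into the ConfigMap silently breaks login, so it's worth sanity-checking the string first. A small stdlib-only structural validator (an illustrative helper, not part of Dex):

```python
import re

# A standard bcrypt hash: $2a/$2b/$2y prefix, two-digit cost, 53-char payload
BCRYPT_RE = re.compile(r"\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}")

def looks_like_bcrypt(h: str) -> bool:
    """Cheap structural check before pasting a hash into the Dex ConfigMap."""
    return BCRYPT_RE.fullmatch(h) is not None

print(looks_like_bcrypt("$2b$12$" + "a" * 53))  # structurally valid → True
print(looks_like_bcrypt("hunter2"))             # plaintext slipped in → False
```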

    Configure Storage for Notebooks

    Verify default storage class
    kubectl get storageclass
    # Should show local-path (default) — works out of the box with k3s
    # For larger datasets, consider attaching additional block storage from RamNode

    Resource Quotas

    Set resource limits to prevent any single notebook or pipeline from consuming all resources:

    Apply resource quotas
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: user-quota
      namespace: kubeflow-user-example-com
    spec:
      hard:
        requests.cpu: "4"
        requests.memory: 16Gi
        limits.cpu: "6"
        limits.memory: 24Gi
        persistentvolumeclaims: "10"
    EOF

    💡 Adjust these values based on your RamNode VPS plan. For a 32GB VPS, you can safely double these limits.
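The tip generalizes: treat the manifest's values as a baseline sized for a 16 GB plan and scale linearly with plan RAM. A sketch (the baseline numbers are copied from the quota above):

```python
BASELINE = {  # quota values from the manifest above, sized for a 16 GB plan
    "requests.cpu": 4, "requests.memory.gib": 16,
    "limits.cpu": 6, "limits.memory.gib": 24,
}

def scale_quota(vps_ram_gib: int, baseline_ram_gib: int = 16) -> dict:
    """Scale every quota value by the ratio of plan RAM to the baseline."""
    factor = vps_ram_gib / baseline_ram_gib
    return {k: round(v * factor) for k, v in BASELINE.items()}

print(scale_quota(32))  # a 32 GB plan doubles every value, as the tip suggests
```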

    Step 8: Security Hardening

    Firewall Configuration

    Configure UFW
    sudo ufw default deny incoming
    sudo ufw default allow outgoing
    sudo ufw allow ssh
    sudo ufw allow 80/tcp       # HTTP (redirects to HTTPS)
    sudo ufw allow 443/tcp      # HTTPS
    sudo ufw allow 6443/tcp     # Kubernetes API (restrict to your IP)
    sudo ufw enable
    
    # Restrict Kubernetes API to your IP only
    sudo ufw delete allow 6443/tcp
    sudo ufw allow from YOUR_IP to any port 6443 proto tcp
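The source restriction can also cover a whole admin network (e.g. `sudo ufw allow from 203.0.113.0/24 ...`); the matching UFW performs is just a CIDR membership test, which the stdlib `ipaddress` module can reproduce (example addresses are from the documentation ranges):

```python
import ipaddress

def api_allowed(source_ip: str, admin_cidr: str) -> bool:
    """Mirror the UFW rule: only the admin network may reach port 6443."""
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(admin_cidr)

print(api_allowed("203.0.113.7", "203.0.113.0/24"))   # inside the range → True
print(api_allowed("198.51.100.9", "203.0.113.0/24"))  # outside → False
```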

    Network Policies

    Restrict inter-namespace communication
    kubectl apply -f - <<EOF
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: deny-external-egress
      namespace: kubeflow-user-example-com
    spec:
      podSelector: {}
      policyTypes:
        - Egress
      egress:
        - to:
            - namespaceSelector: {}
        - ports:
            - port: 53
              protocol: UDP
            - port: 53
              protocol: TCP
            - port: 443
              protocol: TCP
    EOF

    Enable Audit Logging

    Create audit policy and enable it in K3s
    sudo mkdir -p /var/lib/rancher/k3s/server/audit
    sudo tee /var/lib/rancher/k3s/server/audit/policy.yaml << 'EOF'
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
      - level: Metadata
        resources:
          - group: ""
            resources: ["secrets", "configmaps"]
      - level: RequestResponse
        users: ["system:anonymous"]
      - level: None
        resources:
          - group: ""
            resources: ["events"]
    EOF
    
    # Wire the policy into the K3s apiserver (the file alone does nothing)
    sudo tee -a /etc/rancher/k3s/config.yaml << 'EOF'
    kube-apiserver-arg:
      - "audit-policy-file=/var/lib/rancher/k3s/server/audit/policy.yaml"
      - "audit-log-path=/var/lib/rancher/k3s/server/audit/audit.log"
    EOF
    sudo systemctl restart k3s
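Kubernetes writes audit events as JSON lines, so once logging is wired into the apiserver, filtering the log is straightforward. A minimal sketch that pulls out Secret accesses (field names follow the Kubernetes audit event schema):

```python
import json

def secret_accesses(audit_lines):
    """Return (username, verb) for every audit event touching Secrets."""
    hits = []
    for line in audit_lines:
        ev = json.loads(line)
        if ev.get("objectRef", {}).get("resource") == "secrets":
            hits.append((ev.get("user", {}).get("username"), ev.get("verb")))
    return hits

sample = [
    json.dumps({"objectRef": {"resource": "secrets"},
                "user": {"username": "system:anonymous"}, "verb": "get"}),
    json.dumps({"objectRef": {"resource": "events"}, "verb": "list"}),
]
print(secret_accesses(sample))  # → [('system:anonymous', 'get')]
```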

    Step 9: Your First ML Pipeline

    Verify your deployment by creating a notebook server and running a sample ML pipeline.

    Create a Notebook Server

    1. Log in to the Kubeflow Dashboard
    2. Navigate to Notebooks in the left sidebar
    3. Click New Notebook — configure with name test-notebook, image kubeflownotebookswg/jupyter-scipy:v1.9.0, 1 CPU, 2Gi memory, 10Gi storage
    4. Click Launch and wait for the notebook to start, then click Connect

    Run a Sample Pipeline

    Sample KFP pipeline in JupyterLab
    # Install the KFP SDK
    !pip install kfp==2.7.0
    
    import kfp
    from kfp import dsl
    
    @dsl.component(base_image='python:3.11-slim')
    def train_model(data_size: int) -> str:
        import random
        accuracy = 0.7 + random.random() * 0.25
        print(f'Trained model with accuracy: {accuracy:.4f}')
        return f'Model accuracy: {accuracy:.4f}'
    
    @dsl.component(base_image='python:3.11-slim')
    def evaluate_model(result: str) -> str:
        print(f'Evaluation complete: {result}')
        return f'Evaluation: PASSED - {result}'
    
    @dsl.pipeline(name='ramnode-test-pipeline')
    def test_pipeline():
        train = train_model(data_size=1000)
        evaluate_model(result=train.output)
    
    # Compile and submit
    kfp.compiler.Compiler().compile(test_pipeline, 'pipeline.yaml')
    client = kfp.Client()
    run = client.create_run_from_pipeline_package(
        'pipeline.yaml',
        arguments={},
        run_name='first-run'
    )

    Monitor Your Pipeline

    1. Navigate to Runs in the Kubeflow Dashboard
    2. Click on your pipeline run to view the DAG visualization
    3. Click individual steps to view logs, inputs, and outputs
    4. Verify both steps complete with green checkmarks

    Step 10: Performance Tuning for RamNode VPS

    Storage I/O Optimization

    RamNode's NVMe storage delivers excellent baseline performance. Optimize further with kernel parameters:

    NVMe tuning parameters
    sudo tee -a /etc/sysctl.d/99-kubeflow-tuning.conf << 'EOF'
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 5
    vm.swappiness = 0
    vm.vfs_cache_pressure = 50
    EOF
    
    sudo sysctl --system
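To see what those ratios mean concretely: `vm.dirty_background_ratio` and `vm.dirty_ratio` are percentages of RAM at which background and blocking writeback begin, respectively. A quick calculation using the values above:

```python
def writeback_thresholds(ram_gib: float, background_ratio=5, dirty_ratio=10):
    """GiB of dirty page cache at which writeback starts (ratios from above)."""
    return {
        "background_gib": ram_gib * background_ratio / 100,
        "blocking_gib": ram_gib * dirty_ratio / 100,
    }

print(writeback_thresholds(32))  # → {'background_gib': 1.6, 'blocking_gib': 3.2}
```

Lower ratios keep writeback bursts small, which suits latency-sensitive NVMe workloads.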

    Container Image Caching

    Pre-pull popular ML images
    sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-scipy:v1.9.0
    sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-pytorch-full:v1.9.0
    sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-tensorflow-full:v1.9.0

    Memory Management

    Configure kubelet memory reservation
    # Reserve CPU/memory for the OS and K3s via the K3s config file
    # (K3s reads /etc/rancher/k3s/config.yaml on startup)
    sudo tee -a /etc/rancher/k3s/config.yaml << 'EOF'
    kubelet-arg:
      - "system-reserved=cpu=500m,memory=1Gi"
      - "kube-reserved=cpu=500m,memory=1Gi"
    EOF
    
    sudo systemctl restart k3s
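These reservations come straight off what the scheduler can hand to pods: allocatable = capacity − system-reserved − kube-reserved − eviction threshold. A quick check of what a plan leaves for workloads (0.1 GiB reflects the kubelet's default 100Mi hard-eviction threshold):

```python
def allocatable_gib(capacity_gib, system_reserved_gib=1.0,
                    kube_reserved_gib=1.0, eviction_gib=0.1):
    """Memory the scheduler can actually allocate to pods on the node."""
    return capacity_gib - system_reserved_gib - kube_reserved_gib - eviction_gib

print(allocatable_gib(32))  # → 29.9
print(allocatable_gib(16))  # → 13.9
```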

    Step 11: Maintenance & Operations

    Backup Strategy

    Backup Kubeflow resources
    # Backup all Kubeflow resources
    mkdir -p ~/kubeflow-backups/$(date +%Y%m%d)
    
    for ns in kubeflow kubeflow-user-example-com istio-system \
      cert-manager auth knative-serving; do
      kubectl get all -n $ns -o yaml > \
        ~/kubeflow-backups/$(date +%Y%m%d)/$ns-resources.yaml
    done
    
    # Backup PersistentVolumeClaims (user data)
    kubectl get pvc -A -o yaml > \
      ~/kubeflow-backups/$(date +%Y%m%d)/all-pvcs.yaml
    
    # Schedule daily backups via cron (append rather than replace the crontab);
    # backup-kubeflow.sh wraps the commands above
    ( crontab -l 2>/dev/null; echo "0 2 * * * $HOME/scripts/backup-kubeflow.sh" ) | crontab -
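Dated backup directories accumulate, so a pruning step pairs naturally with the backup script. A sketch that lists which `YYYYMMDD` directories fall outside a retention window (a companion to the backup-kubeflow.sh wrapper):

```python
from datetime import date, timedelta

def dirs_to_prune(dirnames, keep_days=14, today=None):
    """Return YYYYMMDD-named backup dirs older than the retention window."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in dirnames:
        try:
            d = date(int(name[:4]), int(name[4:6]), int(name[6:8]))
        except ValueError:
            continue                      # skip anything not date-named
        if d < cutoff:
            old.append(name)
    return sorted(old)

print(dirs_to_prune(["20250101", "20250601", "notes.txt"],
                    keep_days=14, today=date(2025, 6, 10)))  # → ['20250101']
```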

    Monitoring with kubectl

    Cluster and component health checks
    # Check cluster health
    kubectl get nodes
    kubectl top nodes
    kubectl top pods -A --sort-by=memory | head -20
    
    # Check Kubeflow component health
    kubectl get pods -n kubeflow | grep -v Running
    
    # View recent events for troubleshooting
    kubectl get events -n kubeflow --sort-by=.metadata.creationTimestamp | tail -20

    Upgrading Kubeflow

    1. Back up your current deployment (see above)
    2. Review the Kubeflow release notes for breaking changes
    3. Update the manifests repository: git fetch && git checkout v{NEW_VERSION}-branch
    4. Apply the updated manifests with the same kustomize command from the install step
    5. Monitor pod status and verify all components restart successfully

    Step 12: Troubleshooting

    Pods stuck in Pending

    Insufficient resources. Check node resources with kubectl describe node. Consider upgrading your RamNode VPS plan.

    ImagePullBackOff

    Network or registry issue. Verify DNS resolution with nslookup registry-1.docker.io. Check /etc/resolv.conf points to a working nameserver.

    Dashboard unreachable

    Port-forward or Istio issue. Restart the gateway service: sudo systemctl restart kubeflow-gateway. Check Istio pods: kubectl get pods -n istio-system.

    Notebook won't start

    PVC or resource quota issue. Check events: kubectl describe pod -n kubeflow-user-example-com. Verify storage is available.

    Pipeline runs fail

    Service account permissions issue. Check pipeline runner SA: kubectl describe sa -n kubeflow-user-example-com. Review pod logs for auth errors.

    OOMKilled pods

    Insufficient memory. Increase VPS RAM or reduce resource requests. Check which pods consume the most: kubectl top pods -A --sort-by=memory.

    Useful Diagnostic Commands

    Cluster diagnostics
    # Full cluster diagnostics
    kubectl cluster-info dump > cluster-dump.txt
    
    # Pod logs for a specific component
    kubectl logs -n kubeflow deployment/ml-pipeline -c ml-pipeline-api-server
    
    # Describe a failing pod
    kubectl describe pod POD_NAME -n kubeflow
    
    # Check disk usage
    df -h
    sudo k3s ctr images list | wc -l
    
    # Clean unused images to free space
    sudo k3s crictl rmi --prune

    Kubeflow ML Platform Deployed Successfully!

    Your Kubeflow machine learning platform is now running on a RamNode VPS with K3s, Istio ingress, Kubeflow Pipelines, Notebooks, and optional Let's Encrypt TLS via the Nginx proxy. RamNode's NVMe-backed VPS infrastructure provides the high-speed I/O and generous RAM that ML workloads demand — starting at just $4/month.