Prerequisites
Kubeflow is resource-intensive. The following table outlines the minimum and recommended specifications for your RamNode VPS:
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU Cores | 4 vCPUs | 8+ vCPUs | More cores improve pipeline parallelism |
| RAM | 16 GB | 32 GB+ | Notebooks and training are memory-intensive |
| Storage | 80 GB NVMe | 200 GB+ NVMe | Datasets and model artifacts need space |
| OS | Ubuntu 24.04 LTS | Ubuntu 24.04 LTS | Fresh installation recommended |
💡 RamNode's Premium KVM plans with NVMe storage are ideal. The KVM 8GB or higher plans meet minimum requirements. Consider the 32GB plan for production workloads.
- SSH access with a non-root sudo user configured
- A registered domain name with DNS pointed to your VPS IP (optional but recommended for TLS)
- A stable internet connection for downloading container images
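Before proceeding, you can sanity-check the VPS against the table above with a short script. This is an illustrative helper, not part of any official tooling; the thresholds mirror the minimum column (15 GB for RAM, since 16 GB machines typically report slightly less):

```shell
#!/bin/sh
# Quick check against the minimum specs in the table above.
cpus=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
disk_gb=$(df --output=avail -BG / | tail -1 | tr -dc '0-9')

[ "$cpus" -ge 4 ]     && echo "CPU:  $cpus vCPUs (OK)"        || echo "CPU:  $cpus vCPUs (below minimum of 4)"
[ "$mem_gb" -ge 15 ]  && echo "RAM:  ${mem_gb} GB (OK)"       || echo "RAM:  ${mem_gb} GB (below minimum of 16)"
[ "$disk_gb" -ge 80 ] && echo "Disk: ${disk_gb} GB free (OK)" || echo "Disk: ${disk_gb} GB free (below minimum of 80)"
```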
Initial Server Setup
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget git build-essential \
apt-transport-https ca-certificates gnupg lsb-release \
software-properties-common jq unzip
Configure System Limits
Kubernetes requires increased file descriptor and process limits:
sudo bash -c 'cat >> /etc/security/limits.conf << EOF
* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536
EOF'
Disable Swap
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
Enable Required Kernel Modules
sudo modprobe overlay
sudo modprobe br_netfilter
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
Install Container Runtime (containerd)
Kubeflow runs on Kubernetes, which requires a container runtime. We use containerd, the industry-standard runtime.
# Add Docker's official GPG key and repository (for containerd)
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) \
signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y containerd.io
Configure containerd for Kubernetes
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
# Enable SystemdCgroup (required for Kubernetes)
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' \
/etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable containerd
Install Kubernetes (K3s)
K3s provides a lightweight, production-ready Kubernetes distribution that conserves resources while maintaining full API compatibility — ideal for single-node Kubeflow deployments on RamNode.
curl -sfL https://get.k3s.io | sh -s - \
--write-kubeconfig-mode 644 \
--disable traefik \
--disable servicelb \
--kubelet-arg="max-pods=250" \
--kube-apiserver-arg="service-node-port-range=80-32767"
Traefik and ServiceLB are disabled because Kubeflow includes its own Istio-based ingress. The increased max-pods limit accommodates Kubeflow's numerous microservices.
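If you prefer declarative configuration, the same options can live in /etc/rancher/k3s/config.yaml, which K3s reads at startup. A sketch equivalent to the install flags above:

```yaml
# /etc/rancher/k3s/config.yaml — equivalent to the CLI flags above
write-kubeconfig-mode: "644"
disable:
  - traefik
  - servicelb
kubelet-arg:
  - "max-pods=250"
kube-apiserver-arg:
  - "service-node-port-range=80-32767"
```

This makes the settings survive reinstallation via the install script, since K3s merges the file with any CLI flags.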
Configure kubectl
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config
export KUBECONFIG=~/.kube/config
# Persist for future sessions
echo 'export KUBECONFIG=~/.kube/config' >> ~/.bashrc
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# ramnode   Ready   control-plane,master   1m   v1.31.x+k3s1
Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
Install Kubeflow
We use the official Kubeflow manifests with kustomize for a reproducible, version-pinned installation. This process takes 15–25 minutes depending on your VPS specs.
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/
kustomize version
cd ~
git clone https://github.com/kubeflow/manifests.git
cd manifests
# Check out the latest stable release
git checkout v1.9-branch
# Deploy all components (retry loop handles CRD race conditions)
while ! kustomize build example | kubectl apply -f -; do
echo "Retrying to apply resources..."
sleep 20
done
⚠️ Warning: Always use a stable release branch for production deployments. The main branch may contain breaking changes.
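The retry loop works because kustomize emits CRDs and the resources that depend on them in a single stream; the first pass can fail until the CRDs register with the API server. The same pattern as a small reusable helper (illustrative only, not part of the Kubeflow manifests):

```shell
# retry CMD...: re-run CMD until it succeeds, up to 5 attempts.
retry() {
  attempts=0
  until "$@"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 5 ]; then
      echo "Giving up after $attempts attempts" >&2
      return 1
    fi
    echo "Attempt $attempts failed; retrying in 1s..." >&2
    sleep 1
  done
}

# Usage: retry sh -c 'kustomize build example | kubectl apply -f -'
```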
Monitor Deployment Progress
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com
# Or watch everything at once
kubectl get pods -A --watch
💡 All pods should reach Running or Completed status within 10–20 minutes. If a pod is stuck in ImagePullBackOff, check your internet connectivity and DNS resolution.
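As a convenience, this one-liner (an illustrative helper, not from the Kubeflow docs) prints only the pods that have not yet reached Running or Completed and exits non-zero while any remain, so it can be used in a wait loop:

```shell
# List pods not yet Running/Completed; exit 1 if any are found.
# STATUS is column 4 in `kubectl get pods -A` output.
kubectl get pods -A --no-headers 2>/dev/null \
  | awk '$4 != "Running" && $4 != "Completed" { print; n++ } END { exit n > 0 }'
```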
Access the Kubeflow Dashboard
Quick Access via Port-Forward
kubectl port-forward svc/istio-ingressgateway \
-n istio-system 8080:80 --address 0.0.0.0 &Access the dashboard at http://YOUR_VPS_IP:8080. Default credentials: user@example.com / 12341234
⚠️ Warning: Change the default credentials immediately after your first login. See the Security Hardening section below.
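A safer alternative to exposing port 8080 on all interfaces is to keep the port-forward bound to loopback and reach it through an SSH tunnel from your workstation. A sketch (replace user and YOUR_VPS_IP with your own values):

```shell
# On the VPS: forward only on the loopback interface (no --address 0.0.0.0)
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 &

# On your local machine: tunnel local port 8080 to the VPS loopback
ssh -L 8080:127.0.0.1:8080 user@YOUR_VPS_IP
# Then browse to http://localhost:8080
```

This way the dashboard is never reachable from the public internet while the default credentials are still in place.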
Persistent Access with systemd
Create /etc/systemd/system/kubeflow-gateway.service with the following content:
[Unit]
Description=Kubeflow Istio Gateway Port Forward
After=k3s.service
Requires=k3s.service

[Service]
Type=simple
User=root
Environment=KUBECONFIG=/etc/rancher/k3s/k3s.yaml
ExecStart=/usr/local/bin/kubectl port-forward \
  svc/istio-ingressgateway -n istio-system \
  8080:80 --address 0.0.0.0
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
Then reload systemd and enable the service:
sudo systemctl daemon-reload
sudo systemctl enable --now kubeflow-gateway
TLS with Nginx Reverse Proxy (Recommended)
sudo apt install -y nginx certbot python3-certbot-nginx
Create /etc/nginx/sites-available/kubeflow with the following server block:
server {
    listen 80;
    server_name kubeflow.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400;
    }
}
sudo ln -sf /etc/nginx/sites-available/kubeflow /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
# Obtain TLS certificate
sudo certbot --nginx -d kubeflow.yourdomain.com
Post-Installation Configuration
Change Default Credentials
# Generate a new bcrypt hash for your password
# (requires the bcrypt module: pip3 install bcrypt)
python3 -c "import bcrypt; print(bcrypt.hashpw(\
b'YOUR_SECURE_PASSWORD', bcrypt.gensalt()).decode())"
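If you would rather not install a Python package on the host, the htpasswd tool from apache2-utils is a commonly used alternative for generating a bcrypt hash (-B selects bcrypt, -C the cost, -n prints to stdout, -b takes the password on the command line):

```shell
sudo apt install -y apache2-utils
# Empty username, so only the hash remains after stripping the leading colon
htpasswd -bnBC 10 "" 'YOUR_SECURE_PASSWORD' | tr -d ':\n'
```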
# Edit the Dex config
kubectl edit configmap dex -n auth
# Find the staticPasswords section and replace:
# hash: (old hash)
# With your new bcrypt hash
# email: admin@yourdomain.com
# Restart Dex to apply changes
kubectl rollout restart deployment dex -n auth
Configure Storage for Notebooks
kubectl get storageclass
# Should show local-path (default) — works out of the box with k3s
# For larger datasets, consider attaching additional block storage from RamNode
Resource Quotas
Set resource limits to prevent any single notebook or pipeline from consuming all resources:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-quota
  namespace: kubeflow-user-example-com
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 16Gi
    limits.cpu: "6"
    limits.memory: 24Gi
    persistentvolumeclaims: "10"
EOF
💡 Adjust these values based on your RamNode VPS plan. For a 32GB VPS, you can safely double these limits.
Security Hardening
Firewall Configuration
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 80/tcp # HTTP (redirects to HTTPS)
sudo ufw allow 443/tcp # HTTPS
sudo ufw allow 6443/tcp # Kubernetes API (restrict to your IP)
sudo ufw enable
# Restrict Kubernetes API to your IP only
sudo ufw delete allow 6443/tcp
sudo ufw allow from YOUR_IP to any port 6443 proto tcp
Network Policies
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-external-egress
  namespace: kubeflow-user-example-com
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
        - port: 443
          protocol: TCP
EOF
Enable Audit Logging
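Note that the policy file created in this section has no effect until the API server is told to use it. With K3s this is typically done through /etc/rancher/k3s/config.yaml followed by a k3s restart; a sketch, where the log path and retention are assumptions you can adjust:

```yaml
# /etc/rancher/k3s/config.yaml — point the apiserver at the audit policy
kube-apiserver-arg:
  - "audit-policy-file=/var/lib/rancher/k3s/server/audit/policy.yaml"
  - "audit-log-path=/var/lib/rancher/k3s/server/audit/audit.log"
  - "audit-log-maxage=30"
```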
sudo mkdir -p /var/lib/rancher/k3s/server/audit
sudo tee /var/lib/rancher/k3s/server/audit/policy.yaml << 'EOF'
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  - level: RequestResponse
    users: ["system:anonymous"]
  - level: None
    resources:
      - group: ""
        resources: ["events"]
EOF
Your First ML Pipeline
Verify your deployment by creating a notebook server and running a sample ML pipeline.
Create a Notebook Server
- Log in to the Kubeflow Dashboard
- Navigate to Notebooks in the left sidebar
- Click New Notebook and configure it with name test-notebook, image kubeflownotebookswg/jupyter-scipy:v1.9.0, 1 CPU, 2Gi memory, and 10Gi storage
- Click Launch and wait for the notebook to start, then click Connect
Run a Sample Pipeline
# Install the KFP SDK
!pip install kfp==2.7.0
import kfp
from kfp import dsl
@dsl.component(base_image='python:3.11-slim')
def train_model(data_size: int) -> str:
    import random
    accuracy = 0.7 + random.random() * 0.25
    print(f'Trained model with accuracy: {accuracy:.4f}')
    return f'Model accuracy: {accuracy:.4f}'

@dsl.component(base_image='python:3.11-slim')
def evaluate_model(result: str) -> str:
    print(f'Evaluation complete: {result}')
    return f'Evaluation: PASSED - {result}'

@dsl.pipeline(name='ramnode-test-pipeline')
def test_pipeline():
    train = train_model(data_size=1000)
    evaluate_model(result=train.output)
# Compile and submit
kfp.compiler.Compiler().compile(test_pipeline, 'pipeline.yaml')
client = kfp.Client()
run = client.create_run_from_pipeline_package(
'pipeline.yaml',
arguments={},
run_name='first-run'
)
Monitor Your Pipeline
- Navigate to Runs in the Kubeflow Dashboard
- Click on your pipeline run to view the DAG visualization
- Click individual steps to view logs, inputs, and outputs
- Verify both steps complete with green checkmarks
Performance Tuning for RamNode VPS
Storage I/O Optimization
RamNode's NVMe storage delivers excellent baseline performance. Optimize further with kernel parameters:
sudo tee -a /etc/sysctl.d/99-kubeflow-tuning.conf << 'EOF'
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.swappiness = 0
vm.vfs_cache_pressure = 50
EOF
sudo sysctl --system
Container Image Caching
sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-scipy:v1.9.0
sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-pytorch-full:v1.9.0
sudo k3s ctr images pull docker.io/kubeflownotebookswg/jupyter-tensorflow-full:v1.9.0
Memory Management
# Reserve CPU and memory for the OS and k3s itself so pods cannot starve the host.
# Append to /etc/rancher/k3s/config.yaml (merge by hand if kubelet-arg already exists)
sudo tee -a /etc/rancher/k3s/config.yaml << 'EOF'
kubelet-arg:
  - "system-reserved=cpu=500m,memory=1Gi"
  - "kube-reserved=cpu=500m,memory=1Gi"
EOF
sudo systemctl restart k3s
Maintenance & Operations
Backup Strategy
# Backup all Kubeflow resources
mkdir -p ~/kubeflow-backups/$(date +%Y%m%d)
for ns in kubeflow kubeflow-user-example-com istio-system \
cert-manager auth knative-serving; do
kubectl get all -n $ns -o yaml > \
~/kubeflow-backups/$(date +%Y%m%d)/$ns-resources.yaml
done
# Backup PersistentVolumeClaims (user data)
kubectl get pvc -A -o yaml > \
~/kubeflow-backups/$(date +%Y%m%d)/all-pvcs.yaml
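The cron entry below expects a script at ~/scripts/backup-kubeflow.sh. A minimal sketch that wraps the commands above into one file (the 14-day retention is an assumption; adjust namespaces and retention to taste):

```shell
mkdir -p ~/scripts
cat > ~/scripts/backup-kubeflow.sh << 'EOF'
#!/bin/sh
# Nightly Kubeflow backup: dump resources and PVCs per namespace.
set -eu
dest="$HOME/kubeflow-backups/$(date +%Y%m%d)"
mkdir -p "$dest"
for ns in kubeflow kubeflow-user-example-com istio-system \
          cert-manager auth knative-serving; do
  kubectl get all -n "$ns" -o yaml > "$dest/$ns-resources.yaml"
done
kubectl get pvc -A -o yaml > "$dest/all-pvcs.yaml"
# Keep only the 14 most recent daily backups (names sort chronologically)
ls -1d "$HOME"/kubeflow-backups/* 2>/dev/null | head -n -14 | xargs -r rm -rf
EOF
chmod +x ~/scripts/backup-kubeflow.sh
```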
# Schedule daily backups via cron
# (caution: crontab - replaces the whole crontab; use crontab -e to add alongside existing entries)
echo "0 2 * * * $HOME/scripts/backup-kubeflow.sh" | crontab -
Monitoring with kubectl
# Check cluster health
kubectl get nodes
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
# Check Kubeflow component health
kubectl get pods -n kubeflow | grep -v Running
# View recent events for troubleshooting
kubectl get events -n kubeflow --sort-by=.metadata.creationTimestamp | tail -20
Upgrading Kubeflow
- Back up your current deployment (see above)
- Review the Kubeflow release notes for breaking changes
- Update the manifests repository: git fetch && git checkout v{NEW_VERSION}-branch
- Apply the updated manifests with the same kustomize command from the install step
- Monitor pod status and verify all components restart successfully
Troubleshooting
Pods stuck in Pending
Insufficient resources. Check node resources with kubectl describe node. Consider upgrading your RamNode VPS plan.
ImagePullBackOff
Network or registry issue. Verify DNS resolution with nslookup registry-1.docker.io. Check /etc/resolv.conf points to a working nameserver.
Dashboard unreachable
Port-forward or Istio issue. Restart the gateway service: sudo systemctl restart kubeflow-gateway. Check Istio pods: kubectl get pods -n istio-system.
Notebook won't start
PVC or resource quota issue. Check events: kubectl describe pod -n kubeflow-user-example-com. Verify storage is available.
Pipeline runs fail
Service account permissions issue. Check pipeline runner SA: kubectl describe sa -n kubeflow-user-example-com. Review pod logs for auth errors.
OOMKilled pods
Insufficient memory. Increase VPS RAM or reduce resource requests. Check which pods consume the most: kubectl top pods -A --sort-by=memory.
Useful Diagnostic Commands
# Full cluster diagnostics
kubectl cluster-info dump > cluster-dump.txt
# Pod logs for a specific component
kubectl logs -n kubeflow deployment/ml-pipeline -c ml-pipeline-api-server
# Describe a failing pod
kubectl describe pod POD_NAME -n kubeflow
# Check disk usage
df -h
sudo k3s ctr images list | wc -l
# Clean unused images to free space
sudo k3s crictl rmi --prune
Kubeflow ML Platform Deployed Successfully!
Your Kubeflow machine learning platform is now running in production on a RamNode VPS with K3s, Istio ingress, Kubeflow Pipelines, Notebooks, and SSL encryption. RamNode's NVMe-backed VPS infrastructure provides the high-speed I/O and generous RAM that ML workloads demand — starting at just $4/month.
