AI / ML Infrastructure Guide

    Deploy KServe

    Deploy scalable ML model inference on Kubernetes with KServe, K3s, Envoy Gateway, and Cert Manager on RamNode's VPS hosting.

    K3s Kubernetes
    KServe v0.15+
    Envoy Gateway
    Cert Manager
    Step 1: Prerequisites

    Choose your RamNode plan based on your workload type. Predictive inference (scikit-learn, XGBoost) runs well on mid-range VPS plans, while generative AI and LLM serving require more resources.

    Resource    Predictive AI    Small LLMs      Large LLMs
    CPU Cores   4+               8+              16+
    RAM         8 GB             32 GB           64+ GB
    Storage     40 GB SSD        100 GB NVMe     200+ GB NVMe
    OS          Ubuntu 24.04     Ubuntu 24.04    Ubuntu 24.04

    Software Prerequisites

    • curl — pre-installed on Ubuntu 24.04
    • kubectl — Kubernetes CLI for cluster management
    • Helm 3 — Kubernetes package manager
    • A running Kubernetes cluster — we will install K3s in this guide
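    As a quick sanity check, you can confirm which of these tools are already on the PATH (kubectl and Helm are installed in later steps, so only curl is expected at this point):

    ```shell
    # Report which prerequisite tools are already installed
    for tool in curl wget git; do
      if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
      else
        echo "$tool: missing"
      fi
    done
    ```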
    Step 2: Initial Server Setup

    Update system and install essentials
    apt update && apt upgrade -y
    apt install -y curl wget git apt-transport-https ca-certificates

    Configure Firewall

    Open required ports
    ufw allow 22/tcp        # SSH
    ufw allow 6443/tcp      # Kubernetes API
    ufw allow 80/tcp        # HTTP inference endpoints
    ufw allow 443/tcp       # HTTPS inference endpoints
    ufw allow 8080/tcp      # KServe model endpoints
    ufw allow 10250/tcp     # Kubelet
    ufw --force enable

    Set Kernel Parameters

    Configure networking for Kubernetes
    cat <<EOF > /etc/sysctl.d/99-kubernetes.conf
    net.bridge.bridge-nf-call-iptables  = 1
    net.bridge.bridge-nf-call-ip6tables = 1
    net.ipv4.ip_forward                 = 1
    EOF
    
    # Load the br_netfilter module now and on every boot
    modprobe br_netfilter
    echo "br_netfilter" > /etc/modules-load.d/k8s.conf
    sysctl --system
    Step 3: Install K3s Kubernetes

    K3s is a lightweight, certified Kubernetes distribution well suited to VPS deployments. It bundles containerd, CoreDNS, and other essentials in a single binary. We disable the bundled Traefik ingress controller because this guide routes traffic through Envoy Gateway and the Gateway API instead.

    Install K3s
    curl -sfL https://get.k3s.io | sh -s - \
      --write-kubeconfig-mode 644 \
      --disable traefik
    Verify installation
    kubectl get nodes
    # Expected output:
    # NAME      STATUS   ROLES                  AGE   VERSION
    # ramnode   Ready    control-plane,master   1m    v1.32.x
    Configure kubectl
    mkdir -p ~/.kube
    cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
    export KUBECONFIG=~/.kube/config
    echo "export KUBECONFIG=~/.kube/config" >> ~/.bashrc
    Step 4: Install Helm

    Install Helm 3
    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
    helm version
    Step 5: Install Cert Manager

    Cert Manager handles TLS certificate provisioning for KServe's webhook server. The minimum required version is 1.15.0.

    Install Cert Manager
    kubectl apply -f \
      https://github.com/cert-manager/cert-manager/releases/download/v1.16.3/cert-manager.yaml
    
    # Wait for all cert-manager components to be ready
    kubectl wait --for=condition=ready pod \
      -l app.kubernetes.io/instance=cert-manager \
      -n cert-manager --timeout=180s
    Verify Cert Manager
    kubectl get pods -n cert-manager
    # All three pods should show Running status:
    #   cert-manager
    #   cert-manager-cainjector
    #   cert-manager-webhook
    Step 6: Install Gateway API & Envoy Gateway

    KServe recommends Gateway API for network configuration, providing flexible and standardized traffic management — especially important for generative AI workloads with streaming responses and long-lived connections.

    Install Gateway API CRDs
    kubectl apply -f \
      https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
    Install Envoy Gateway
    helm install eg oci://docker.io/envoyproxy/gateway-helm \
      --version v1.3.0 \
      -n envoy-gateway-system \
      --create-namespace
    
    # Wait for Envoy Gateway to be ready
    kubectl wait --for=condition=ready pod \
      -l app.kubernetes.io/name=gateway-helm \
      -n envoy-gateway-system --timeout=120s
    Create GatewayClass
    cat <<EOF | kubectl apply -f -
    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: envoy
    spec:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    EOF
    Step 7: Install KServe

    KServe supports three deployment modes. For RamNode VPS deployments, we recommend Standard mode for full resource control.

    Mode                     Best For                          Notes
    Standard (recommended)   GPU models, LLMs, all workloads   Full resource control
    Knative (Serverless)     Burst traffic, scale-to-zero      Higher overhead
    ModelMesh                High-density multi-model          Many small models

    Install with Helm (Recommended)

    Install KServe CRDs and controller
    # Install KServe CRDs
    helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
      --version v0.15.2
    
    # Install KServe in Standard deployment mode
    helm install kserve oci://ghcr.io/kserve/charts/kserve \
      --version v0.15.2 \
      --set kserve.controller.deploymentMode=Standard \
      --set kserve.controller.gateway.ingressGateway.enableGatewayApi=true \
      --set kserve.controller.gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway
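    The Helm values above point KServe at a Gateway named kserve-ingress-gateway in the kserve namespace, which is not created automatically. Below is a minimal sketch of that Gateway, modeled on KServe's Gateway API setup (the listener name and route policy are illustrative); apply it with kubectl apply -f - via a heredoc, as in the other steps, once the kserve namespace exists.

    ```yaml
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: kserve-ingress-gateway
      namespace: kserve
    spec:
      gatewayClassName: envoy        # the GatewayClass created in the previous step
      listeners:
        - name: http
          protocol: HTTP
          port: 80
          allowedRoutes:
            namespaces:
              from: All              # accept HTTPRoutes from any namespace
    ```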

    Alternative: Install with kubectl

    Install via raw manifests
    # Install KServe CRDs and Controller
    kubectl apply --server-side -f \
      https://github.com/kserve/kserve/releases/download/v0.15.2/kserve.yaml
    
    # Install built-in ClusterServingRuntimes
    kubectl apply --server-side -f \
      https://github.com/kserve/kserve/releases/download/v0.15.2/kserve-cluster-resources.yaml
    
    # Patch config for Standard deployment mode
    kubectl patch configmap/inferenceservice-config -n kserve \
      --type=strategic -p '{"data": {"deploy": "{\"defaultDeploymentMode\": \"Standard\"}"}}'

    The --server-side flag is required when applying KServe CRDs due to the large size of the InferenceService custom resource definition.

    Verify KServe installation
    kubectl get pods -n kserve
    # Expected output:
    # NAME                            READY   STATUS    AGE
    # kserve-controller-manager-0     2/2     Running   2m
    
    kubectl get clusterservingruntimes
    # Should list built-in runtimes: kserve-huggingfaceserver,
    # kserve-sklearnserver, kserve-xgbserver, kserve-torchserve, etc.
    Step 8: Deploy Your First Models

    Example 1: Scikit-Learn Iris Classifier

    The simplest way to verify your KServe installation. This deploys a pre-trained scikit-learn model for iris flower classification.

    Deploy sklearn-iris InferenceService
    cat <<EOF | kubectl apply -f -
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sklearn-iris
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          protocolVersion: v2
          storageUri: gs://kfserving-examples/models/sklearn/1.0/model
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "1"
              memory: 2Gi
    EOF
    Check status and test inference
    # Wait for READY=True
    kubectl get inferenceservice sklearn-iris
    
    # Create test input in V2 inference protocol format
    cat <<EOF > /tmp/iris-input.json
    {
      "inputs": [
        {
          "name": "input-0",
          "shape": [2, 4],
          "datatype": "FP32",
          "data": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]
        }
      ]
    }
    EOF
    
    # Get the service URL
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris \
      -o jsonpath='{.status.url}' | cut -d/ -f3)
    
    # Send a prediction request through the gateway (port-forward the Envoy
    # gateway service to localhost:8080 first if it is not already exposed)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" \
      -H "Content-Type: application/json" \
      http://localhost:8080/v2/models/sklearn-iris/infer \
      -d @/tmp/iris-input.json
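    As an aside, the SERVICE_HOSTNAME extraction works because the status URL has the form http://<host>/..., so splitting on "/" makes the host field 3 (the example URL below is hypothetical):

    ```shell
    # `cut -d/ -f3` isolates the host from a URL: the fields split on "/"
    # are "http:", an empty string, and then the host itself.
    url="http://sklearn-iris.default.example.com"
    host=$(echo "$url" | cut -d/ -f3)
    echo "$host"   # sklearn-iris.default.example.com
    ```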

    Example 2: Hugging Face LLM (Qwen 2.5)

    For generative AI, KServe provides native Hugging Face integration with OpenAI-compatible chat completion APIs.

    Deploy Qwen 2.5 LLM
    cat <<EOF | kubectl apply -f -
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: qwen-llm
    spec:
      predictor:
        model:
          modelFormat:
            name: huggingface
          args:
            - "--model_id"
            - "Qwen/Qwen2.5-0.5B-Instruct"
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "4"
              memory: 8Gi
    EOF
    Test chat completions
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-llm \
      -o jsonpath='{.status.url}' | cut -d/ -f3)
    
    curl -v -H "Host: ${SERVICE_HOSTNAME}" \
      -H "Content-Type: application/json" \
      http://localhost:8080/openai/v1/chat/completions \
      -d '{
        "model": "qwen-llm",
        "messages": [{"role": "user", "content": "Hello, what can you do?"}],
        "max_tokens": 100
      }'
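    To vary the prompt without editing inline JSON, the request body can be generated with an interpolated heredoc (PROMPT is a placeholder variable of our own), then sent with -d @/tmp/chat-request.json:

    ```shell
    # Generate the chat request body from a shell variable; the unquoted
    # EOF delimiter lets ${PROMPT} be substituted into the JSON.
    PROMPT="Hello, what can you do?"
    cat <<EOF > /tmp/chat-request.json
    {
      "model": "qwen-llm",
      "messages": [{"role": "user", "content": "${PROMPT}"}],
      "max_tokens": 100
    }
    EOF
    ```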

    🚀 GPU-Accelerated LLMs

    For larger models like Llama 3.1 8B, add nvidia.com/gpu: "1" to your resource limits and ensure your RamNode VPS has GPU support configured. KServe v0.15 also supports multi-node inference for models that exceed single-GPU memory.
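    As a sketch, a GPU-backed predictor's resource block might look like the fragment below (the CPU and memory sizes are illustrative, and the nvidia.com/gpu resource requires the NVIDIA device plugin on the node):

    ```yaml
    # Fragment of spec.predictor.model for a GPU model (sizes illustrative)
    resources:
      requests:
        cpu: "8"
        memory: 24Gi
        nvidia.com/gpu: "1"
      limits:
        cpu: "8"
        memory: 24Gi
        nvidia.com/gpu: "1"
    ```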

    Step 9: Production Configuration

    Configure Autoscaling

    KServe supports request-based autoscaling in Knative mode and metric-based autoscaling via KEDA in Standard mode.

    Set min/max replicas on InferenceService
    cat <<EOF | kubectl apply -f -
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sklearn-iris
    spec:
      predictor:
        minReplicas: 1
        maxReplicas: 5
        model:
          modelFormat:
            name: sklearn
          storageUri: gs://kfserving-examples/models/sklearn/1.0/model
    EOF
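    Beyond the replica range, the predictor spec also accepts a scaling metric and target (fields from KServe's ComponentExtensionSpec); a sketch of CPU-based scaling at 70% utilization:

    ```yaml
    # Fragment of spec.predictor (Standard mode, HPA-backed)
    minReplicas: 1
    maxReplicas: 5
    scaleMetric: cpu
    scaleTarget: 70
    ```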

    Enable Model Caching

    KServe v0.15 introduced LocalModelCache, reducing model loading times from 15–20 minutes to approximately one minute for large models.

    Create LocalModelCache
    cat <<EOF | kubectl apply -f -
    apiVersion: serving.kserve.io/v1alpha1
    kind: LocalModelCache
    metadata:
      name: llm-cache
    spec:
      sourceModelUri: hf://Qwen/Qwen2.5-0.5B-Instruct
      modelSize: 2Gi        # local disk reserved for the cached model
      nodeGroups:
        - default
    EOF

    Resource Defaults

    Customize global resource defaults for all InferenceServices to prevent models from exceeding your VPS limits.

    Edit KServe ConfigMap
    kubectl edit configmap inferenceservice-config -n kserve
    
    # Key settings in the inferenceService section:
    # "cpuLimit":     "2"    — Max CPU per model container
    # "memoryLimit":  "4Gi"  — Max memory per model container
    # "cpuRequest":   "1"    — Default CPU request
    # "memoryRequest":"2Gi"  — Default memory request
    Step 10: Monitoring & Troubleshooting

    Health Checks

    Monitor KServe status
    # Check overall KServe status
    kubectl get inferenceservices --all-namespaces
    
    # View detailed model status
    kubectl describe inferenceservice sklearn-iris
    
    # Check KServe controller logs
    kubectl logs -n kserve -l control-plane=kserve-controller-manager -c manager

    Common Issues

    Problem                             Cause                            Solution
    InferenceService stuck in Unknown   Storage init container failing   Check storage-initializer logs
    CrashLoopBackOff on predictor       Insufficient memory              Increase memory limits
    Model download timeout              Slow network or large model      Use LocalModelCache
    502 Bad Gateway                     Model not yet ready              Wait for READY=True status
    Webhook errors during install       Cert Manager not ready           Verify cert-manager pods

    Debug Commands

    Useful debugging commands
    # Check pod events for errors
    kubectl get events --sort-by=.lastTimestamp
    
    # Inspect storage init container logs
    kubectl logs <pod-name> -c storage-initializer
    
    # Check Gateway resources
    kubectl get gateways,httproutes -A
    
    # View model server logs
    kubectl logs <pod-name> -c kserve-container
    Step 11: Supported ML Frameworks

    KServe provides built-in serving runtimes for all major machine learning frameworks. Each runtime supports the V2 inference protocol.

    Framework              Model Format   Use Case
    Scikit-Learn           sklearn        Classification, regression, clustering
    XGBoost                xgboost        Gradient boosting models
    TensorFlow             tensorflow     Deep learning models
    PyTorch (TorchServe)   pytorch        Custom deep learning
    ONNX                   onnxruntime    Cross-framework optimized
    Hugging Face           huggingface    LLMs, NLP, transformers
    vLLM                   vllm           High-throughput LLM serving
    Custom Containers      custom         Any framework
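    For the custom-container row, here is a minimal sketch of an InferenceService serving from your own image (the image name is a placeholder; KServe expects the container to expose an HTTP inference endpoint, conventionally on port 8080):

    ```yaml
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: my-custom-model
    spec:
      predictor:
        containers:
          - name: kserve-container
            image: registry.example.com/my-model:latest   # placeholder image
            ports:
              - containerPort: 8080
                protocol: TCP
    ```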
    Step 12: Canary Deployments

    KServe supports canary rollouts for safely deploying model updates. Split traffic between current and new model versions, gradually shifting as you gain confidence.

    Canary deployment with 20% traffic split
    cat <<EOF | kubectl apply -f -
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sklearn-iris
    spec:
      predictor:
        canaryTrafficPercent: 20
        model:
          modelFormat:
            name: sklearn
          storageUri: gs://kfserving-examples/models/sklearn/2.0/model
    EOF

    This routes 20% of traffic to the new model version while 80% continues serving from the stable revision. Increase the canary percentage incrementally as you validate performance.

    Step 13: Cleanup

    Remove models and uninstall KServe
    # Delete InferenceServices
    kubectl delete inferenceservice --all
    
    # Uninstall KServe
    helm uninstall kserve
    helm uninstall kserve-crd
    
    # Uninstall Envoy Gateway
    helm uninstall eg -n envoy-gateway-system
    
    # Uninstall Cert Manager
    kubectl delete -f \
      https://github.com/cert-manager/cert-manager/releases/download/v1.16.3/cert-manager.yaml

    KServe Deployed Successfully!

    Your KServe environment is now running on a RamNode VPS with K3s Kubernetes, Envoy Gateway traffic management, Cert Manager TLS, and production-ready model serving. RamNode's flexible VPS infrastructure provides the compute resources that ML inference workloads demand — starting at just $4/month.