What Is BentoML?
BentoML is an open-source Python framework for building production-ready AI inference APIs. It lets you turn any machine learning model — from text summarization to image generation — into a scalable REST API with just a few lines of code. BentoML handles dynamic batching, model parallelism, Docker containerization, and environment management out of the box.
By deploying BentoML on a RamNode VPS, you maintain full control over your infrastructure, data, and costs — no vendor lock-in to managed ML platforms. This makes it ideal for developers, startups, and AI-focused teams who want affordable, high-performance model serving.
Prerequisites
Recommended VPS Specifications
| Requirement | Details |
|---|---|
| VPS Plan | RamNode Cloud VPS — 4 GB RAM minimum (8 GB+ for larger models) |
| Operating System | Ubuntu 24.04 LTS |
| CPU | 2+ vCPUs (4+ recommended for inference workloads) |
| Storage | 40 GB+ SSD (model artifacts can be large) |
| Python | 3.9 or higher (3.11 recommended) |
| Network | Public IPv4 address with SSH access |
💡 RamNode Pricing Advantage: RamNode Cloud VPS plans start at just $4/month with $500 in annual credits — significantly cheaper than managed ML platforms that charge per-inference or per-GPU-hour.
What You'll Need
- A provisioned RamNode VPS with Ubuntu 24.04 LTS installed
- SSH access to your server (root or sudo user)
- A domain name (optional, for HTTPS/reverse proxy)
- Basic familiarity with Python and the Linux command line
Initial Server Setup
Connect and Update
```bash
ssh root@YOUR_SERVER_IP
apt update && apt upgrade -y
reboot  # if kernel was updated
```
Create a Dedicated User
Running services as root is a security risk. Create a dedicated user for BentoML:
```bash
adduser bentoml
usermod -aG sudo bentoml
su - bentoml
```
Configure the Firewall
```bash
sudo ufw allow OpenSSH
sudo ufw allow 3000/tcp  # BentoML default port
sudo ufw enable
sudo ufw status
```
⚠️ Security Note: Port 3000 should only be exposed directly during development. In production, place BentoML behind a reverse proxy (Nginx) with HTTPS — covered in the Production Hardening section.
Python Environment Setup
Install Python 3.11 and Dependencies
Ubuntu 24.04 ships with Python 3.12, but BentoML works best with Python 3.11:
```bash
sudo apt install -y python3.11 python3.11-venv python3.11-dev \
    build-essential curl git
```
Create a Virtual Environment
```bash
python3.11 -m venv ~/bentoml-env
source ~/bentoml-env/bin/activate

# Verify Python version
python --version  # Should output Python 3.11.x

# Auto-activate on login
echo "source ~/bentoml-env/bin/activate" >> ~/.bashrc
```
Install BentoML
```bash
pip install --upgrade pip
pip install bentoml

# Verify the installation
bentoml --version
bentoml env
```
Build Your First BentoML Service
Create the Project Directory
```bash
mkdir ~/my-bento-service && cd ~/my-bento-service
```
Write the Service
Create service.py with a text summarization service powered by Hugging Face Transformers:
```python
# service.py
import bentoml


@bentoml.service(
    image=bentoml.images.Image(python_version="3.11")
    .python_packages("torch", "transformers"),
    resources={"cpu": "2"},
    traffic={"timeout": 120},
)
class Summarization:
    def __init__(self) -> None:
        import torch
        from transformers import pipeline

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = pipeline(
            "summarization",
            model="sshleifer/distilbart-cnn-12-6",
            device=device,
        )

    @bentoml.api(batchable=True)
    def summarize(self, texts: list[str]) -> list[str]:
        # Each pipeline result is a dict with a "summary_text" key
        results = self.pipeline(texts)
        return [item["summary_text"] for item in results]
```
💡 Model Choice: The distilbart-cnn-12-6 model is lightweight (~1.2 GB) and runs well on CPU — perfect for testing on a standard RamNode VPS. For GPU-accelerated inference, consider a plan with NVIDIA GPU support.
Install Model Dependencies
```bash
pip install torch transformers
```
Serve Locally
```bash
bentoml serve service:Summarization

# Serves at http://localhost:3000
# Swagger UI at http://localhost:3000/docs
```
Test the API
```bash
curl -X POST http://localhost:3000/summarize \
  -H "Content-Type: application/json" \
  -d '{"texts": ["BentoML is a Python library for building online serving systems optimized for AI apps and model inference. It lets you easily build APIs for any AI/ML model and simplifies Docker container management."]}'
```
Alternatively, call the endpoint with BentoML's Python client:
```python
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    result = client.summarize(texts=[
        "BentoML simplifies deploying ML models to production."
    ])
    print(result)
```
Containerize with Docker
BentoML can package your service into a portable Docker image — ideal for reproducible deployments and scaling.
Install Docker
```bash
sudo apt install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker bentoml
newgrp docker
```
Create the Bento Build File
```yaml
# bentofile.yaml
service: 'service:Summarization'
labels:
  owner: my-team
  project: summarization-api
include:
  - '*.py'
python:
  packages:
    - torch
    - transformers
docker:
  python_version: "3.11"
  distro: debian
```
Build and Containerize
```bash
# Build the Bento
bentoml build

# List available Bentos
bentoml list

# Containerize into a Docker image
bentoml containerize summarization:latest

# Verify the image
docker images | grep summarization
```
Run the Docker Container
```bash
docker run -d \
  --name bentoml-summarization \
  -p 3000:3000 \
  --restart unless-stopped \
  summarization:latest
```
Systemd Service (Non-Docker)
If you prefer running BentoML directly without Docker, create a systemd unit file at /etc/systemd/system/bentoml.service for automatic startup and process management:
```ini
[Unit]
Description=BentoML Inference Server
After=network.target

[Service]
Type=simple
User=bentoml
WorkingDirectory=/home/bentoml/my-bento-service
Environment=PATH=/home/bentoml/bentoml-env/bin:/usr/bin:/bin
ExecStart=/home/bentoml/bentoml-env/bin/bentoml serve service:Summarization --host 0.0.0.0 --port 3000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
Reload systemd, then enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now bentoml
sudo systemctl status bentoml
```
Production Hardening
Nginx Reverse Proxy with SSL
Never expose BentoML directly to the internet in production. Set up Nginx as a reverse proxy with Let's Encrypt SSL:
```bash
sudo apt install -y nginx certbot python3-certbot-nginx
```
Create /etc/nginx/sites-available/bentoml with the following server block:
```nginx
server {
    listen 80;
    server_name api.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for streaming
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Increase timeouts for model inference
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
```
Enable the site, reload Nginx, obtain a certificate, and lock down the firewall:
```bash
sudo ln -s /etc/nginx/sites-available/bentoml /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx

# Obtain SSL certificate
sudo certbot --nginx -d api.yourdomain.com

# Lock down the firewall for production
sudo ufw delete allow 3000/tcp
sudo ufw allow 'Nginx Full'
```
API Authentication
Add a simple API key middleware for basic access control. Create auth.py:
```python
# auth.py
import os

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse

API_KEY = os.environ.get('BENTOML_API_KEY', 'change-me')


class APIKeyMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Leave health checks and API docs unauthenticated
        if request.url.path in ('/healthz', '/docs', '/schema'):
            return await call_next(request)
        api_key = request.headers.get('Authorization', '').replace('Bearer ', '')
        if api_key != API_KEY:
            return JSONResponse(
                status_code=401,
                content={"error": "Invalid API key"}
            )
        return await call_next(request)
```
Resource Limits
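One caveat in the middleware above: a plain `!=` string comparison can leak information through timing differences. A safer check uses the standard library's hmac.compare_digest. The helper below is an illustrative sketch — the is_authorized name is not part of BentoML or Starlette:

```python
import hmac
import os

# Assumed to be set in the service environment, as with auth.py above
EXPECTED_KEY = os.environ.get("BENTOML_API_KEY", "change-me")


def is_authorized(authorization_header: str, expected_key: str = EXPECTED_KEY) -> bool:
    """Validate a 'Bearer <key>' header with a constant-time comparison."""
    prefix = "Bearer "
    if not authorization_header.startswith(prefix):
        return False
    candidate = authorization_header[len(prefix):]
    # compare_digest takes the same time regardless of where the strings differ
    return hmac.compare_digest(candidate, expected_key)
```

Inside the middleware, the `if api_key != API_KEY:` line would become `if not hmac.compare_digest(api_key, API_KEY):`.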
Configure BentoML's resource management to prevent OOM errors:
```python
@bentoml.service(
    resources={
        "cpu": "2",
        "memory": "2Gi",
    },
    traffic={
        "timeout": 120,
        "max_concurrency": 4,
    },
)
```
For Docker deployments, set container memory limits:
```bash
docker run -d \
  --name bentoml-summarization \
  --memory=3g --memory-swap=4g \
  -p 3000:3000 \
  --restart unless-stopped \
  summarization:latest
```
Health Checks & Log Management
BentoML exposes a built-in health check endpoint at /healthz. Use it for monitoring and load balancer health checks:
```bash
curl http://localhost:3000/healthz
# Returns: {"status": "ok"}
```
View Logs
```bash
# Systemd service logs
journalctl -u bentoml -f --no-pager

# Docker container logs
docker logs -f bentoml-summarization
```
Updating & Redeploying
When you update your model or service code, follow this workflow:
- Update your service.py or model dependencies
- Rebuild the Bento: bentoml build
- Rebuild the Docker image: bentoml containerize summarization:latest
- Stop the old container: docker stop bentoml-summarization && docker rm bentoml-summarization
- Run the new container with the same docker run command from the Run the Docker Container step
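The workflow above can be captured in a small redeploy script. This sketch writes the script via a heredoc so it can be pasted in one go; the Bento and container names follow the earlier examples and may need adjusting for your setup:

```bash
cat > redeploy.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Rebuild the Bento and the Docker image from the current source
bentoml build
bentoml containerize summarization:latest

# Replace the running container ("|| true" tolerates a missing old container)
docker stop bentoml-summarization || true
docker rm bentoml-summarization || true
docker run -d \
  --name bentoml-summarization \
  -p 3000:3000 \
  --restart unless-stopped \
  summarization:latest
EOF
chmod +x redeploy.sh
```

Run it with ./redeploy.sh after each code or model change.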
For zero-downtime deployments, consider running two containers behind your Nginx reverse proxy and switching traffic between them.
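One way to sketch this with Nginx's upstream module (the upstream name, file path, and ports here are illustrative assumptions, not part of the earlier config): run the old and new containers on two ports, point the site's proxy_pass at the upstream, and swap which server is marked backup before reloading Nginx.

```nginx
# /etc/nginx/conf.d/bentoml-upstream.conf (illustrative path)
upstream bentoml_backend {
    server 127.0.0.1:3001;           # active container
    server 127.0.0.1:3002 backup;    # standby container, used only if the active one fails
}
```

In the site config, replace proxy_pass http://127.0.0.1:3000; with proxy_pass http://bentoml_backend;, then edit the upstream and run sudo systemctl reload nginx to switch traffic.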
Troubleshooting
Out of memory during model load
Upgrade to a VPS with more RAM (8 GB+) or use a smaller model variant.
Port 3000 already in use
Check with sudo lsof -i :3000 — kill the process or change the port.
bentoml command not found
Ensure your virtual environment is activated: source ~/bentoml-env/bin/activate
Slow first request
Normal — the model loads into memory on the first request. Subsequent requests will be much faster.
Docker build fails
Ensure Docker has enough disk space: docker system prune -a to clean up old images.
SSL certificate renewal fails
Verify your domain's DNS points to the VPS IP and port 80 is accessible.
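The slow-first-request and health-check behaviors above can both be handled in deploy automation by polling /healthz before routing traffic to a freshly started container. A minimal sketch using only the Python standard library — the wait_until_healthy name and its defaults are illustrative:

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; keep retrying
        time.sleep(interval)
    return False


# Example: wait_until_healthy("http://localhost:3000/healthz")
```

Calling this right after docker run also serves as a warm-up request, so the first real client never pays the model-loading cost.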
BentoML Deployed Successfully!
Your BentoML AI inference API is now running in production on a RamNode VPS with Docker containerization, Nginx reverse proxy, SSL encryption, and API authentication.
Next Steps:
- Multi-model pipelines: Chain multiple models in a single service for RAG or multi-stage inference
- GPU inference: Deploy on a GPU-equipped VPS for compute-heavy models (LLMs, Stable Diffusion)
- Custom Docker images: Use BentoML's image API for optimized containers with specific CUDA versions
- Horizontal scaling: Run multiple BentoML containers behind a load balancer for high-throughput workloads
