What Is BentoML?
BentoML is an open-source Python framework for building production-ready AI inference APIs. It lets you turn any machine learning model — from text summarization to image generation — into a scalable REST API with just a few lines of code. BentoML handles dynamic batching, model parallelism, Docker containerization, and environment management out of the box.
By deploying BentoML on a RamNode VPS, you maintain full control over your infrastructure, data, and costs — no vendor lock-in to managed ML platforms. This makes it ideal for developers, startups, and AI-focused teams who want affordable, high-performance model serving.
Prerequisites
Recommended VPS Specifications
| Requirement | Details |
|---|---|
| VPS Plan | RamNode Cloud VPS — 4 GB RAM minimum (8 GB+ for larger models) |
| Operating System | Ubuntu 24.04 LTS |
| CPU | 2+ vCPUs (4+ recommended for inference workloads) |
| Storage | 40 GB+ SSD (model artifacts can be large) |
| Python | 3.9 or higher (3.11 recommended) |
| Network | Public IPv4 address with SSH access |
💡 RamNode Pricing Advantage: RamNode Cloud VPS plans start at just $4/month with $500 in annual credits — significantly cheaper than managed ML platforms that charge per-inference or per-GPU-hour.
What You'll Need
- A provisioned RamNode VPS with Ubuntu 24.04 LTS installed
- SSH access to your server (root or sudo user)
- A domain name (optional, for HTTPS/reverse proxy)
- Basic familiarity with Python and the Linux command line
Initial Server Setup
Connect and Update
```bash
ssh root@YOUR_SERVER_IP
apt update && apt upgrade -y
reboot  # if kernel was updated
```
Create a Dedicated User
Running services as root is a security risk. Create a dedicated user for BentoML:
```bash
adduser bentoml
usermod -aG sudo bentoml
su - bentoml
```
Configure the Firewall
```bash
sudo ufw allow OpenSSH
sudo ufw allow 3000/tcp  # BentoML default port
sudo ufw enable
sudo ufw status
```
⚠️ Security Note: Port 3000 should only be exposed directly during development. In production, place BentoML behind a reverse proxy (Nginx) with HTTPS — covered in the Production Hardening section.
Python Environment Setup
Install Python 3.11 and Dependencies
Ubuntu 24.04 ships with Python 3.12, but BentoML works best with Python 3.11:
```bash
sudo apt install -y python3.11 python3.11-venv python3.11-dev \
    build-essential curl git
```
Create a Virtual Environment
```bash
python3.11 -m venv ~/bentoml-env
source ~/bentoml-env/bin/activate

# Verify Python version
python --version  # Should output Python 3.11.x

# Auto-activate on login
echo "source ~/bentoml-env/bin/activate" >> ~/.bashrc
```
Install BentoML
```bash
pip install --upgrade pip
pip install bentoml

# Verify the installation
bentoml --version
bentoml env
```
Build Your First BentoML Service
Create the Project Directory
```bash
mkdir ~/my-bento-service && cd ~/my-bento-service
```
Write the Service
Create service.py with a text summarization service powered by Hugging Face Transformers:
```python
# service.py
import bentoml


@bentoml.service(
    image=bentoml.images.Image(python_version="3.11")
    .python_packages("torch", "transformers"),
    resources={"cpu": "2"},
    traffic={"timeout": 120},
)
class Summarization:
    def __init__(self) -> None:
        import torch
        from transformers import pipeline

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = pipeline(
            "summarization",
            model="sshleifer/distilbart-cnn-12-6",
            device=device,
        )

    @bentoml.api(batchable=True)
    def summarize(self, texts: list[str]) -> list[str]:
        # Each pipeline result is a dict with a "summary_text" key
        results = self.pipeline(texts)
        return [item["summary_text"] for item in results]
```
💡 Model Choice: The distilbart-cnn-12-6 model is lightweight (~1.2 GB) and runs well on CPU — perfect for testing on a standard RamNode VPS. For GPU-accelerated inference, consider a plan with NVIDIA GPU support.
Install Model Dependencies
```bash
pip install torch transformers
```
Serve Locally
```bash
bentoml serve service:Summarization

# Serves at http://localhost:3000
# Swagger UI at http://localhost:3000/docs
```
Test the API
```bash
curl -X POST http://localhost:3000/summarize \
  -H "Content-Type: application/json" \
  -d '{"texts": ["BentoML is a Python library for building online serving systems optimized for AI apps and model inference. It lets you easily build APIs for any AI/ML model and simplifies Docker container management."]}'
```
Alternatively, call the endpoint with BentoML's Python client:
```python
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    result = client.summarize(texts=[
        "BentoML simplifies deploying ML models to production."
    ])
    print(result)
```
Containerize with Docker
BentoML can package your service into a portable Docker image — ideal for reproducible deployments and scaling.
Install Docker
```bash
sudo apt install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker bentoml
newgrp docker
```
Create the Bento Build File
```yaml
# bentofile.yaml
service: 'service:Summarization'
labels:
  owner: my-team
  project: summarization-api
include:
  - '*.py'
python:
  packages:
    - torch
    - transformers
docker:
  python_version: "3.11"
  distro: debian
```
Build and Containerize
```bash
# Build the Bento
bentoml build

# List available Bentos
bentoml list

# Containerize into a Docker image
bentoml containerize summarization:latest

# Verify the image
docker images | grep summarization
```
Run the Docker Container
```bash
docker run -d \
  --name bentoml-summarization \
  -p 3000:3000 \
  --restart unless-stopped \
  summarization:latest
```
Systemd Service (Non-Docker)
If you prefer running BentoML directly without Docker, create a systemd unit file at /etc/systemd/system/bentoml.service for automatic startup and process management:
```ini
[Unit]
Description=BentoML Inference Server
After=network.target

[Service]
Type=simple
User=bentoml
WorkingDirectory=/home/bentoml/my-bento-service
Environment=PATH=/home/bentoml/bentoml-env/bin:/usr/bin:/bin
ExecStart=/home/bentoml/bentoml-env/bin/bentoml serve service:Summarization --host 0.0.0.0 --port 3000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
Reload systemd, then enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now bentoml
sudo systemctl status bentoml
```
Production Hardening
Nginx Reverse Proxy with SSL
Never expose BentoML directly to the internet in production. Set up Nginx as a reverse proxy with Let's Encrypt SSL:
```bash
sudo apt install -y nginx certbot python3-certbot-nginx
```
Create /etc/nginx/sites-available/bentoml with the following server block:
```nginx
server {
    listen 80;
    server_name api.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for streaming
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Increase timeouts for model inference
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
```
Enable the site, reload Nginx, obtain a certificate, and lock down the firewall:
```bash
sudo ln -s /etc/nginx/sites-available/bentoml /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx

# Obtain SSL certificate
sudo certbot --nginx -d api.yourdomain.com

# Lock down the firewall for production
sudo ufw delete allow 3000/tcp
sudo ufw allow 'Nginx Full'
```
API Authentication
Add a simple API key middleware for basic access control. Create auth.py:
```python
# auth.py
import os

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse

API_KEY = os.environ.get('BENTOML_API_KEY', 'change-me')


class APIKeyMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Leave health checks and API docs unauthenticated
        if request.url.path in ('/healthz', '/docs', '/schema'):
            return await call_next(request)
        api_key = request.headers.get('Authorization', '').replace('Bearer ', '')
        if api_key != API_KEY:
            return JSONResponse(
                status_code=401,
                content={"error": "Invalid API key"}
            )
        return await call_next(request)
```
Resource Limits
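One caveat in the middleware above: a plain `!=` string comparison can leak information through timing differences. A safer check uses the standard library's hmac.compare_digest. The helper below is an illustrative sketch — the is_authorized name is not part of BentoML or Starlette:

```python
import hmac
import os

# Assumed to be set in the service environment, as with auth.py above
EXPECTED_KEY = os.environ.get("BENTOML_API_KEY", "change-me")


def is_authorized(authorization_header: str, expected_key: str = EXPECTED_KEY) -> bool:
    """Validate a 'Bearer <key>' header with a constant-time comparison."""
    prefix = "Bearer "
    if not authorization_header.startswith(prefix):
        return False
    candidate = authorization_header[len(prefix):]
    # compare_digest takes the same time regardless of where the strings differ
    return hmac.compare_digest(candidate, expected_key)
```

Inside the middleware, the `if api_key != API_KEY:` line would become `if not hmac.compare_digest(api_key, API_KEY):`.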
Configure BentoML's resource management to prevent OOM errors:
```python
@bentoml.service(
    resources={
        "cpu": "2",
        "memory": "2Gi",
    },
    traffic={
        "timeout": 120,
        "max_concurrency": 4,
    },
)
```
For Docker deployments, set container memory limits:
```bash
docker run -d \
  --name bentoml-summarization \
  --memory=3g --memory-swap=4g \
  -p 3000:3000 \
  --restart unless-stopped \
  summarization:latest
```
Health Checks & Log Management
BentoML exposes a built-in health check endpoint at /healthz. Use it for monitoring and load balancer health checks:
```bash
curl http://localhost:3000/healthz
# Returns: {"status": "ok"}
```
View Logs
```bash
# Systemd service logs
journalctl -u bentoml -f --no-pager

# Docker container logs
docker logs -f bentoml-summarization
```
Updating & Redeploying
When you update your model or service code, follow this workflow:
- Update your service.py or model dependencies
- Rebuild the Bento: bentoml build
- Rebuild the Docker image: bentoml containerize summarization:latest
- Stop the old container: docker stop bentoml-summarization && docker rm bentoml-summarization
- Run the new container with the same docker run command from the Run the Docker Container step
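The workflow above can be captured in a small redeploy script. This sketch writes the script via a heredoc so it can be pasted in one go; the Bento and container names follow the earlier examples and may need adjusting for your setup:

```bash
cat > redeploy.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Rebuild the Bento and the Docker image from the current source
bentoml build
bentoml containerize summarization:latest

# Replace the running container ("|| true" tolerates a missing old container)
docker stop bentoml-summarization || true
docker rm bentoml-summarization || true
docker run -d \
  --name bentoml-summarization \
  -p 3000:3000 \
  --restart unless-stopped \
  summarization:latest
EOF
chmod +x redeploy.sh
```

Run it with ./redeploy.sh after each code or model change.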
For zero-downtime deployments, consider running two containers behind your Nginx reverse proxy and switching traffic between them.
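One way to sketch this with Nginx's upstream module (the upstream name, file path, and ports here are illustrative assumptions, not part of the earlier config): run the old and new containers on two ports, point the site's proxy_pass at the upstream, and swap which server is marked backup before reloading Nginx.

```nginx
# /etc/nginx/conf.d/bentoml-upstream.conf (illustrative path)
upstream bentoml_backend {
    server 127.0.0.1:3001;           # active container
    server 127.0.0.1:3002 backup;    # standby container, used only if the active one fails
}
```

In the site config, replace proxy_pass http://127.0.0.1:3000; with proxy_pass http://bentoml_backend;, then edit the upstream and run sudo systemctl reload nginx to switch traffic.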
Troubleshooting
Out of memory during model load
Upgrade to a VPS with more RAM (8 GB+) or use a smaller model variant.
Port 3000 already in use
Check with sudo lsof -i :3000 — kill the process or change the port.
bentoml command not found
Ensure your virtual environment is activated: source ~/bentoml-env/bin/activate
Slow first request
Normal — the model loads into memory on the first request. Subsequent requests will be much faster.
Docker build fails
Ensure Docker has enough disk space: docker system prune -a to clean up old images.
SSL certificate renewal fails
Verify your domain's DNS points to the VPS IP and port 80 is accessible.
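The slow-first-request and health-check behaviors above can both be handled in deploy automation by polling /healthz before routing traffic to a freshly started container. A minimal sketch using only the Python standard library — the wait_until_healthy name and its defaults are illustrative:

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; keep retrying
        time.sleep(interval)
    return False


# Example: wait_until_healthy("http://localhost:3000/healthz")
```

Calling this right after docker run also serves as a warm-up request, so the first real client never pays the model-loading cost.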
BentoML Deployed Successfully!
Your BentoML AI inference API is now running in production on a RamNode VPS with Docker containerization, Nginx reverse proxy, SSL encryption, and API authentication.
Next Steps:
- Multi-model pipelines: Chain multiple models in a single service for RAG or multi-stage inference
- GPU inference: Deploy on a GPU-equipped VPS for compute-heavy models (LLMs, Stable Diffusion)
- Custom Docker images: Use BentoML's image API for optimized containers with specific CUDA versions
- Horizontal scaling: Run multiple BentoML containers behind a load balancer for high-throughput workloads
