Overview
TorchServe is an open-source, production-grade model serving framework built by AWS and Meta for deploying PyTorch models at scale. It provides a robust REST API and gRPC interface for real-time inference, batch processing, model management, and monitoring — all without building custom serving infrastructure.
What You Will Build
- A TorchServe instance running behind a reverse proxy with TLS termination
- A pre-packaged DenseNet-161 image classification model as a working example
- Systemd service management for automatic restarts and boot persistence
- Monitoring and logging infrastructure for production observability
- Firewall rules and security hardening for public-facing inference APIs
Prerequisites
Recommended VPS Specifications
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 2 vCPUs | 4+ vCPUs |
| RAM | 4 GB | 8–16 GB |
| Storage | 40 GB SSD | 80+ GB NVMe SSD |
| OS | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS |
| Network | 1 Gbps | 1 Gbps unmetered |
💡 Sizing Tip: For lightweight models (ResNet, BERT-base), a 4 GB plan works well. For larger models (GPT-2, ViT-Large) or concurrent requests, start with 8 GB+. TorchServe's memory footprint scales directly with model size and worker count.
Software Requirements
- SSH access to your RamNode VPS with root or sudo privileges
- A domain name pointed to your VPS IP (optional, for TLS configuration)
- Basic familiarity with Linux system administration and PyTorch concepts
Initial Server Setup
System Updates and Dependencies
Java is required because TorchServe's model server runs on the JVM. Python 3.10+ is needed for the model handling and archiving tools.
sudo apt update && sudo apt upgrade -y
sudo apt install -y openjdk-17-jdk python3 python3-pip python3-venv \
git wget curl unzip nginx certbot python3-certbot-nginx

Create a Dedicated Service User
Running TorchServe under a dedicated non-root user limits the blast radius of any potential vulnerabilities.
sudo useradd -r -m -d /opt/torchserve -s /bin/bash torchserve
sudo mkdir -p /opt/torchserve/{model-store,logs,config}
sudo chown -R torchserve:torchserve /opt/torchserve

Install TorchServe
Set Up the Python Environment
sudo -u torchserve bash -c '
cd /opt/torchserve
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
'

Install PyTorch and TorchServe
Install the CPU-optimized build. For GPU-equipped plans, substitute the CUDA-enabled variant.
sudo -u torchserve bash -c '
source /opt/torchserve/venv/bin/activate
pip install torch torchvision --index-url \
https://download.pytorch.org/whl/cpu
pip install torchserve torch-model-archiver torch-workflow-archiver
'

Verify Installation
sudo -u torchserve bash -c '
source /opt/torchserve/venv/bin/activate
torchserve --version
torch-model-archiver --version
'

Both commands should return version numbers. If you see import failures, verify Java 17 is accessible by running java -version.
Package and Register a Model
TorchServe uses the MAR (Model Archive) format to bundle model weights, handler code, and metadata into a single deployable artifact. This example uses DenseNet-161 for image classification.
Download the Pre-trained Model
sudo -u torchserve bash -c '
source /opt/torchserve/venv/bin/activate
cd /opt/torchserve
# Download DenseNet-161 model weights
wget -q https://download.pytorch.org/models/densenet161-8d451a50.pth
# Download the DenseNet-161 model definition used by the archive
wget -q https://raw.githubusercontent.com/pytorch/serve/master/\
examples/image_classifier/densenet_161/model.py
# Download ImageNet class labels
wget -q https://raw.githubusercontent.com/pytorch/serve/master/\
examples/image_classifier/index_to_name.json
'

Create the Model Archive

The image_classifier handler loads the saved state_dict into the model class defined in model.py, so the archive needs a real model definition rather than an empty placeholder.

sudo -u torchserve bash -c '
source /opt/torchserve/venv/bin/activate
cd /opt/torchserve
torch-model-archiver \
  --model-name densenet161 \
  --version 1.0 \
  --model-file model.py \
  --serialized-file densenet161-8d451a50.pth \
  --handler image_classifier \
  --extra-files index_to_name.json \
  --export-path model-store/
'

This produces model-store/densenet161.mar, the deployable artifact TorchServe will load.
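A .mar file is a standard zip archive, so you can verify what was bundled without starting the server. A small stdlib-only sketch (the path assumes the example above):

```python
import zipfile


def list_mar_contents(mar_path):
    """List the files bundled inside a TorchServe model archive (.mar is a zip)."""
    with zipfile.ZipFile(mar_path) as mar:
        return sorted(mar.namelist())


# Example: list_mar_contents("model-store/densenet161.mar") should include a
# manifest entry alongside the serialized weights and the extra files.
```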
Configure TorchServe
The configuration below is tuned for a RamNode VPS with 4–8 GB RAM, binding all APIs to localhost so they are reachable only through the Nginx reverse proxy. Save it as /opt/torchserve/config/config.properties:
# TorchServe Configuration — RamNode VPS
# Inference API (port 8080, localhost only)
inference_address=http://127.0.0.1:8080
# Management API (port 8081, localhost only)
management_address=http://127.0.0.1:8081
# Metrics API (port 8082, localhost only)
metrics_address=http://127.0.0.1:8082
# Model store directory
model_store=/opt/torchserve/model-store
# Load all models on startup
load_models=all
# Worker configuration
default_workers_per_model=1
job_queue_size=100
# Memory management
vmargs=-Xmx2g -XX:+UseG1GC -XX:MaxGCPauseMillis=100
# Request limits
max_request_size=6553600
max_response_size=6553600
# Logging
async_logging=true

Configuration Parameters
| Parameter | Value | Purpose |
|---|---|---|
| inference_address | 127.0.0.1:8080 | Binds inference API to localhost for reverse proxy access only |
| management_address | 127.0.0.1:8081 | Restricts model management to local access |
| default_workers_per_model | 1 | One worker per model; increase based on available RAM |
| vmargs -Xmx2g | 2 GB heap | JVM heap limit; set to ~25% of total RAM |
| job_queue_size | 100 | Max queued requests before rejecting new ones |
| max_request_size | 6553600 | ~6.25 MB limit for image uploads |
⚠️ Memory Planning: Each TorchServe worker loads a full copy of the model into memory. A single DenseNet-161 worker requires ~300 MB. Formula: total memory = (model size × workers) + 2 GB JVM overhead + 1 GB OS buffer.
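The formula in the tip above can be wrapped in a quick sizing helper (the 2 GB JVM and 1 GB OS defaults come straight from that formula; model sizes are illustrative):

```python
def torchserve_memory_mb(model_size_mb, workers,
                         jvm_overhead_mb=2048, os_buffer_mb=1024):
    """Estimated RAM: (model size x workers) + JVM overhead + OS buffer."""
    return model_size_mb * workers + jvm_overhead_mb + os_buffer_mb


# One DenseNet-161 worker (~300 MB): torchserve_memory_mb(300, 1) -> 3372 MB,
# which fits a 4 GB plan; four workers -> 4272 MB, so plan for 8 GB.
```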
Create a Systemd Service

Create /etc/systemd/system/torchserve.service with the following unit:
[Unit]
Description=TorchServe Model Serving
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=torchserve
Group=torchserve
WorkingDirectory=/opt/torchserve
Environment=JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Environment=PATH=/opt/torchserve/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/torchserve/venv/bin/torchserve \
  --start \
  --foreground \
  --ts-config /opt/torchserve/config/config.properties \
  --model-store /opt/torchserve/model-store \
  --ncs
ExecStop=/opt/torchserve/venv/bin/torchserve --stop
Restart=on-failure
RestartSec=10
LimitNOFILE=65536
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target

The --foreground flag keeps TorchServe attached to the process systemd launched; without it, torchserve detaches into the background and Type=simple would treat the service as exited. Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable torchserve
sudo systemctl start torchserve
# Verify the service is running
sudo systemctl status torchserve
curl -s http://127.0.0.1:8080/ping

The ping endpoint should return {"status": "Healthy"} once models finish loading. Initial startup may take 30–60 seconds.
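For scripted deployments it helps to wait for that health check programmatically. A stdlib-only sketch (the URL and timeout values are assumptions to adjust):

```python
import json
import time
import urllib.request


def is_healthy(body):
    """True when the ping response reports a healthy server."""
    return json.loads(body).get("status") == "Healthy"


def wait_for_torchserve(url="http://127.0.0.1:8080/ping",
                        timeout_s=90, interval_s=3):
    """Poll the ping endpoint until it reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_healthy(resp.read()):
                    return True
        except (OSError, ValueError):
            pass  # server not up yet, or a non-JSON startup response
        time.sleep(interval_s)
    return False
```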
Nginx Reverse Proxy with TLS
Nginx handles TLS termination, request buffering, and rate limiting, and gives clients a clean API boundary. Create /etc/nginx/sites-available/torchserve:
upstream torchserve_inference {
    server 127.0.0.1:8080;
    keepalive 32;
}

# Rate-limiting zone — limit_req_zone is only valid at the http level,
# so it must sit outside the server block
limit_req_zone $binary_remote_addr zone=inference:10m rate=10r/s;

server {
    listen 80;
    server_name your-domain.com;

    # Inference API
    location /predictions/ {
        limit_req zone=inference burst=20 nodelay;
        proxy_pass http://torchserve_inference;
        # HTTP/1.1 with an empty Connection header is required for
        # upstream keepalive to take effect
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;
        client_max_body_size 10m;
    }

    # Health check endpoint
    location /ping {
        proxy_pass http://torchserve_inference;
    }

    # Block management API paths from external access
    location /models {
        return 403;
    }
}

Enable the site and obtain a certificate:

sudo ln -s /etc/nginx/sites-available/torchserve /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
# Obtain TLS certificate
sudo certbot --nginx -d your-domain.com

Firewall Configuration
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 'Nginx Full'
sudo ufw enable
# Verify rules
sudo ufw status verbose

🔒 Security Note: The TorchServe management API (port 8081) and metrics API (port 8082) are bound to localhost and never proxied by Nginx, and UFW blocks any direct external access. For remote management, use an SSH tunnel: ssh -L 8081:127.0.0.1:8081 user@your-vps-ip
Test Your Deployment
Health Check
curl -s https://your-domain.com/ping | python3 -m json.tool
# Expected: {"status": "Healthy"}

Run an Inference Request
# Download a test image
wget -q -O kitten.jpg \
https://raw.githubusercontent.com/pytorch/serve/master/\
examples/image_classifier/kitten.jpg
# Send inference request
curl -X POST https://your-domain.com/predictions/densenet161 \
-T kitten.jpg \
-H 'Content-Type: application/octet-stream' | python3 -m json.tool

Expected Response
{
"tabby": 0.4664836823940277,
"tiger_cat": 0.4645617604255676,
"Egyptian_cat": 0.06619937717914581,
"lynx": 0.0012969186063855886,
"plastic_bag": 0.00022856894403230399
}

Model Management (localhost only)
# Check model status
curl -s http://127.0.0.1:8081/models | python3 -m json.tool
# Scale workers for higher throughput
curl -X PUT 'http://127.0.0.1:8081/models/densenet161?min_worker=2&max_worker=4'

Monitoring & Logging
TorchServe Metrics
TorchServe exposes Prometheus-compatible metrics on the metrics API endpoint:
curl -s http://127.0.0.1:8082/metrics

Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| ts_inference_latency_microsecond | Per-request inference time | > 5000ms p99 |
| ts_queue_latency_microsecond | Time spent in request queue | > 1000ms p99 |
| ts_inference_requests_total | Total requests processed | Trending to zero |
| MemoryUsed | JVM heap consumption | > 85% of Xmx |
| WorkerThreadTime | Worker processing time | Increasing trend |
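The metrics endpoint speaks the Prometheus text format. For quick spot checks without a Prometheus server, a minimal parser like this sketch is enough (it skips comment lines and assumes samples carry no trailing timestamp and no spaces inside label values):

```python
def parse_prometheus(text):
    """Map each sample line to its float value, keyed by 'name{labels}'."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        key, _, value = line.rpartition(" ")
        try:
            samples[key] = float(value)
        except ValueError:
            continue  # not a simple 'name value' sample line
    return samples


# Example: parse_prometheus(metrics_text) lets you read one value, e.g. the
# total request count for a given model, straight out of a curl response.
```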
Journald Log Access
# View live logs
sudo journalctl -u torchserve -f
# View logs from the last hour
sudo journalctl -u torchserve --since '1 hour ago'
# Filter for errors only
sudo journalctl -u torchserve -p err

Log Rotation

TorchServe also writes file logs under /opt/torchserve/logs. Create /etc/logrotate.d/torchserve to keep them bounded:
/opt/torchserve/logs/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
copytruncate
}

Deploy Your Own Models
Write a Custom Handler
import json

import torch
from ts.torch_handler.base_handler import BaseHandler


class MyModelHandler(BaseHandler):
    def preprocess(self, data):
        """Transform raw input into model-ready tensors."""
        inputs = []
        for row in data:
            input_data = row.get("data") or row.get("body")
            if isinstance(input_data, (bytes, bytearray)):
                input_data = input_data.decode("utf-8")
            parsed = json.loads(input_data)
            inputs.append(torch.tensor(parsed["input"]))
        return torch.stack(inputs)

    def postprocess(self, inference_output):
        """Convert model output to API response."""
        return inference_output.tolist()

Archive and Register
sudo -u torchserve bash -c '
source /opt/torchserve/venv/bin/activate
torch-model-archiver \
--model-name my_model \
--version 1.0 \
--serialized-file /path/to/model.pt \
--handler handlers/custom_handler.py \
--export-path model-store/
'
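Because the handler above expects a JSON body with an "input" field, it is worth sanity-checking payloads before registering the model. This pure-Python sketch mirrors the parsing in MyModelHandler.preprocess without needing torch installed:

```python
import json


def parse_handler_payload(raw):
    """Mirror the handler's parsing: decode bytes, load JSON, pull 'input'."""
    if isinstance(raw, (bytes, bytearray)):
        raw = raw.decode("utf-8")
    parsed = json.loads(raw)
    if "input" not in parsed:
        raise ValueError("payload must contain an 'input' field")
    return parsed["input"]


# A request body a client would POST to /predictions/my_model:
body = json.dumps({"input": [0.1, 0.2, 0.3]}).encode("utf-8")
```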
# Register with the running server (hot-deploy)
curl -X POST 'http://127.0.0.1:8081/models?url=my_model.mar&model_name=my_model&initial_workers=1'

Performance Tuning
Worker Scaling Guidelines
| VPS Plan | RAM | Workers | Estimated Throughput |
|---|---|---|---|
| Standard 4 GB | 4 GB | 1–2 | 10–20 req/sec |
| Standard 8 GB | 8 GB | 2–4 | 20–50 req/sec |
| Standard 16 GB | 16 GB | 4–8 | 50–120 req/sec |
CPU Optimization
Enable OpenMP and Intel MKL threading for CPU-bound inference by adding these lines to the [Service] section of the systemd unit, matching the thread counts to your vCPU count:

Environment=OMP_NUM_THREADS=4
Environment=MKL_NUM_THREADS=4

Run sudo systemctl daemon-reload and restart the service for the change to take effect.

Request Batching
Dynamic batching improves throughput by combining multiple requests into a single forward pass:
# Batching is configured per model at registration time via the management API.
# Unregister first if the model is already loaded:
curl -X DELETE http://127.0.0.1:8081/models/densenet161
curl -X POST 'http://127.0.0.1:8081/models?url=densenet161.mar&batch_size=4&max_batch_delay=100&initial_workers=1'

The max_batch_delay (ms) controls how long TorchServe waits to fill a batch. Lower values reduce latency; higher values improve throughput under load.
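To reason about the tradeoff, compare throughput and worst-case latency with and without batching. A back-of-the-envelope sketch (all timings are illustrative assumptions, not measurements):

```python
def batching_tradeoff(single_ms, batch_ms, batch_size, max_batch_delay_ms):
    """Throughput (req/s) without and with batching, plus worst-case latency.

    Worst case: a request arrives first, waits the full batch delay,
    then waits for the whole batched forward pass.
    """
    unbatched_rps = 1000.0 / single_ms
    batched_rps = batch_size * 1000.0 / batch_ms
    worst_latency_ms = max_batch_delay_ms + batch_ms
    return unbatched_rps, batched_rps, worst_latency_ms


# 80 ms single-request pass, 120 ms for a batch of 4, 100 ms max delay:
# throughput rises from 12.5 to ~33 req/s; worst-case latency is 220 ms.
```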
Troubleshooting
TorchServe fails to start
Java not found or wrong version. Run java -version and ensure OpenJDK 17 is installed.
curl /ping returns connection refused
Service not running or still loading. Check systemctl status torchserve and wait 30–60 seconds for model loading.
Out of memory errors
Model + JVM exceeds available RAM. Reduce -Xmx, decrease workers, or upgrade your VPS plan.
502 Bad Gateway from Nginx
TorchServe not responding on port 8080. Verify the service is running and check logs with journalctl -u torchserve.
Slow inference (>5s per request)
Insufficient CPU or unoptimized threading. Set OMP/MKL thread counts and enable request batching.
Model not loading
Corrupt .mar file or missing dependencies. Re-archive the model and check handler imports.
TorchServe Deployed Successfully!
Your TorchServe instance is now running in production on a RamNode VPS with Nginx reverse proxy, TLS encryption, rate limiting, and Prometheus-compatible monitoring.
Next Steps:
- Integrate Prometheus and Grafana for real-time inference dashboards
- Set up A/B testing with TorchServe's model versioning API
- Deploy multiple specialized models on the same instance
- Implement client-side request batching for high-throughput workloads
- Configure webhook-based model updates from your CI/CD pipeline
