Deploy SGLang on a VPS
A high-performance LLM serving framework with a fully supported CPU backend — run Llama 3.2, Qwen 2.5, or Phi-3 on commodity Xeon hardware behind an OpenAI-compatible endpoint, no GPU required.
At a Glance
| Project | sgl-project/sglang |
| Stack | Python 3.12 + PyTorch (CPU) + SGLang SRT |
| Recommended Plan | Cloud VPS 8GB (Qwen 2.5 3B INT8); 12GB+ for Llama 3.2 3B BF16 |
| OS | Ubuntu 24.04 LTS |
| Reverse Proxy | Nginx with bearer-token auth + Let's Encrypt |
Sizing rule of thumb
Allocate model weights + 30 to 50% extra for KV cache, request buffers, and the Python runtime. A 3B BF16 model is ~6GB on disk but wants 10 to 12GB RAM to serve without thrashing. Always pre-download weights before the first server start.
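As a quick sanity check before choosing a plan, the arithmetic is easy to script. The snippet below is an illustrative sketch, not part of SGLang; the 40% overhead factor is an assumption taken from the rule of thumb above.
# Back-of-envelope serving-RAM estimate: weights (params x bytes per param)
# plus ~30-50% overhead for KV cache, request buffers, and the Python runtime.
def estimate_ram_gib(params_billions, bytes_per_param, overhead=0.4):
    weights_gib = params_billions * 1e9 * bytes_per_param / 2**30
    return weights_gib * (1 + overhead)

print(f"3B BF16:   ~{estimate_ram_gib(3.0, 2):.1f} GiB")   # ~7.8 GiB before OS and runtime
print(f"3B INT8:   ~{estimate_ram_gib(3.0, 1):.1f} GiB")   # ~3.9 GiB
print(f"1.5B BF16: ~{estimate_ram_gib(1.5, 2):.1f} GiB")   # ~3.9 GiB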
Server Preparation
apt update && apt upgrade -y
apt install -y build-essential git curl wget htop \
python3.12 python3.12-venv python3-pip \
libnuma-dev numactl ca-certificates
useradd -m -s /bin/bash sglang
usermod -aG sudo sglang
mkdir -p /home/sglang/.ssh
cp /root/.ssh/authorized_keys /home/sglang/.ssh/
chown -R sglang:sglang /home/sglang/.ssh
chmod 700 /home/sglang/.ssh && chmod 600 /home/sglang/.ssh/authorized_keys
On under-8GB plans, add a 4GB swap file so HuggingFace's loader does not get OOM-killed during the load phase (swap is not a substitute for RAM during inference itself):
fallocate -l 4G /swapfile
chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
Install SGLang in CPU Mode
Switch to the sglang user, create a venv, then install the CPU build of PyTorch first. This is the single most common pitfall — if pip resolves to the CUDA wheel, SGLang will fail to import on a CPU host with cryptic CUDA errors.
su - sglang
python3.12 -m venv ~/sglang-env
source ~/sglang-env/bin/activate
pip install --upgrade pip wheel
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cpu
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"The output should end with +cpu False. If it says True, uninstall and reinstall against the CPU index.
pip install "sglang[srt]"
pip install transformers accelerate sentencepiece protobuf
python -c "import sglang; print(sglang.__version__)"Download a Model
Pre-pull weights so the first server start does not block on a slow HuggingFace download. Qwen 2.5 1.5B is the easiest smoke-test; bump to 3B if you have 8GB+ free.
pip install huggingface_hub
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct \
--local-dir ~/models/qwen2.5-1.5b-instruct
For gated models (Llama 3.2, etc.), run huggingface-cli login first and paste an access token from your HF account.
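If you would rather script the download (for example from a provisioning tool), huggingface_hub exposes the same functionality in Python. This is a sketch using the same paths as above; gated repos additionally need a token argument or a prior huggingface-cli login.
# Programmatic equivalent of the huggingface-cli download command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-1.5B-Instruct",
    local_dir="/home/sglang/models/qwen2.5-1.5b-instruct",
)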
Launch the SGLang Server
export SGLANG_USE_CPU_ENGINE=1
python -m sglang.launch_server \
--model-path ~/models/qwen2.5-1.5b-instruct \
--device cpu \
--host 127.0.0.1 \
--port 30000 \
--mem-fraction-static 0.8 \
--max-total-tokens 8192 \
--disable-overlap-schedule \
--trust-remote-code
Startup takes 30 to 90 seconds while weights load; wait for the log line "The server is fired up and ready to roll!". --host 127.0.0.1 keeps the API loopback-only. Never bind to 0.0.0.0: as launched here, the API is completely unauthenticated. --disable-overlap-schedule is recommended for the CPU backend.
curl http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-1.5b-instruct",
"messages": [{"role": "user", "content": "Write one sentence about RamNode."}],
"temperature": 0.7,
"max_tokens": 80
}'
Run as a systemd Service
Create /etc/systemd/system/sglang.service:
[Unit]
Description=SGLang Inference Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=sglang
Group=sglang
WorkingDirectory=/home/sglang
Environment="SGLANG_USE_CPU_ENGINE=1"
Environment="HF_HOME=/home/sglang/.cache/huggingface"
Environment="PATH=/home/sglang/sglang-env/bin:/usr/local/bin:/usr/bin:/bin"
ExecStart=/home/sglang/sglang-env/bin/python -m sglang.launch_server \
--model-path /home/sglang/models/qwen2.5-1.5b-instruct \
--device cpu \
--host 127.0.0.1 \
--port 30000 \
--mem-fraction-static 0.8 \
--max-total-tokens 8192 \
--disable-overlap-schedule \
--trust-remote-code
Restart=on-failure
RestartSec=15
TimeoutStartSec=300
LimitNOFILE=65535
# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/home/sglang
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
[Install]
WantedBy=multi-user.target
The generous TimeoutStartSec=300 matters: model load on a small CPU VPS legitimately takes a couple of minutes, and you do not want systemd killing the service while weights are still mapping into memory.
sudo systemctl daemon-reload
sudo systemctl enable --now sglang
sudo journalctl -u sglang -f
Nginx with Bearer-Token Auth + TLS
sudo apt install -y nginx certbot python3-certbot-nginx
sudo certbot --nginx -d llm.yourdomain.com
# Generate a strong API key
openssl rand -hex 32
Replace the auto-generated nginx site at /etc/nginx/sites-available/llm.yourdomain.com:
map $http_authorization $auth_ok {
default 0;
"Bearer sk-ramnode-PASTE_YOUR_KEY_HERE" 1;
}
server {
listen 80;
listen [::]:80;
server_name llm.yourdomain.com;
return 301 https://$host$request_uri;
}
server {
listen 443 ssl http2;
listen [::]:443 ssl http2;
server_name llm.yourdomain.com;
ssl_certificate /etc/letsencrypt/live/llm.yourdomain.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/llm.yourdomain.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
client_max_body_size 8m;
proxy_read_timeout 600s;
proxy_send_timeout 600s;
proxy_buffering off;
location / {
if ($auth_ok = 0) {
return 401 '{"error":"unauthorized"}';
}
# Serve the 401 JSON body with the right type without overriding upstream headers
default_type application/json;
proxy_pass http://127.0.0.1:30000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Connection "";
}
}
Test the config and reload nginx with sudo nginx -t && sudo systemctl reload nginx, then verify:
# Should return 401
curl -s https://llm.yourdomain.com/v1/models
# Should return the served model list
curl -s https://llm.yourdomain.com/v1/models \
-H "Authorization: Bearer sk-ramnode-PASTE_YOUR_KEY_HERE"Any OpenAI SDK now works against your endpoint by setting base_url and api_key. For multi-tenant scenarios, swap the simple map directive for oauth2-proxy or a small FastAPI sidecar that validates per-tenant keys.
Benchmark Your Deployment
source ~/sglang-env/bin/activate
python -m sglang.bench_serving \
--backend sglang \
--model qwen2.5-1.5b-instruct \
--base-url http://127.0.0.1:30000 \
--dataset-name random \
--random-input 256 \
--random-output 128 \
--num-prompts 50 \
--max-concurrency 4
Watch Mean TTFT (time to first token) and Output throughput. On a 4 vCPU box with a 1.5B model, expect roughly 10 to 30 tokens/sec per request, with aggregate throughput rising as you batch concurrent requests. If the numbers are unacceptable: pick a smaller model, switch to a more aggressive quantization (INT8, AWQ 4-bit), or move to a GPU-backed deployment; the API surface is identical, so client code does not change.
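For a quick TTFT spot check outside the benchmark harness, a hand-timed streaming request is enough. This sketch assumes the openai package and talks to the local port directly, so the API key is a dummy value:
# Rough time-to-first-token measurement against the loopback server (no nginx).
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="unused")

start = time.perf_counter()
first_token = None
stream = client.chat.completions.create(
    model="qwen2.5-1.5b-instruct",
    messages=[{"role": "user", "content": "Explain KV cache in two sentences."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    if first_token is None and chunk.choices and chunk.choices[0].delta.content:
        first_token = time.perf_counter()  # first generated token arrived
print(f"TTFT: {first_token - start:.2f} s")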
Troubleshooting
- Server hangs at startup and never logs "ready": almost always a memory issue. Watch htop; if the process is OOM-killed, drop to a smaller model or resize.
- ImportError related to CUDA: you have the CUDA build of PyTorch. Run pip uninstall -y torch torchvision torchaudio and reinstall with the CPU index URL.
- Inference dramatically slower than expected: check vmstat 2 for swap activity. High swap means the model is overflowing RAM.
- HTTP 401 with a valid header: the nginx map is a literal string match. Reproduce with curl -v; whitespace, a missing Bearer prefix, or trailing newlines all fail.
- Custom architectures (Qwen, DeepSeek): ensure --trust-remote-code is in both the systemd unit and any manual launch.
