Deploy SGLang on a VPS
A high-performance LLM serving framework with a fully supported CPU backend — run Llama 3.2, Qwen 2.5, or Phi-3 on commodity Xeon hardware behind an OpenAI-compatible endpoint, no GPU required.
At a Glance
| Project | sgl-project/sglang |
| Stack | Python 3.12 + PyTorch (CPU) + SGLang SRT |
| Recommended Plan | Cloud VPS 8GB (Qwen 2.5 3B INT8); 12GB+ for Llama 3.2 3B BF16 |
| OS | Ubuntu 24.04 LTS |
| Reverse Proxy | Nginx with bearer-token auth + Let's Encrypt |
Sizing rule of thumb
Allocate model weights + 30 to 50% extra for KV cache, request buffers, and the Python runtime. A 3B BF16 model is ~6GB on disk but wants 10 to 12GB RAM to serve without thrashing. Always pre-download weights before the first server start.
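As a quick sanity check before choosing a plan, the arithmetic is easy to script. The snippet below is an illustrative sketch, not part of SGLang; the 40% overhead factor is an assumption taken from the rule of thumb above.
# Back-of-envelope serving-RAM estimate: weights (params x bytes per param)
# plus ~30-50% overhead for KV cache, request buffers, and the Python runtime.
def estimate_ram_gib(params_billions, bytes_per_param, overhead=0.4):
    weights_gib = params_billions * 1e9 * bytes_per_param / 2**30
    return weights_gib * (1 + overhead)

print(f"3B BF16:   ~{estimate_ram_gib(3.0, 2):.1f} GiB")   # ~7.8 GiB before OS and runtime
print(f"3B INT8:   ~{estimate_ram_gib(3.0, 1):.1f} GiB")   # ~3.9 GiB
print(f"1.5B BF16: ~{estimate_ram_gib(1.5, 2):.1f} GiB")   # ~3.9 GiB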
Server Preparation
apt update && apt upgrade -y
apt install -y build-essential git curl wget htop \
python3.12 python3.12-venv python3-pip \
libnuma-dev numactl ca-certificates
useradd -m -s /bin/bash sglang
usermod -aG sudo sglang
mkdir -p /home/sglang/.ssh
cp /root/.ssh/authorized_keys /home/sglang/.ssh/
chown -R sglang:sglang /home/sglang/.ssh
chmod 700 /home/sglang/.ssh && chmod 600 /home/sglang/.ssh/authorized_keys
On under-8GB plans, add a 4GB swap file so HuggingFace's loader does not get OOM-killed during the load phase (swap is not a substitute for RAM during inference itself):
fallocate -l 4G /swapfile
chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
sysctl vm.swappiness=10
echo 'vm.swappiness=10' >> /etc/sysctl.conf
Install SGLang in CPU Mode
Switch to the sglang user, create a venv, then install the CPU build of PyTorch first. This is the single most common pitfall — if pip resolves to the CUDA wheel, SGLang will fail to import on a CPU host with cryptic CUDA errors.
su - sglang
python3.12 -m venv ~/sglang-env
source ~/sglang-env/bin/activate
pip install --upgrade pip wheel
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cpu
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"The output should end with +cpu False. If it says True, uninstall and reinstall against the CPU index.
pip install "sglang[srt]"
pip install transformers accelerate sentencepiece protobuf
python -c "import sglang; print(sglang.__version__)"Download a Model
Pre-pull weights so the first server start does not block on a slow HuggingFace download. Qwen 2.5 1.5B is the easiest smoke-test; bump to 3B if you have 8GB+ free.
pip install huggingface_hub
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct \
--local-dir ~/models/qwen2.5-1.5b-instruct
For gated models (Llama 3.2, etc.), run huggingface-cli login first and paste an access token from your HF account.
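If you would rather script the download (for example from a provisioning tool), huggingface_hub exposes the same functionality in Python. This is a sketch using the same paths as above; gated repos additionally need a token argument or a prior huggingface-cli login.
# Programmatic equivalent of the huggingface-cli download command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-1.5B-Instruct",
    local_dir="/home/sglang/models/qwen2.5-1.5b-instruct",
)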
Launch the SGLang Server
export SGLANG_USE_CPU_ENGINE=1
python -m sglang.launch_server \
--model-path ~/models/qwen2.5-1.5b-instruct \
--device cpu \
--host 127.0.0.1 \
--port 30000 \
--mem-fraction-static 0.8 \
--max-total-tokens 8192 \
--disable-overlap-schedule \
--trust-remote-code
Startup takes 30 to 90 seconds while weights load; wait for the log line "The server is fired up and ready to roll!". --host 127.0.0.1 keeps the API loopback-only. Never bind to 0.0.0.0: as launched here, the API is completely unauthenticated. --disable-overlap-schedule is recommended for the CPU backend.
curl http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-1.5b-instruct",
"messages": [{"role": "user", "content": "Write one sentence about RamNode."}],
"temperature": 0.7,
"max_tokens": 80
}'
Run as a systemd Service
Create /etc/systemd/system/sglang.service:
[Unit]
Description=SGLang Inference Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=sglang
Group=sglang
WorkingDirectory=/home/sglang
Environment="SGLANG_USE_CPU_ENGINE=1"
Environment="HF_HOME=/home/sglang/.cache/huggingface"
Environment="PATH=/home/sglang/sglang-env/bin:/usr/local/bin:/usr/bin:/bin"
ExecStart=/home/sglang/sglang-env/bin/python -m sglang.launch_server \
--model-path /home/sglang/models/qwen2.5-1.5b-instruct \
--device cpu \
--host 127.0.0.1 \
--port 30000 \
--mem-fraction-static 0.8 \
--max-total-tokens 8192 \
--disable-overlap-schedule \
--trust-remote-code
Restart=on-failure
RestartSec=15
TimeoutStartSec=300
LimitNOFILE=65535
# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/home/sglang
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
[Install]
WantedBy=multi-user.target
The generous TimeoutStartSec=300 matters: model load on a small CPU VPS legitimately takes a couple of minutes, and you do not want systemd killing the service while weights are still mapping into memory.
sudo systemctl daemon-reload
sudo systemctl enable --now sglang
sudo journalctl -u sglang -f
Nginx with Bearer-Token Auth + TLS
sudo apt install -y nginx certbot python3-certbot-nginx
sudo certbot --nginx -d llm.yourdomain.com
# Generate a strong API key
openssl rand -hex 32
Replace the auto-generated nginx site at /etc/nginx/sites-available/llm.yourdomain.com:
map $http_authorization $auth_ok {
default 0;
"Bearer sk-ramnode-PASTE_YOUR_KEY_HERE" 1;
}
server {
listen 80;
listen [::]:80;
server_name llm.yourdomain.com;
return 301 https://$host$request_uri;
}
server {
listen 443 ssl http2;
listen [::]:443 ssl http2;
server_name llm.yourdomain.com;
ssl_certificate /etc/letsencrypt/live/llm.yourdomain.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/llm.yourdomain.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
client_max_body_size 8m;
proxy_read_timeout 600s;
proxy_send_timeout 600s;
proxy_buffering off;
location / {
if ($auth_ok = 0) {
return 401 '{"error":"unauthorized"}';
}
# Serve the 401 JSON body with the right type without overriding upstream headers
default_type application/json;
proxy_pass http://127.0.0.1:30000;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Connection "";
}
}
Test the config and reload nginx with sudo nginx -t && sudo systemctl reload nginx, then verify:
# Should return 401
curl -s https://llm.yourdomain.com/v1/models
# Should return the served model list
curl -s https://llm.yourdomain.com/v1/models \
-H "Authorization: Bearer sk-ramnode-PASTE_YOUR_KEY_HERE"Any OpenAI SDK now works against your endpoint by setting base_url and api_key. For multi-tenant scenarios, swap the simple map directive for oauth2-proxy or a small FastAPI sidecar that validates per-tenant keys.
Benchmark Your Deployment
source ~/sglang-env/bin/activate
python -m sglang.bench_serving \
--backend sglang \
--model qwen2.5-1.5b-instruct \
--base-url http://127.0.0.1:30000 \
--dataset-name random \
--random-input 256 \
--random-output 128 \
--num-prompts 50 \
--max-concurrency 4
Watch Mean TTFT (time to first token) and Output throughput. On a 4 vCPU box with a 1.5B model, expect roughly 10 to 30 tokens/sec per request, with aggregate throughput rising as you batch concurrent requests. If the numbers are unacceptable: pick a smaller model, switch to a more aggressive quantization (INT8, AWQ 4-bit), or move to a GPU-backed deployment; the API surface is identical, so client code does not change.
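For a quick TTFT spot check outside the benchmark harness, a hand-timed streaming request is enough. This sketch assumes the openai package and talks to the local port directly, so the API key is a dummy value:
# Rough time-to-first-token measurement against the loopback server (no nginx).
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="unused")

start = time.perf_counter()
first_token = None
stream = client.chat.completions.create(
    model="qwen2.5-1.5b-instruct",
    messages=[{"role": "user", "content": "Explain KV cache in two sentences."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    if first_token is None and chunk.choices and chunk.choices[0].delta.content:
        first_token = time.perf_counter()  # first generated token arrived
print(f"TTFT: {first_token - start:.2f} s")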
Troubleshooting
- Server hangs at startup and never logs "ready": almost always a memory issue. Watch htop; if the process is OOM-killed, drop to a smaller model or resize.
- ImportError related to CUDA: you have the CUDA build of PyTorch. Run pip uninstall -y torch torchvision torchaudio and reinstall with the CPU index URL.
- Inference dramatically slower than expected: check vmstat 2 for swap activity. High swap means the model is overflowing RAM.
- HTTP 401 with a valid header: the nginx map is a literal string match. Reproduce with curl -v; whitespace, a missing Bearer prefix, or trailing newlines all fail.
- Custom architectures (Qwen, DeepSeek): ensure --trust-remote-code is in both the systemd unit and any manual launch.
