
    Deploy LocalAI on Your RamNode VPS

    Self-hosted, OpenAI-compatible AI inference. Run LLMs, generate images, process audio, and more — all on your own infrastructure for as little as $4/month.

    Setup time: 20–30 min | Minimum RAM: 4 GB+ | Difficulty: Beginner | Default port: 8080

    Overview

    Running AI models on your own infrastructure gives you complete control over your data, eliminates recurring API costs, and provides the flexibility to experiment with any open-source model. LocalAI is one of the most versatile self-hosted AI platforms available, offering a drop-in replacement for the OpenAI API that runs entirely on your hardware.

    By the end of this guide, you'll have a fully functional, API-compatible AI inference server capable of running large language models, generating images, processing audio, and more.

    Key Features

    OpenAI API compatible — drop-in replacement for chat, embeddings, image generation, audio
    No GPU required — runs on CPU with consumer-grade hardware
    Multi-model support — GGUF, Safetensors, GPTQ, AWQ, and more
    Model gallery — pre-configured models ready to run
    Built-in web UI — browser-based chat and model management
    MCP support — Model Context Protocol for agentic capabilities
    Modular backends — downloaded on-demand, keeping core lightweight
    Privacy-first — all processing on your server, no data leaves

    The LocalAI Ecosystem

    LocalAI

    Core inference server and API

    LocalAGI

    Autonomous AI agent platform with OpenAI Responses API

    LocalRecall

    REST/MCP API for semantic search and persistent memory

    Cogito

    Go library for building cooperative agentic software

    RamNode VPS Requirements

    Recommended Plans

    Use Case                  | RAM    | vCPUs | Storage      | Plan
    Small models (1–3B)       | 4 GB   | 2     | 40 GB SSD    | KVM 4GB
    Medium models (7–8B)      | 8 GB   | 4     | 80 GB SSD    | KVM 8GB
    Large models (13B+)       | 16 GB+ | 6+    | 120 GB+ SSD  | KVM 16GB+
    Multi-model / Production  | 32 GB+ | 8+    | 200 GB+ SSD  | KVM 32GB+

    Tip: Start Small, Scale Up

    Begin with a 4GB or 8GB plan to test your workflow, then upgrade as needed. Quantized models (Q4_K_M) significantly reduce memory requirements — a 7B parameter model in Q4 quantization typically needs only 4–6 GB of RAM.

    Initial Server Setup

    Connect & Harden

    Connect to Your VPS
    ssh root@YOUR_SERVER_IP
    Update System Packages
    apt update && apt upgrade -y
    Create a Non-Root User
    adduser deploy
    usermod -aG sudo deploy
    Configure Firewall
    ufw allow OpenSSH
    ufw allow 8080/tcp     # LocalAI API
    ufw allow 80/tcp       # HTTP (for reverse proxy)
    ufw allow 443/tcp      # HTTPS (for reverse proxy)
    ufw enable

    Set Up Swap Space (Recommended)

    Swap provides a safety net when running larger models that approach your RAM limits.

    Create Swap File
    fallocate -l 4G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    echo "/swapfile none swap sw 0 0" >> /etc/fstab
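
    After enabling the file, it's worth confirming the kernel actually registered it:

```shell
# Confirm the swap file is active and counted in total swap
swapon --show
grep -i swap /proc/meminfo
```

    `swapon --show` should list /swapfile with its size, and SwapTotal in /proc/meminfo should now be non-zero.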

    Installing Docker

    Install Docker Engine
    # Install prerequisites
    apt install -y ca-certificates curl gnupg
    
    # Add Docker's official GPG key
    install -m 0755 -d /etc/apt/keyrings
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
      | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    chmod a+r /etc/apt/keyrings/docker.gpg
    
    # Add the Docker repository
    echo "deb [arch=$(dpkg --print-architecture) \
      signed-by=/etc/apt/keyrings/docker.gpg] \
      https://download.docker.com/linux/ubuntu \
      $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
      | tee /etc/apt/sources.list.d/docker.list > /dev/null
    
    # Install Docker
    apt update
    apt install -y docker-ce docker-ce-cli containerd.io \
      docker-buildx-plugin docker-compose-plugin
    Add User & Verify
    usermod -aG docker deploy
    newgrp docker
    docker run hello-world

    Deploying LocalAI with Docker

    Image Selection

    Image Tag                                  | Use Case
    localai/localai:latest                     | Standard CPU image, no pre-loaded models
    localai/localai:latest-aio-cpu             | All-in-One CPU: pre-configured models included
    localai/localai:latest-gpu-nvidia-cuda-12  | NVIDIA GPU with CUDA 12 support
    localai/localai:latest-gpu-nvidia-cuda-13  | NVIDIA GPU with CUDA 13 support
    localai/localai:latest-gpu-hipblas         | AMD GPU support

    Quick Start (CPU)

    Quick Start
    docker run -d \
      --name local-ai \
      -p 8080:8080 \
      -v localai-models:/models \
      --restart unless-stopped \
      localai/localai:latest

    Production Deployment with Docker Compose (Recommended)

    Create Project Directory
    mkdir -p /opt/localai && cd /opt/localai
    docker-compose.yml
    services:
      localai:
        image: localai/localai:latest-aio-cpu
        container_name: local-ai
        ports:
          - "8080:8080"
        volumes:
          - ./models:/models
          - ./config:/config
        environment:
          - THREADS=4
          - CONTEXT_SIZE=2048
          - MODELS_PATH=/models
          - API_KEY=your-secret-api-key-here
          - DEBUG=false
        restart: unless-stopped
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
          interval: 30s
          timeout: 10s
          retries: 3
    Start & Verify
    docker compose up -d
    
    # Check container status
    docker compose logs -f
    
    # Test the API endpoint
    curl http://localhost:8080/readyz

    API Key Security

    Always set the API_KEY environment variable when exposing LocalAI to the network. Without it, anyone who can reach port 8080 has full access to your instance.
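
    Replace the placeholder key with a long random value. One way to generate one (a sketch using Python's standard secrets module, not a LocalAI-specific tool):

```python
import secrets

# Generate a URL-safe random key with 32 bytes (~256 bits) of entropy,
# suitable as the value of the API_KEY environment variable.
api_key = secrets.token_urlsafe(32)
print(api_key)
```

    Store the generated value in your docker-compose.yml (or a .env file) rather than committing it to version control.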

    Installing and Running Models

    Install from Gallery, Hugging Face, or Ollama

    From the LocalAI Gallery
    curl -X POST http://localhost:8080/models/apply \
      -H 'Content-Type: application/json' \
      -d '{"id": "llama-3.2-1b-instruct:q4_k_m"}'
    From Hugging Face
    curl -X POST http://localhost:8080/models/apply \
      -H 'Content-Type: application/json' \
      -d '{"url": "huggingface://TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"}'
    From Ollama Registry
    curl -X POST http://localhost:8080/models/apply \
      -H 'Content-Type: application/json' \
      -d '{"url": "ollama://gemma:2b"}'

    Recommended Starter Models

    Model                       | Size    | RAM   | Best For
    llama-3.2-1b-instruct       | ~1 GB   | 2 GB  | Testing, lightweight tasks
    phi-4-mini (Q4_K_M)         | ~2.5 GB | 4 GB  | Coding assistance, reasoning
    mistral-7b-instruct (Q4)    | ~4 GB   | 6 GB  | General-purpose chat
    llama-3.1-8b-instruct (Q4)  | ~5 GB   | 8 GB  | High-quality assistant
    gemma-3-12b-it (Q4)         | ~7 GB   | 12 GB | Advanced reasoning, multilingual

    Quantization Matters

    Q4_K_M quantization offers an excellent balance between model quality and resource usage. It reduces memory requirements by roughly 60–75% compared to full-precision models with minimal quality loss. Always prefer quantized models on VPS deployments.
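
    As a rough back-of-the-envelope check (the constants below are approximations, not figures from the LocalAI docs), you can estimate the RAM a quantized model needs from its parameter count:

```python
def estimate_ram_gb(params_billions: float,
                    bits_per_param: float = 4.5,
                    overhead_gb: float = 1.5) -> float:
    """Very rough RAM estimate for a quantized GGUF model.

    bits_per_param of ~4.5 approximates Q4_K_M's average;
    overhead_gb is a guess covering KV cache and runtime.
    """
    weights_gb = params_billions * bits_per_param / 8
    return weights_gb + overhead_gb

# A 7B model in Q4_K_M: ~4 GB of weights plus overhead,
# consistent with the 4-6 GB figure in the table above.
print(round(estimate_ram_gb(7), 1))
```

    Treat the result as a sizing hint only; actual usage depends on context size and backend.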

    Using the LocalAI API

    Chat Completions

    Chat Completions Request
    curl http://localhost:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer your-secret-api-key-here' \
      -d '{
        "model": "llama-3.2-1b-instruct",
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ]
      }'

    Embeddings

    Embeddings Request
    curl http://localhost:8080/v1/embeddings \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer your-secret-api-key-here' \
      -d '{
        "model": "text-embedding-ada-002",
        "input": "The quick brown fox"
      }'
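
    Embedding vectors are typically compared with cosine similarity. A minimal, dependency-free sketch (the vectors here are made up for illustration, standing in for real embedding responses):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors; identical vectors score 1.0
v1 = [0.1, 0.3, 0.5]
v2 = [0.1, 0.3, 0.5]
v3 = [0.5, -0.2, 0.0]

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

    In a real pipeline you would feed the "embedding" arrays from the API response into a function like this, or into a vector database that does the comparison for you.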

    Python Client Example

    Use the official OpenAI Python library — just point it at your LocalAI instance:

    Python OpenAI Client
    from openai import OpenAI
    
    client = OpenAI(
        base_url="http://YOUR_SERVER_IP:8080/v1",
        api_key="your-secret-api-key-here"
    )
    
    response = client.chat.completions.create(
        model="llama-3.2-1b-instruct",
        messages=[
            {"role": "user", "content": "Hello, how are you?"}
        ]
    )
    
    print(response.choices[0].message.content)

    List Available Models

    List Models
    curl http://localhost:8080/v1/models \
      -H 'Authorization: Bearer your-secret-api-key-here'

    Configuring a Reverse Proxy with Nginx

    Install Nginx & Create Configuration

    Install Nginx & Certbot
    apt install -y nginx certbot python3-certbot-nginx
    /etc/nginx/sites-available/localai
    server {
        listen 80;
        server_name ai.yourdomain.com;
    
        location / {
            proxy_pass http://127.0.0.1:8080;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
    
            # Increase timeouts for model loading and inference
            proxy_read_timeout 300s;
            proxy_send_timeout 300s;
            proxy_connect_timeout 60s;
    
            # Allow large request bodies for file uploads
            client_max_body_size 100M;
        }
    }

    Enable the Site & Obtain SSL

    Enable Site & SSL
    ln -s /etc/nginx/sites-available/localai /etc/nginx/sites-enabled/
    nginx -t
    systemctl reload nginx
    
    # Obtain SSL certificate
    certbot --nginx -d ai.yourdomain.com

    Securing Your Deployment

    Restrict Direct Port Access

    Once Nginx is configured, remove the direct port 8080 firewall rule so all traffic flows through the reverse proxy:

    Remove Direct Access
    ufw delete allow 8080/tcp

    Rate Limiting with Nginx

    Nginx Rate Limiting
    # Add to the http block in /etc/nginx/nginx.conf
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
    
    # Add inside your location block
    limit_req zone=api burst=20 nodelay;

    Fail2ban Integration

    Install Fail2ban
    apt install -y fail2ban
    systemctl enable fail2ban
    systemctl start fail2ban
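
    On Debian/Ubuntu, Fail2ban enables an SSH jail by default; you can tighten it with a local override. The values below are suggested starting points, not defaults, so adjust them to taste:

```
# /etc/fail2ban/jail.local
[sshd]
enabled = true
maxretry = 5
findtime = 10m
bantime = 1h
```

    Restart Fail2ban after editing (systemctl restart fail2ban) for the override to take effect.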

    Keep Everything Updated

    Update System & LocalAI
    # Update system packages
    apt update && apt upgrade -y
    
    # Update LocalAI container
    cd /opt/localai
    docker compose pull
    docker compose up -d

    Performance Tuning

    Environment Variables

    Variable          | Default | Description
    THREADS           | Auto    | CPU threads for inference (set to vCPU count)
    CONTEXT_SIZE      | 512     | Maximum context length in tokens
    F16               | false   | Half-precision for faster inference
    PARALLEL_REQUEST  | false   | Enable parallel request processing
    WATCHDOG_IDLE     | true    | Unload idle models to free memory
    WATCHDOG_BUSY     | true    | Restart stuck model processes
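
    Applied to the Compose file from earlier, a tuned configuration for a 4-vCPU plan might look like this (the values are illustrative starting points, not LocalAI defaults):

```yaml
services:
  localai:
    environment:
      - THREADS=4              # match your vCPU count
      - CONTEXT_SIZE=4096
      - F16=true               # half-precision for faster inference
      - PARALLEL_REQUEST=true  # serve concurrent requests
      - WATCHDOG_IDLE=true     # unload idle models to free memory
      - WATCHDOG_BUSY=true     # restart stuck model processes
```

    Run docker compose up -d again after changing environment variables so the container is recreated with the new settings.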

    Model Configuration Files

    Create YAML configuration files in your models directory for fine-grained control:

    models/mistral.yaml
    name: mistral-7b-instruct
    backend: llama-cpp
    parameters:
      model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
      context_size: 4096
      threads: 4
    template:
      chat_message: "[INST] {{.Input}} [/INST]"

    Memory Optimization Tips

    • Use Q4_K_M quantization for the best balance of quality and memory efficiency
    • Set CONTEXT_SIZE only as high as you need — 4096 is a good default
    • Enable WATCHDOG_IDLE to automatically unload unused models
    • Run one model at a time on VPS plans with 8 GB RAM or less
    • Monitor memory usage with: docker stats local-ai

    Deploying LocalAGI (Optional)

    LocalAGI is an autonomous AI agent platform that integrates with LocalAI. It provides OpenAI Responses API compatibility and supports advanced agentic capabilities including tool use, multi-step reasoning, and persistent memory.

    Deploy LocalAGI

    Clone & Deploy
    git clone https://github.com/mudler/LocalAGI
    cd LocalAGI
    
    # CPU deployment
    docker compose up -d
    
    # Or with NVIDIA GPU
    docker compose -f docker-compose.nvidia.yaml up -d
    Customize Model
    # Set a specific model
    MODEL_NAME=gemma-3-12b-it docker compose up -d
    
    # Full multimodal setup with image generation
    MODEL_NAME=gemma-3-12b-it \
    MULTIMODAL_MODEL=minicpm-v-4_5 \
    IMAGE_MODEL=flux.1-dev-ggml \
    docker compose up -d

    Next Steps

    • Connect a front-end UI like Open WebUI or LibreChat for a polished chat experience
    • Set up LocalRecall for RAG (retrieval-augmented generation) with your own documents
    • Explore the model gallery for specialized models — image generation, code completion, audio transcription
    • Set up monitoring with Grafana and Prometheus to track inference latency
    • Implement automated backups of your model configurations and data
    • Explore MCP integration for building autonomous AI agents