Self-hosted, OpenAI-compatible AI inference. Run LLMs, generate images, process audio, and more — all on your own infrastructure for as little as $4/month.
Running AI models on your own infrastructure gives you complete control over your data, eliminates recurring API costs, and provides the flexibility to experiment with any open-source model. LocalAI is one of the most versatile self-hosted AI platforms available, offering a drop-in replacement for the OpenAI API that runs entirely on your hardware.
By the end of this guide, you'll have a fully functional, API-compatible AI inference server capable of running large language models, generating images, processing audio, and more.
- Core inference server and API
- Autonomous AI agent platform with OpenAI Responses API
- REST/MCP API for semantic search and persistent memory
- Go library for building cooperative agentic software
| Use Case | RAM | vCPUs | Storage | Plan |
|---|---|---|---|---|
| Small models (1–3B) | 4 GB | 2 | 40 GB SSD | KVM 4GB |
| Medium models (7–8B) | 8 GB | 4 | 80 GB SSD | KVM 8GB |
| Large models (13B+) | 16 GB+ | 6+ | 120 GB+ SSD | KVM 16GB+ |
| Multi-model / Production | 32 GB+ | 8+ | 200 GB+ SSD | KVM 32GB+ |
Begin with a 4GB or 8GB plan to test your workflow, then upgrade as needed. Quantized models (Q4_K_M) significantly reduce memory requirements — a 7B parameter model in Q4 quantization typically needs only 4–6 GB of RAM.
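The sizing table above maps naturally onto a small helper for provisioning scripts. The thresholds and plan names below are copied from the table; this is a convenience sketch, not an official sizing tool:

```python
def recommend_plan(model_params_b: float) -> tuple[int, str]:
    """Return (RAM in GB, plan name) for a quantized model of the
    given parameter count in billions, per the sizing table."""
    if model_params_b <= 3:
        return 4, "KVM 4GB"
    if model_params_b <= 8:
        return 8, "KVM 8GB"
    if model_params_b <= 13:
        return 16, "KVM 16GB"
    return 32, "KVM 32GB"

print(recommend_plan(7))  # → (8, 'KVM 8GB')
```

Note the assumption baked in: these figures apply to Q4-quantized models. Full-precision weights need roughly 3–4× more RAM, so size up accordingly if you skip quantization.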
```bash
ssh root@YOUR_SERVER_IP
```

```bash
apt update && apt upgrade -y
```

```bash
adduser deploy
usermod -aG sudo deploy
```

```bash
ufw allow OpenSSH
ufw allow 8080/tcp   # LocalAI API
ufw allow 80/tcp     # HTTP (for reverse proxy)
ufw allow 443/tcp    # HTTPS (for reverse proxy)
ufw enable
```

Swap provides a safety net when running larger models that approach your RAM limits.
```bash
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo "/swapfile none swap sw 0 0" >> /etc/fstab
```

```bash
# Install prerequisites
apt install -y ca-certificates curl gnupg

# Add Docker's official GPG key
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg

# Add the Docker repository
echo "deb [arch=$(dpkg --print-architecture) \
  signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
  | tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker
apt update
apt install -y docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin
```

```bash
usermod -aG docker deploy
newgrp docker
docker run hello-world
```

| Image Tag | Use Case |
|---|---|
| localai/localai:latest | Standard CPU image, no pre-loaded models |
| localai/localai:latest-aio-cpu | All-in-One CPU: pre-configured models included |
| localai/localai:latest-gpu-nvidia-cuda-12 | NVIDIA GPU with CUDA 12 support |
| localai/localai:latest-gpu-nvidia-cuda-13 | NVIDIA GPU with CUDA 13 support |
| localai/localai:latest-gpu-hipblas | AMD GPU support |
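The tag choice can be encoded in a tiny helper for provisioning scripts. The tags below are copied from the table; the function itself is just an illustrative sketch:

```python
def localai_image(gpu: str = "", aio: bool = False) -> str:
    """Pick a LocalAI image tag based on the available hardware."""
    if gpu == "nvidia-cuda-12":
        return "localai/localai:latest-gpu-nvidia-cuda-12"
    if gpu == "nvidia-cuda-13":
        return "localai/localai:latest-gpu-nvidia-cuda-13"
    if gpu == "amd":
        return "localai/localai:latest-gpu-hipblas"
    # CPU-only: optionally use the All-in-One image with bundled models
    return "localai/localai:latest-aio-cpu" if aio else "localai/localai:latest"

print(localai_image())           # → localai/localai:latest
print(localai_image(gpu="amd"))  # → localai/localai:latest-gpu-hipblas
```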
```bash
docker run -d \
  --name local-ai \
  -p 8080:8080 \
  -v localai-models:/models \
  --restart unless-stopped \
  localai/localai:latest
```

Create a working directory for a Compose-based setup:

```bash
mkdir -p /opt/localai && cd /opt/localai
```

```yaml
services:
  localai:
    image: localai/localai:latest-aio-cpu
    container_name: local-ai
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
      - ./config:/config
    environment:
      - THREADS=4
      - CONTEXT_SIZE=2048
      - MODELS_PATH=/models
      - API_KEY=your-secret-api-key-here
      - DEBUG=false
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 30s
      timeout: 10s
      retries: 3
```

```bash
docker compose up -d

# Check container status
docker compose logs -f

# Test the API endpoint
curl http://localhost:8080/readyz
```

Always set the API_KEY environment variable when exposing LocalAI to the network. Without it, anyone who can reach port 8080 has full access to your instance.
Install from the model gallery by ID:

```bash
curl -X POST http://localhost:8080/models/apply \
  -H 'Content-Type: application/json' \
  -d '{"id": "llama-3.2-1b-instruct:q4_k_m"}'
```

Install a GGUF model directly from Hugging Face:

```bash
curl -X POST http://localhost:8080/models/apply \
  -H 'Content-Type: application/json' \
  -d '{"url": "huggingface://TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"}'
```

Or pull from the Ollama registry:

```bash
curl -X POST http://localhost:8080/models/apply \
  -H 'Content-Type: application/json' \
  -d '{"url": "ollama://gemma:2b"}'
```

| Model | Size | RAM | Best For |
|---|---|---|---|
| llama-3.2-1b-instruct | ~1 GB | 2 GB | Testing, lightweight tasks |
| phi-4-mini (Q4_K_M) | ~2.5 GB | 4 GB | Coding assistance, reasoning |
| mistral-7b-instruct (Q4) | ~4 GB | 6 GB | General-purpose chat |
| llama-3.1-8b-instruct (Q4) | ~5 GB | 8 GB | High-quality assistant |
| gemma-3-12b-it (Q4) | ~7 GB | 12 GB | Advanced reasoning, multilingual |
Q4_K_M quantization offers an excellent balance between model quality and resource usage. It reduces memory requirements by roughly 60–75% compared to full-precision models with minimal quality loss. Always prefer quantized models on VPS deployments.
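A rough back-of-the-envelope check makes these numbers concrete: weight memory is approximately parameter count × bits per weight ÷ 8, plus some allowance for the KV cache and runtime. This is a sketch with an assumed flat overhead, not LocalAI's own accounting:

```python
def approx_model_ram_gb(params_billion: float, bits_per_weight: float,
                        overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate: weights at the given quantization width,
    plus a flat allowance for KV cache and runtime overhead."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1024**3
    return weights_gb + overhead_gb

# A 7B model at ~4.5 bits/weight (Q4_K_M) vs. full 16-bit precision:
q4 = approx_model_ram_gb(7, 4.5)
fp16 = approx_model_ram_gb(7, 16)
print(f"Q4_K_M: ~{q4:.1f} GB, FP16: ~{fp16:.1f} GB")
# → Q4_K_M: ~5.2 GB, FP16: ~14.5 GB
```

With these assumptions a 7B Q4 model lands squarely in the 4–6 GB range quoted earlier, and weight memory drops by roughly 70% versus FP16 — consistent with the 60–75% reduction figure.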
```bash
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer your-secret-api-key-here' \
  -d '{
    "model": "llama-3.2-1b-instruct",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```

```bash
curl http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer your-secret-api-key-here' \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "The quick brown fox"
  }'
```

Use the official OpenAI Python library — just point it at your LocalAI instance:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR_SERVER_IP:8080/v1",
    api_key="your-secret-api-key-here"
)

response = client.chat.completions.create(
    model="llama-3.2-1b-instruct",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)
print(response.choices[0].message.content)
```

List the models currently installed:

```bash
curl http://localhost:8080/v1/models \
  -H 'Authorization: Bearer your-secret-api-key-here'
```

Install Nginx and Certbot:

```bash
apt install -y nginx certbot python3-certbot-nginx
```

```nginx
server {
    listen 80;
    server_name ai.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Increase timeouts for model loading and inference
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
        proxy_connect_timeout 60s;

        # Allow large request bodies for file uploads
        client_max_body_size 100M;
    }
}
```

```bash
ln -s /etc/nginx/sites-available/localai /etc/nginx/sites-enabled/
nginx -t
systemctl reload nginx

# Obtain SSL certificate
certbot --nginx -d ai.yourdomain.com
```

Once Nginx is configured, remove the direct port 8080 firewall rule so all traffic flows through the reverse proxy:

```bash
ufw delete allow 8080/tcp
```

```nginx
# Add to the http block in /etc/nginx/nginx.conf
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

# Add inside your location block
limit_req zone=api burst=20 nodelay;
```

```bash
apt install -y fail2ban
systemctl enable fail2ban
systemctl start fail2ban
```

```bash
# Update system packages
apt update && apt upgrade -y

# Update LocalAI container
cd /opt/localai
docker compose pull
docker compose up -d
```

| Variable | Default | Description |
|---|---|---|
| THREADS | Auto | CPU threads for inference (set to vCPU count) |
| CONTEXT_SIZE | 512 | Maximum context length in tokens |
| F16 | false | Half-precision for faster inference |
| PARALLEL_REQUEST | false | Enable parallel request processing |
| WATCHDOG_IDLE | true | Unload idle models to free memory |
| WATCHDOG_BUSY | true | Restart stuck model processes |
Create YAML configuration files in your models directory for fine-grained control:

```yaml
name: mistral-7b-instruct
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
context_size: 4096
threads: 4
template:
  chat_message: "[INST] {{.Input}} [/INST]"
```

Monitor the container's resource usage in real time:

```bash
docker stats local-ai
```

LocalAGI is an autonomous AI agent platform that integrates with LocalAI. It provides OpenAI Responses API compatibility and supports advanced agentic capabilities including tool use, multi-step reasoning, and persistent memory.
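Because LocalAGI targets the OpenAI Responses API, clients talk to it the same way they talk to LocalAI. A minimal stdlib sketch of building such a request — the base URL is a placeholder (LocalAGI's listen address depends on your compose setup), and the model name is just an example:

```python
import json
import urllib.request

def build_responses_request(base_url: str, prompt: str, model: str,
                            api_key: str) -> urllib.request.Request:
    """Build a POST to an OpenAI Responses API-compatible endpoint."""
    payload = {"model": model, "input": prompt}
    return urllib.request.Request(
        f"{base_url}/v1/responses",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

req = build_responses_request(
    "http://YOUR_LOCALAGI_URL",
    "Summarize today's server logs in one sentence.",
    "gemma-3-12b-it",
    "your-secret-api-key-here",
)
print(req.get_full_url())  # → http://YOUR_LOCALAGI_URL/v1/responses
# urllib.request.urlopen(req, timeout=120) would send it; parse the JSON
# reply per the Responses API schema.
```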
```bash
git clone https://github.com/mudler/LocalAGI
cd LocalAGI

# CPU deployment
docker compose up -d

# Or with NVIDIA GPU
docker compose -f docker-compose.nvidia.yaml up -d
```

```bash
# Set a specific model
MODEL_NAME=gemma-3-12b-it docker compose up -d

# Full multimodal setup with image generation
MODEL_NAME=gemma-3-12b-it \
MULTIMODAL_MODEL=minicpm-v-4_5 \
IMAGE_MODEL=flux.1-dev-ggml \
docker compose up -d
```