
    Deploy LocalAI on Your RamNode VPS

    Self-hosted, OpenAI-compatible AI inference. Run LLMs, generate images, process audio, and more — all on your own infrastructure for as little as $4/month.

    Setup time: 20–30 min | Minimum RAM: 4 GB+ | Difficulty: Beginner | Default port: 8080

    Overview

    Running AI models on your own infrastructure gives you complete control over your data, eliminates recurring API costs, and provides the flexibility to experiment with any open-source model. LocalAI is one of the most versatile self-hosted AI platforms available, offering a drop-in replacement for the OpenAI API that runs entirely on your hardware.

    By the end of this guide, you'll have a fully functional, API-compatible AI inference server capable of running large language models, generating images, processing audio, and more.

    Key Features

    OpenAI API compatible — drop-in replacement for chat, embeddings, image generation, audio
    No GPU required — runs on CPU with consumer-grade hardware
    Multi-model support — GGUF, Safetensors, GPTQ, AWQ, and more
    Model gallery — pre-configured models ready to run
    Built-in web UI — browser-based chat and model management
    MCP support — Model Context Protocol for agentic capabilities
    Modular backends — downloaded on-demand, keeping core lightweight
    Privacy-first — all processing on your server, no data leaves

    The LocalAI Ecosystem

    LocalAI

    Core inference server and API

    LocalAGI

    Autonomous AI agent platform with OpenAI Responses API

    LocalRecall

    REST/MCP API for semantic search and persistent memory

    Cogito

    Go library for building cooperative agentic software

    RamNode VPS Requirements

    Recommended Plans

    Use Case                  | RAM    | vCPUs | Storage      | Plan
    Small models (1–3B)       | 4 GB   | 2     | 40 GB SSD    | KVM 4GB
    Medium models (7–8B)      | 8 GB   | 4     | 80 GB SSD    | KVM 8GB
    Large models (13B+)       | 16 GB+ | 6+    | 120 GB+ SSD  | KVM 16GB+
    Multi-model / Production  | 32 GB+ | 8+    | 200 GB+ SSD  | KVM 32GB+

    Tip: Start Small, Scale Up

    Begin with a 4GB or 8GB plan to test your workflow, then upgrade as needed. Quantized models (Q4_K_M) significantly reduce memory requirements — a 7B parameter model in Q4 quantization typically needs only 4–6 GB of RAM.

    Initial Server Setup

    Connect & Harden

    Connect to Your VPS
    ssh root@YOUR_SERVER_IP
    Update System Packages
    apt update && apt upgrade -y
    Create a Non-Root User
    adduser deploy
    usermod -aG sudo deploy
    Configure Firewall
    ufw allow OpenSSH
    ufw allow 8080/tcp     # LocalAI API
    ufw allow 80/tcp       # HTTP (for reverse proxy)
    ufw allow 443/tcp      # HTTPS (for reverse proxy)
    ufw enable

    Set Up Swap Space (Recommended)

    Swap provides a safety net when running larger models that approach your RAM limits.

    Create Swap File
    fallocate -l 4G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    echo "/swapfile none swap sw 0 0" >> /etc/fstab
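
    After enabling the file, it's worth confirming the kernel actually registered it:

```shell
# Confirm the swap file is active and counted in total swap
swapon --show
grep -i swap /proc/meminfo
```

    `swapon --show` should list /swapfile with its size, and SwapTotal in /proc/meminfo should now be non-zero.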

    Installing Docker

    Install Docker Engine
    # Install prerequisites
    apt install -y ca-certificates curl gnupg
    
    # Add Docker's official GPG key
    install -m 0755 -d /etc/apt/keyrings
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
      | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    chmod a+r /etc/apt/keyrings/docker.gpg
    
    # Add the Docker repository
    echo "deb [arch=$(dpkg --print-architecture) \
      signed-by=/etc/apt/keyrings/docker.gpg] \
      https://download.docker.com/linux/ubuntu \
      $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
      | tee /etc/apt/sources.list.d/docker.list > /dev/null
    
    # Install Docker
    apt update
    apt install -y docker-ce docker-ce-cli containerd.io \
      docker-buildx-plugin docker-compose-plugin
    Add User & Verify
    usermod -aG docker deploy
    newgrp docker
    docker run hello-world

    Deploying LocalAI with Docker

    Image Selection

    Image Tag                                  | Use Case
    localai/localai:latest                     | Standard CPU image, no pre-loaded models
    localai/localai:latest-aio-cpu             | All-in-One CPU: pre-configured models included
    localai/localai:latest-gpu-nvidia-cuda-12  | NVIDIA GPU with CUDA 12 support
    localai/localai:latest-gpu-nvidia-cuda-13  | NVIDIA GPU with CUDA 13 support
    localai/localai:latest-gpu-hipblas         | AMD GPU support

    Quick Start (CPU)

    Quick Start
    docker run -d \
      --name local-ai \
      -p 8080:8080 \
      -v localai-models:/models \
      --restart unless-stopped \
      localai/localai:latest

    Production Deployment with Docker Compose (Recommended)

    Create Project Directory
    mkdir -p /opt/localai && cd /opt/localai
    docker-compose.yml
    services:
      localai:
        image: localai/localai:latest-aio-cpu
        container_name: local-ai
        ports:
          - "8080:8080"
        volumes:
          - ./models:/models
          - ./config:/config
        environment:
          - THREADS=4
          - CONTEXT_SIZE=2048
          - MODELS_PATH=/models
          - API_KEY=your-secret-api-key-here
          - DEBUG=false
        restart: unless-stopped
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
          interval: 30s
          timeout: 10s
          retries: 3
    Start & Verify
    docker compose up -d
    
    # Check container status
    docker compose logs -f
    
    # Test the API endpoint
    curl http://localhost:8080/readyz

    API Key Security

    Always set the API_KEY environment variable when exposing LocalAI to the network. Without it, anyone who can reach port 8080 has full access to your instance.
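
    Replace the placeholder key with a long random value. One way to generate one (a sketch using Python's standard secrets module, not a LocalAI-specific tool):

```python
import secrets

# Generate a URL-safe random key with 32 bytes (~256 bits) of entropy,
# suitable as the value of the API_KEY environment variable.
api_key = secrets.token_urlsafe(32)
print(api_key)
```

    Store the generated value in your docker-compose.yml (or a .env file) rather than committing it to version control.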

    Installing and Running Models

    Install from Gallery, Hugging Face, or Ollama

    From the LocalAI Gallery
    curl -X POST http://localhost:8080/models/apply \
      -H 'Content-Type: application/json' \
      -d '{"id": "llama-3.2-1b-instruct:q4_k_m"}'
    From Hugging Face
    curl -X POST http://localhost:8080/models/apply \
      -H 'Content-Type: application/json' \
      -d '{"url": "huggingface://TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"}'
    From Ollama Registry
    curl -X POST http://localhost:8080/models/apply \
      -H 'Content-Type: application/json' \
      -d '{"url": "ollama://gemma:2b"}'

    Recommended Starter Models

    Model                       | Size    | RAM   | Best For
    llama-3.2-1b-instruct       | ~1 GB   | 2 GB  | Testing, lightweight tasks
    phi-4-mini (Q4_K_M)         | ~2.5 GB | 4 GB  | Coding assistance, reasoning
    mistral-7b-instruct (Q4)    | ~4 GB   | 6 GB  | General-purpose chat
    llama-3.1-8b-instruct (Q4)  | ~5 GB   | 8 GB  | High-quality assistant
    gemma-3-12b-it (Q4)         | ~7 GB   | 12 GB | Advanced reasoning, multilingual

    Quantization Matters

    Q4_K_M quantization offers an excellent balance between model quality and resource usage. It reduces memory requirements by roughly 60–75% compared to full-precision models with minimal quality loss. Always prefer quantized models on VPS deployments.
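
    As a rough back-of-the-envelope check (the constants below are approximations, not figures from the LocalAI docs), you can estimate the RAM a quantized model needs from its parameter count:

```python
def estimate_ram_gb(params_billions: float,
                    bits_per_param: float = 4.5,
                    overhead_gb: float = 1.5) -> float:
    """Very rough RAM estimate for a quantized GGUF model.

    bits_per_param of ~4.5 approximates Q4_K_M's average;
    overhead_gb is a guess covering KV cache and runtime.
    """
    weights_gb = params_billions * bits_per_param / 8
    return weights_gb + overhead_gb

# A 7B model in Q4_K_M: ~4 GB of weights plus overhead,
# consistent with the 4-6 GB figure in the table above.
print(round(estimate_ram_gb(7), 1))
```

    Treat the result as a sizing hint only; actual usage depends on context size and backend.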

    Using the LocalAI API

    Chat Completions

    Chat Completions Request
    curl http://localhost:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer your-secret-api-key-here' \
      -d '{
        "model": "llama-3.2-1b-instruct",
        "messages": [
          {"role": "user", "content": "Hello, how are you?"}
        ]
      }'

    Embeddings

    Embeddings Request
    curl http://localhost:8080/v1/embeddings \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer your-secret-api-key-here' \
      -d '{
        "model": "text-embedding-ada-002",
        "input": "The quick brown fox"
      }'
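
    Embedding vectors are typically compared with cosine similarity. A minimal, dependency-free sketch (the vectors here are made up for illustration, standing in for real embedding responses):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors; identical vectors score 1.0
v1 = [0.1, 0.3, 0.5]
v2 = [0.1, 0.3, 0.5]
v3 = [0.5, -0.2, 0.0]

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

    In a real pipeline you would feed the "embedding" arrays from the API response into a function like this, or into a vector database that does the comparison for you.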

    Python Client Example

    Use the official OpenAI Python library — just point it at your LocalAI instance:

    Python OpenAI Client
    from openai import OpenAI
    
    client = OpenAI(
        base_url="http://YOUR_SERVER_IP:8080/v1",
        api_key="your-secret-api-key-here"
    )
    
    response = client.chat.completions.create(
        model="llama-3.2-1b-instruct",
        messages=[
            {"role": "user", "content": "Hello, how are you?"}
        ]
    )
    
    print(response.choices[0].message.content)

    List Available Models

    List Models
    curl http://localhost:8080/v1/models \
      -H 'Authorization: Bearer your-secret-api-key-here'

    Configuring a Reverse Proxy with Nginx

    Install Nginx & Create Configuration

    Install Nginx & Certbot
    apt install -y nginx certbot python3-certbot-nginx
    /etc/nginx/sites-available/localai
    server {
        listen 80;
        server_name ai.yourdomain.com;
    
        location / {
            proxy_pass http://127.0.0.1:8080;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
    
            # Increase timeouts for model loading and inference
            proxy_read_timeout 300s;
            proxy_send_timeout 300s;
            proxy_connect_timeout 60s;
    
            # Allow large request bodies for file uploads
            client_max_body_size 100M;
        }
    }

    Enable the Site & Obtain SSL

    Enable Site & SSL
    ln -s /etc/nginx/sites-available/localai /etc/nginx/sites-enabled/
    nginx -t
    systemctl reload nginx
    
    # Obtain SSL certificate
    certbot --nginx -d ai.yourdomain.com

    Securing Your Deployment

    Restrict Direct Port Access

    Once Nginx is configured, remove the direct port 8080 firewall rule so all traffic flows through the reverse proxy:

    Remove Direct Access
    ufw delete allow 8080/tcp

    Rate Limiting with Nginx

    Nginx Rate Limiting
    # Add to the http block in /etc/nginx/nginx.conf
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
    
    # Add inside your location block
    limit_req zone=api burst=20 nodelay;

    Fail2ban Integration

    Install Fail2ban
    apt install -y fail2ban
    systemctl enable fail2ban
    systemctl start fail2ban
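
    On Debian/Ubuntu, Fail2ban enables an SSH jail by default; you can tighten it with a local override. The values below are suggested starting points, not defaults, so adjust them to taste:

```
# /etc/fail2ban/jail.local
[sshd]
enabled = true
maxretry = 5
findtime = 10m
bantime = 1h
```

    Restart Fail2ban after editing (systemctl restart fail2ban) for the override to take effect.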

    Keep Everything Updated

    Update System & LocalAI
    # Update system packages
    apt update && apt upgrade -y
    
    # Update LocalAI container
    cd /opt/localai
    docker compose pull
    docker compose up -d

    Performance Tuning

    Environment Variables

    Variable          | Default | Description
    THREADS           | Auto    | CPU threads for inference (set to vCPU count)
    CONTEXT_SIZE      | 512     | Maximum context length in tokens
    F16               | false   | Half-precision for faster inference
    PARALLEL_REQUEST  | false   | Enable parallel request processing
    WATCHDOG_IDLE     | true    | Unload idle models to free memory
    WATCHDOG_BUSY     | true    | Restart stuck model processes
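
    Applied to the Compose file from earlier, a tuned configuration for a 4-vCPU plan might look like this (the values are illustrative starting points, not LocalAI defaults):

```yaml
services:
  localai:
    environment:
      - THREADS=4              # match your vCPU count
      - CONTEXT_SIZE=4096
      - F16=true               # half-precision for faster inference
      - PARALLEL_REQUEST=true  # serve concurrent requests
      - WATCHDOG_IDLE=true     # unload idle models to free memory
      - WATCHDOG_BUSY=true     # restart stuck model processes
```

    Run docker compose up -d again after changing environment variables so the container is recreated with the new settings.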

    Model Configuration Files

    Create YAML configuration files in your models directory for fine-grained control:

    models/mistral.yaml
    name: mistral-7b-instruct
    backend: llama-cpp
    parameters:
      model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
      context_size: 4096
      threads: 4
    template:
      chat_message: "[INST] {{.Input}} [/INST]"

    Memory Optimization Tips

    • Use Q4_K_M quantization for the best balance of quality and memory efficiency
    • Set CONTEXT_SIZE only as high as you need — 4096 is a good default
    • Enable WATCHDOG_IDLE to automatically unload unused models
    • Run one model at a time on VPS plans with 8 GB RAM or less
    • Monitor memory usage with: docker stats local-ai

    Deploying LocalAGI (Optional)

    LocalAGI is an autonomous AI agent platform that integrates with LocalAI. It provides OpenAI Responses API compatibility and supports advanced agentic capabilities including tool use, multi-step reasoning, and persistent memory.

    Deploy LocalAGI

    Clone & Deploy
    git clone https://github.com/mudler/LocalAGI
    cd LocalAGI
    
    # CPU deployment
    docker compose up -d
    
    # Or with NVIDIA GPU
    docker compose -f docker-compose.nvidia.yaml up -d
    Customize Model
    # Set a specific model
    MODEL_NAME=gemma-3-12b-it docker compose up -d
    
    # Full multimodal setup with image generation
    MODEL_NAME=gemma-3-12b-it \
    MULTIMODAL_MODEL=minicpm-v-4_5 \
    IMAGE_MODEL=flux.1-dev-ggml \
    docker compose up -d

    Next Steps

    • Connect a front-end UI like Open WebUI or LibreChat for a polished chat experience
    • Set up LocalRecall for RAG (retrieval-augmented generation) with your own documents
    • Explore the model gallery for specialized models — image generation, code completion, audio transcription
    • Set up monitoring with Grafana and Prometheus to track inference latency
    • Implement automated backups of your model configurations and data
    • Explore MCP integration for building autonomous AI agents