Ollama — Run LLMs on CPU
Deploy large language models on your VPS without a GPU. Model selection for 2GB, 4GB, and 8GB RAM tiers starting at $4/month.
RamNode VPS (any plan), Ubuntu 22.04/24.04, SSH access
20–30 minutes
2GB ($10/mo) for small models; 4GB ($20/mo) recommended; 8GB ($40/mo) for best selection
Looking for a quick-start guide? Check out our standalone Ollama Deployment Guide for a streamlined setup walkthrough.
Introduction
ChatGPT Plus costs $20/month per user. OpenAI API costs scale unpredictably with usage. And every prompt you send leaves your infrastructure, exposing sensitive data to third-party servers.
There's a better way. Modern quantized language models run surprisingly well on CPU-only VPS instances — no expensive GPU required. By the end of this 8-part series, you'll have a fully private AI platform running on a single RamNode VPS for a fraction of the cost.
💰 Series Cost Comparison
Commercial AI stack (ChatGPT Team + Copilot + Pinecone + Zapier AI + API costs): $390–690+/month. Your RamNode VPS running the complete self-hosted stack: $40/month.
Why Ollama on CPU?
GPU instances cost $50–300+/month. But modern quantized models in GGUF format run well on CPU-only hardware thanks to efficient inference engines.
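Ollama's CPU backend is built on llama.cpp, which leans heavily on SIMD instructions. A quick sketch for checking what your VPS actually offers (output varies by host):

```shell
# Core count and SIMD support matter most for CPU inference speed;
# llama.cpp-based engines are markedly faster when AVX2 is available.
nproc                                # cores available to Ollama
grep -m1 -o 'avx2' /proc/cpuinfo \
  || echo "no AVX2 (expect slower inference)"
```

If the second command prints `avx2`, the optimized code paths will be used automatically; no extra configuration is needed.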
Understanding Quantization
Quantization reduces model precision from 16-bit floats to lower bit widths, dramatically shrinking memory requirements with minimal quality loss:
| Quantization | Size Reduction | Quality Impact | Best For |
|---|---|---|---|
| Q8_0 | ~50% | Minimal | 8GB+ RAM, best quality |
| Q5_K_M | ~65% | Very low | 4–8GB RAM, great balance |
| Q4_K_M | ~75% | Low | Sweet spot for CPU inference |
| Q3_K_M | ~80% | Moderate | 2GB RAM, constrained setups |
Q4_K_M is the sweet spot — it preserves most of the model's capability while fitting comfortably in modest RAM allocations. RamNode's VPS plans offer the best price-performance ratio for this workload.
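As a rule of thumb, the weight footprint is roughly parameters × average bits-per-weight ÷ 8. A quick sketch with illustrative figures (real usage adds context-cache and runtime overhead on top):

```shell
# Rough weight-memory estimate: params (billions) * avg bits/weight / 8 ≈ GB.
# Q4_K_M averages roughly 4.5 bits per weight; figures are illustrative.
awk 'BEGIN { printf "%.2f GB\n", 7 * 4.5 / 8 }'   # 7B model at Q4_K_M → 3.94 GB
```

That is why a 7B Q4_K_M model lands near 4 GB on disk and fits a 4GB-tier plan, while the same model at 16-bit precision would need ~14 GB.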
Installing Ollama
Ollama provides a one-line installer that handles everything:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
Configure for Network Access
By default, Ollama listens only on localhost. For later parts of this series (Open WebUI, n8n, etc.), configure it to accept connections from Docker containers:
sudo systemctl edit ollama.service
Add the following override:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"
Reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
⚠️ Security Note: Setting OLLAMA_HOST=0.0.0.0 exposes Ollama on all interfaces. Protect it with a firewall — only allow access from localhost and your Docker network:
sudo ufw allow from 172.16.0.0/12 to any port 11434
sudo ufw deny 11434
Model Selection by RAM Tier
This is the centerpiece of your Ollama deployment. Choose models based on your VPS plan's available RAM:
2GB RAM — $10/month
Small but capable models for lightweight tasks:
| Model | Size | RAM Usage | Best For |
|---|---|---|---|
| tinyllama | 637 MB | ~1.1 GB | Quick Q&A, summarization |
| phi | 1.6 GB | ~1.8 GB | Reasoning, general tasks |
| gemma:2b | 1.4 GB | ~1.6 GB | Google's efficient small model |
ollama pull tinyllama
ollama pull phi
4GB RAM — $20/month (Recommended)
The sweet spot — access to powerful 7B parameter models:
| Model | Size | RAM Usage | Best For |
|---|---|---|---|
| mistral | 4.1 GB | ~3.8 GB | General purpose, conversation |
| llama3.1:8b | 4.7 GB | ~3.9 GB | Meta's latest, excellent reasoning |
| codegemma | 5.0 GB | ~3.8 GB | Code generation & completion |
ollama pull mistral
ollama pull llama3.1:8b
8GB RAM — $40/month
Full model selection — larger quantizations and bigger models:
| Model | Size | RAM Usage | Best For |
|---|---|---|---|
| llama3.1:8b (Q5) | 5.5 GB | ~6.2 GB | Higher quality reasoning |
| deepseek-coder:6.7b | 3.8 GB | ~5.5 GB | Specialized code generation |
| gemma2:9b | 5.4 GB | ~7 GB | Google's strong mid-size generalist |
ollama pull llama3.1:8b
ollama pull deepseek-coder:6.7b
Running Your First Inference
Interactive Chat
Start an interactive session with your model:
ollama run mistral
Type your prompt and press Enter. Use /bye to exit.
REST API
Ollama exposes a REST API for programmatic access — this is how Open WebUI, n8n, and CrewAI will connect in later parts:
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Explain containers in one paragraph.",
"stream": false
}'
With "stream": true, tokens arrive as they are generated, which keeps interactive clients responsive:
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Write a bash script to monitor disk usage.",
"stream": true
}'
For multi-turn conversations, use the chat endpoint with role-tagged messages:
curl http://localhost:11434/api/chat -d '{
"model": "mistral",
"messages": [
{"role": "system", "content": "You are a helpful DevOps assistant."},
{"role": "user", "content": "How do I set up a cron job?"}
],
"stream": false
}'
Performance Tuning
Parallel Requests
Control how many requests Ollama handles simultaneously. For CPU inference, keep this conservative:
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Context Window vs RAM Tradeoff
Larger context windows use more RAM. The default is typically 2048 tokens. Adjust per-model:
FROM mistral
PARAMETER num_ctx 4096
Build the custom model from the Modelfile:
ollama create mistral-4k -f Modelfile
| Context Size | Extra RAM | Good For |
|---|---|---|
| 2048 (default) | Baseline | Short conversations, Q&A |
| 4096 | +~500 MB | Longer conversations, code review |
| 8192 | +~1.5 GB | Document analysis, RAG (Part 3) |
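Where does that extra RAM go? Mostly into the key/value cache, which grows linearly with context length. A back-of-envelope estimate using hypothetical Mistral-7B-style dimensions (32 layers, 8 KV heads, head size 128, 2-byte values; real overhead runs higher because it also scales with OLLAMA_NUM_PARALLEL and runtime buffers):

```shell
# KV cache ≈ 2 (K and V) * layers * ctx_tokens * kv_heads * head_dim * bytes.
# Model dimensions below are assumptions for a Mistral-7B-style architecture.
layers=32; kv_heads=8; head_dim=128; bytes=2; ctx=4096
awk -v l=$layers -v h=$kv_heads -v d=$head_dim -v b=$bytes -v c=$ctx \
  'BEGIN { printf "%d MiB\n", 2 * l * c * h * d * b / (1024*1024) }'   # → 512 MiB
```

Doubling `ctx` doubles this figure, which is why 8192-token contexts are only comfortable on the 8GB tier.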
Swap Space as Safety Net
Add swap to prevent OOM kills if a model temporarily exceeds available RAM:
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Monitor Resource Usage
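Before reaching for anything heavier, two built-in commands confirm the swapfile is active and show current headroom:

```shell
# swapon lists active swap devices; free shows totals including the new swap
swapon --show
free -h | grep -E '^(Mem|Swap)'
```

If `swapon --show` prints nothing, the swapfile is not enabled; re-run the `swapon` step above.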
# Watch memory and CPU in real-time
htop
# Check Ollama-specific resource usage
ollama ps
Persistence & Auto-Start
Ollama's installer configures systemd automatically. Verify it survives reboots:
sudo systemctl enable ollama
sudo systemctl status ollama
Model Pre-Pull Script
Create a script that ensures your preferred models are available after a fresh deployment:
#!/bin/bash
# Pre-pull models after restart
MODELS="mistral llama3.1:8b"
for model in $MODELS; do
echo "Pulling $model..."
ollama pull "$model"
done
echo "All models ready."
Save it as /usr/local/bin/ollama-pull-models.sh, then make it executable:
sudo chmod +x /usr/local/bin/ollama-pull-models.sh
Health Check
Verify Ollama is responding:
curl http://localhost:11434/api/tags
This returns a JSON list of all available models — useful for monitoring and integration testing.
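The same endpoint works well as a scripted probe. A minimal sketch, assuming the default port 11434 and that jq is installed; the canned `sample` payload mirrors the /api/tags response shape so the pipeline can be exercised without a live server:

```shell
# Succeeds only when Ollama answers within 5 seconds
ollama_healthy() {
  curl -sf --max-time 5 "http://${1:-localhost}:11434/api/tags" > /dev/null
}

# jq pulls model names out of the tags payload (one name per line);
# the sample below is a hand-written stand-in for a real response
sample='{"models":[{"name":"mistral:latest"},{"name":"tinyllama:latest"}]}'
echo "$sample" | jq -r '.models[].name'
```

Pipe the real `curl http://localhost:11434/api/tags` output through the same jq filter to list the models actually installed on your VPS.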
What's Next?
You now have a local LLM inference server running on your VPS. In Part 2: Open WebUI, we'll give your Ollama instance a polished ChatGPT-like interface with:
- Multi-user support with role-based access control
- Conversation history and model switching
- File uploads and document preview
- Custom system prompts and model presets
Running LLMs on an affordable CPU-only VPS is just the beginning. Part 2 turns it into a team-ready AI chat platform.
