LLM Provider Configuration
Configure cloud LLMs (OpenAI, Claude, Gemini, Groq) and local models with Ollama. Optimize for cost, privacy, or capability with hybrid setups.
Agent Zero is model-agnostic—it works with virtually any LLM provider, from cloud APIs to locally-hosted models. This flexibility lets you optimize for cost, privacy, speed, or capability depending on your needs.
Understanding Agent Zero's Model Architecture
Agent Zero uses three types of models for different purposes:
Chat Model
The primary reasoning engine. Handles complex tasks, code generation, and multi-step problem solving. This should be your most capable model.
Utility Model
Handles lightweight tasks like memory summarization, context compression, and quick lookups. Can be smaller and cheaper than the chat model.
Embedding Model
Converts text into vector representations for memory search and knowledge retrieval. Runs frequently but uses minimal resources.
This separation lets you allocate expensive, capable models where they matter most while using efficient models for routine operations.
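For orientation, this split maps onto three default settings in Agent Zero's .env file, each configured in detail below. A hypothetical allocation pairs a strong cloud model for chat with cheaper models for the routine work:
# Illustrative split only - concrete provider setups follow below
CHAT_MODEL_DEFAULT=gpt-4o
UTILITY_MODEL_DEFAULT=gpt-4o-mini
EMBEDDING_MODEL_DEFAULT=text-embedding-3-small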
OpenAI Configuration
OpenAI offers the most straightforward setup and remains a solid default choice. Edit your .env file:
cd ~/agent-zero
nano .env
Configure OpenAI:
# OpenAI Configuration
API_KEY_OPENAI=sk-your-api-key-here
# Model Selection
CHAT_MODEL_OPENAI=gpt-4o
UTILITY_MODEL_OPENAI=gpt-4o-mini
EMBEDDING_MODEL_OPENAI=text-embedding-3-small
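Before wiring the key into Agent Zero, you can confirm it works by hitting OpenAI's model-listing endpoint directly (substitute your real key; this is just a sanity check, not part of Agent Zero itself):
curl https://api.openai.com/v1/models -H "Authorization: Bearer sk-your-api-key-here"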
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Best reasoning | gpt-4o | Strongest for complex code and multi-step tasks |
| Balanced | gpt-4o-mini | Good capability at lower cost |
| Budget | gpt-3.5-turbo | Faster, cheaper, less capable |
Anthropic Claude
Claude excels at nuanced reasoning and longer context windows. Configure it alongside or instead of OpenAI:
# Anthropic Configuration
API_KEY_ANTHROPIC=sk-ant-your-api-key-here
# Model Selection
CHAT_MODEL_ANTHROPIC=claude-sonnet-4-20250514
UTILITY_MODEL_ANTHROPIC=claude-3-5-haiku-20241022
EMBEDDING_MODEL_ANTHROPIC=voyage-2 # Note: Anthropic uses Voyage for embeddings
Get your API key from console.anthropic.com.
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Best reasoning | claude-sonnet-4-20250514 | Excellent for code and analysis |
| Extended context | claude-sonnet-4-20250514 | 200K token context window |
| Budget | claude-3-5-haiku-20241022 | Fast and cost-effective |
Groq (Fast Inference)
Groq provides extremely fast inference using custom hardware. Response times are often 10x faster than other providers, making it excellent for interactive use:
# Groq Configuration
API_KEY_GROQ=gsk_your-api-key-here
# Model Selection
CHAT_MODEL_GROQ=llama-3.3-70b-versatile
UTILITY_MODEL_GROQ=llama-3.1-8b-instant
EMBEDDING_MODEL_GROQ=nomic-embed-text # Or use another provider
Get your API key from console.groq.com. Groq's free tier is generous for testing, and paid usage remains very affordable.
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Best reasoning | llama-3.3-70b-versatile | Strong open model, very fast |
| Balanced | mixtral-8x7b-32768 | Good MoE model with 32K context |
| Speed | llama-3.1-8b-instant | Blazing fast for simple tasks |
Google Gemini
Gemini offers competitive models with generous free tiers:
# Google Configuration
API_KEY_GOOGLE=your-api-key-here
# Model Selection
CHAT_MODEL_GOOGLE=gemini-1.5-pro
UTILITY_MODEL_GOOGLE=gemini-1.5-flash
EMBEDDING_MODEL_GOOGLE=text-embedding-004
Get your API key from aistudio.google.com.
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Best reasoning | gemini-1.5-pro | 1M token context, strong multimodal |
| Balanced | gemini-1.5-flash | Fast with good capability |
| Budget | gemini-1.5-flash-8b | Smallest and cheapest |
Setting Default Models
After configuring providers, set which models Agent Zero uses by default:
# Default Model Configuration
CHAT_MODEL_DEFAULT=gpt-4o
UTILITY_MODEL_DEFAULT=gpt-4o-mini
EMBEDDING_MODEL_DEFAULT=text-embedding-3-small
You can mix providers. For example, use OpenAI for chat, Groq for utility tasks, and a local embedding model:
CHAT_MODEL_DEFAULT=gpt-4o
UTILITY_MODEL_DEFAULT=llama-3.1-8b-instant
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
Local LLMs with Ollama
Running models locally eliminates API costs and keeps all data on your server. Ollama makes local LLM deployment straightforward.
Hardware Considerations
Local inference requires significant RAM. The model must fit entirely in memory:
| Model Parameters | Minimum RAM | Recommended RAM | Example Models |
|---|---|---|---|
| 7-8B | 8 GB | 12 GB | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B |
| 13-14B | 16 GB | 20 GB | Qwen2.5 14B |
| 30-34B | 32 GB | 40 GB | Qwen2.5 32B, CodeLlama 34B |
| 70B | 64 GB | 80 GB | Llama 3.1 70B |
CPU inference is slower than GPU but entirely usable for async workflows. Expect 5-15 tokens/second on a modern CPU with 8B models, compared to 50+ tokens/second with a GPU. For most RamNode deployments, 8B parameter models hit the sweet spot.
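As a rough sanity check on these figures, assuming Ollama's default 4-bit quantized builds: weights take about half a byte per parameter, and the KV cache plus runtime overhead add a couple of gigabytes on top. Once Ollama is installed (next section), the SIZE column of ollama list is a reasonable proxy for each model's weight footprint:
# Back-of-the-envelope estimate for a 4-bit quantized 8B model:
#   weights: 8B parameters x ~0.5 bytes each ≈ 4 GB
#   KV cache + runtime overhead: roughly 2-3 GB
#   total: ~6-7 GB in practice, hence the 8 GB minimum above
ollama list   # SIZE column approximates weight memory per model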
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
Ollama runs as a systemd service automatically:
sudo systemctl status ollama
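To confirm the API itself is reachable (Ollama listens on port 11434 by default), a quick request to the model-listing endpoint should return JSON:
curl http://localhost:11434/api/tags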
Pull Models
Download models you want to use. Start with a capable general-purpose model:
# Recommended starting model - excellent balance of capability and size
ollama pull qwen2.5:7b
# Alternative options
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull codellama:7b # Specialized for code
For the utility model, a smaller variant works well:
ollama pull qwen2.5:3b
# or
ollama pull llama3.2:3b
Pull an embedding model:
ollama pull nomic-embed-text
# or
ollama pull mxbai-embed-large
List installed models:
ollama list
Test Local Inference
Verify models work before configuring Agent Zero:
ollama run qwen2.5:7b "Write a Python function that calculates factorial"
You should see the model generate a response. (If you start an interactive session by running the model without a prompt, exit with /bye or Ctrl+D.) Check inference speed:
time ollama run qwen2.5:7b "What is 2+2?" --verbose
The --verbose flag shows tokens per second.
Configure Agent Zero for Ollama
Edit your .env file to use local models:
nano ~/agent-zero/.env
Add Ollama configuration:
# Ollama Configuration
API_URL_OLLAMA=http://localhost:11434
# Model Selection (use exact names from 'ollama list')
CHAT_MODEL_OLLAMA=qwen2.5:7b
UTILITY_MODEL_OLLAMA=qwen2.5:3b
EMBEDDING_MODEL_OLLAMA=nomic-embed-text
Set Ollama models as defaults:
# Default to local models
CHAT_MODEL_DEFAULT=qwen2.5:7b
UTILITY_MODEL_DEFAULT=qwen2.5:3b
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
Ollama Performance Tuning
Increase Context Length
By default, Ollama uses 2048 token context. For Agent Zero's complex workflows, increase this:
# Create a custom model with larger context
ollama create qwen2.5-32k -f - <<EOF
FROM qwen2.5:7b
PARAMETER num_ctx 32768
EOF
Update your .env to use the custom model:
CHAT_MODEL_OLLAMA=qwen2.5-32k
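To verify the custom model actually carries the larger context window, ollama show prints the parameters baked into it:
ollama show qwen2.5-32k   # the Parameters section should list num_ctx 32768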
Configure Memory Usage
Ollama automatically manages GPU/CPU memory, but you can tune behavior:
# Edit Ollama service configuration
sudo systemctl edit ollama
Add environment variables:
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"Restart Ollama:
sudo systemctl restart ollama
Keep Models Loaded
By default, Ollama unloads models after 5 minutes of inactivity. For responsive agents, keep models loaded:
# Set a longer keep-alive for the Ollama server (add it to the systemd override shown above)
OLLAMA_KEEP_ALIVE=24h
Or load models persistently:
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5:7b", "keep_alive": -1}'
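To see which models are currently resident in memory, and for how long, use ollama ps:
ollama ps   # the UNTIL column reflects the keep-alive setting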
Recommended Local Model Combinations
Memory-Constrained (8GB RAM)
CHAT_MODEL_DEFAULT=qwen2.5:7b
UTILITY_MODEL_DEFAULT=qwen2.5:3b
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
Balanced (16GB RAM)
CHAT_MODEL_DEFAULT=qwen2.5:14b
UTILITY_MODEL_DEFAULT=qwen2.5:7b
EMBEDDING_MODEL_DEFAULT=mxbai-embed-large
Code-Focused (16GB RAM)
CHAT_MODEL_DEFAULT=deepseek-coder:6.7b
UTILITY_MODEL_DEFAULT=qwen2.5:3b
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
Hybrid Configurations
The most practical setup often combines cloud and local models—using local inference for routine tasks and cloud APIs for complex reasoning.
Strategy 1: Local Utility, Cloud Chat
Use local models for frequent, simple operations while reserving cloud APIs for heavy lifting:
# Cloud for complex reasoning
CHAT_MODEL_DEFAULT=gpt-4o
# Local for utility tasks (no API cost)
UTILITY_MODEL_DEFAULT=qwen2.5:3b
# Local embeddings (runs constantly, saves significant cost)
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
This dramatically reduces API costs since embedding and utility calls happen far more frequently than chat completions.
Strategy 2: Fast Local, Powerful Cloud Fallback
Use fast local models for initial attempts, escalating to cloud for difficult tasks:
# Start with local
CHAT_MODEL_DEFAULT=qwen2.5:7b
# Configure cloud as available alternative
CHAT_MODEL_OPENAI=gpt-4o
You can then instruct Agent Zero in custom prompts to escalate to more powerful models when local inference struggles.
Strategy 3: Provider Redundancy
Configure multiple providers for reliability:
# Primary
CHAT_MODEL_DEFAULT=gpt-4o
API_KEY_OPENAI=sk-...
# Backup
CHAT_MODEL_ANTHROPIC=claude-sonnet-4-20250514
API_KEY_ANTHROPIC=sk-ant-...
# Local fallback
CHAT_MODEL_OLLAMA=qwen2.5:7b
If one provider has an outage or rate limits you, alternatives are ready.
API Key Security
API keys grant access to paid services. Protect them:
Restrict File Permissions
chmod 600 ~/agent-zero/.env
This ensures only your user can read the file.
Use Environment Variables
For production, consider loading keys from environment variables rather than files:
# In ~/.bashrc or service file
export API_KEY_OPENAI="sk-..."
Then reference in .env:
API_KEY_OPENAI=${API_KEY_OPENAI}
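If you run Agent Zero as a systemd service (as in the restart example later in this guide), a drop-in override is another way to supply keys without keeping them in the project's .env file. A minimal sketch, added via sudo systemctl edit agent-zero:
[Service]
Environment="API_KEY_OPENAI=sk-..."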
Set Usage Limits
Most providers let you set spending caps:
- OpenAI: Settings → Limits → Set monthly budget
- Anthropic: Settings → Limits → Usage limits
- Google: Cloud Console → Budgets & Alerts
Set alerts at 50% and 80% of your budget to catch runaway usage.
Rotate Keys Periodically
Generate new API keys monthly and revoke old ones. This limits exposure if a key is compromised.
Testing Your Configuration
After configuring providers, restart Agent Zero and verify each model works:
# Restart if running as service
sudo systemctl restart agent-zero
# Or restart manually
cd ~/agent-zero
source venv/bin/activate
python run_ui.py
Test Prompts
Test chat model:
"Write a Python script that fetches weather data from an API, parses the JSON response, and formats it nicely for terminal output."
Test utility model (happens automatically with memory operations):
"Remember that my favorite programming language is Python."
Test embeddings (happens automatically with knowledge queries):
"Search your knowledge for information about Python."
Check logs for any model loading errors:
journalctl -u agent-zero -f
# And in another terminal:
journalctl -u ollama -f
Provider Comparison Summary
| Provider | Speed | Cost | Privacy | Best For |
|---|---|---|---|---|
| OpenAI | Fast | Medium | Cloud | General use, broad capability |
| Anthropic | Fast | Medium | Cloud | Complex reasoning, long context |
| Groq | Very Fast | Low | Cloud | Interactive use, speed-critical |
| Google | Fast | Low/Free | Cloud | Budget-conscious, multimodal |
| Ollama | Slower | Free | Local | Privacy, no ongoing costs |
What's Next
Your Agent Zero instance can now leverage multiple LLM providers, from powerful cloud APIs to fully private local models. In Part 4: Memory Systems & Knowledge Management, we'll explore:
- How Agent Zero's memory architecture works
- Configuring persistent storage for agent learning
- Building custom knowledge bases from your documents
- Setting up SearXNG for private web search
- Optimizing memory for long-running agents
The memory system is what transforms Agent Zero from a stateless chatbot into a genuinely useful assistant that improves over time.
