LLM Provider Configuration
Configure cloud LLMs (OpenAI, Claude, Gemini, Groq) and local models with Ollama. Optimize for cost, privacy, or capability with hybrid setups.
Agent Zero is model-agnostic—it works with virtually any LLM provider, from cloud APIs to locally-hosted models. This flexibility lets you optimize for cost, privacy, speed, or capability depending on your needs.
Understanding Agent Zero's Model Architecture
Agent Zero uses three types of models for different purposes:
Chat Model
The primary reasoning engine. Handles complex tasks, code generation, and multi-step problem solving. This should be your most capable model.
Utility Model
Handles lightweight tasks like memory summarization, context compression, and quick lookups. Can be smaller and cheaper than the chat model.
Embedding Model
Converts text into vector representations for memory search and knowledge retrieval. Runs frequently but uses minimal resources.
This separation lets you allocate expensive, capable models where they matter most while using efficient models for routine operations.
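For orientation, this split maps onto three default settings in Agent Zero's .env file, each configured in detail below. A hypothetical allocation pairs a strong cloud model for chat with cheaper models for the routine work:
# Illustrative split only - concrete provider setups follow below
CHAT_MODEL_DEFAULT=gpt-4o
UTILITY_MODEL_DEFAULT=gpt-4o-mini
EMBEDDING_MODEL_DEFAULT=text-embedding-3-small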
OpenAI Configuration
OpenAI offers the most straightforward setup and remains a solid default choice. Edit your .env file:
cd ~/agent-zero
nano .env
Configure OpenAI:
# OpenAI Configuration
API_KEY_OPENAI=sk-your-api-key-here
# Model Selection
CHAT_MODEL_OPENAI=gpt-4o
UTILITY_MODEL_OPENAI=gpt-4o-mini
EMBEDDING_MODEL_OPENAI=text-embedding-3-small
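Before wiring the key into Agent Zero, you can confirm it works by hitting OpenAI's model-listing endpoint directly (substitute your real key; this is just a sanity check, not part of Agent Zero itself):
curl https://api.openai.com/v1/models -H "Authorization: Bearer sk-your-api-key-here"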
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Best reasoning | gpt-4o | Strongest for complex code and multi-step tasks |
| Balanced | gpt-4o-mini | Good capability at lower cost |
| Budget | gpt-3.5-turbo | Faster, cheaper, less capable |
Anthropic Claude
Claude excels at nuanced reasoning and longer context windows. Configure it alongside or instead of OpenAI:
# Anthropic Configuration
API_KEY_ANTHROPIC=sk-ant-your-api-key-here
# Model Selection
CHAT_MODEL_ANTHROPIC=claude-sonnet-4-20250514
UTILITY_MODEL_ANTHROPIC=claude-3-5-haiku-20241022
EMBEDDING_MODEL_ANTHROPIC=voyage-2 # Note: Anthropic uses Voyage for embeddings
Get your API key from console.anthropic.com.
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Best reasoning | claude-sonnet-4-20250514 | Excellent for code and analysis |
| Extended context | claude-sonnet-4-20250514 | 200K token context window |
| Budget | claude-3-5-haiku-20241022 | Fast and cost-effective |
Groq (Fast Inference)
Groq provides extremely fast inference using custom hardware. Response times are often 10x faster than other providers, making it excellent for interactive use:
# Groq Configuration
API_KEY_GROQ=gsk_your-api-key-here
# Model Selection
CHAT_MODEL_GROQ=llama-3.3-70b-versatile
UTILITY_MODEL_GROQ=llama-3.1-8b-instant
EMBEDDING_MODEL_GROQ=nomic-embed-text # Or use another provider
Get your API key from console.groq.com. Groq's free tier is generous for testing, and paid usage remains very affordable.
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Best reasoning | llama-3.3-70b-versatile | Strong open model, very fast |
| Balanced | mixtral-8x7b-32768 | Good MoE model with 32K context |
| Speed | llama-3.1-8b-instant | Blazing fast for simple tasks |
Google Gemini
Gemini offers competitive models with generous free tiers:
# Google Configuration
API_KEY_GOOGLE=your-api-key-here
# Model Selection
CHAT_MODEL_GOOGLE=gemini-1.5-pro
UTILITY_MODEL_GOOGLE=gemini-1.5-flash
EMBEDDING_MODEL_GOOGLE=text-embedding-004
Get your API key from aistudio.google.com.
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Best reasoning | gemini-1.5-pro | 1M token context, strong multimodal |
| Balanced | gemini-1.5-flash | Fast with good capability |
| Budget | gemini-1.5-flash-8b | Smallest and cheapest |
Setting Default Models
After configuring providers, set which models Agent Zero uses by default:
# Default Model Configuration
CHAT_MODEL_DEFAULT=gpt-4o
UTILITY_MODEL_DEFAULT=gpt-4o-mini
EMBEDDING_MODEL_DEFAULT=text-embedding-3-small
You can mix providers. For example, use OpenAI for chat, Groq for utility tasks, and a local embedding model:
CHAT_MODEL_DEFAULT=gpt-4o
UTILITY_MODEL_DEFAULT=llama-3.1-8b-instant
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
Local LLMs with Ollama
Running models locally eliminates API costs and keeps all data on your server. Ollama makes local LLM deployment straightforward.
Hardware Considerations
Local inference requires significant RAM. The model must fit entirely in memory:
| Model Parameters | Minimum RAM | Recommended RAM | Example Models |
|---|---|---|---|
| 7-8B | 8 GB | 12 GB | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B |
| 13-14B | 16 GB | 20 GB | Qwen2.5 14B |
| 30-34B | 32 GB | 40 GB | Qwen2.5 32B, CodeLlama 34B |
| 70B | 64 GB | 80 GB | Llama 3.1 70B |
CPU inference is slower than GPU but entirely usable for async workflows. Expect 5-15 tokens/second on a modern CPU with 8B models, compared to 50+ tokens/second with a GPU. For most RamNode deployments, 8B parameter models hit the sweet spot.
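As a rough sanity check on these figures, assuming Ollama's default 4-bit quantized builds: weights take about half a byte per parameter, and the KV cache plus runtime overhead add a couple of gigabytes on top. Once Ollama is installed (next section), the SIZE column of ollama list is a reasonable proxy for each model's weight footprint:
# Back-of-the-envelope estimate for a 4-bit quantized 8B model:
#   weights: 8B parameters x ~0.5 bytes each ≈ 4 GB
#   KV cache + runtime overhead: roughly 2-3 GB
#   total: ~6-7 GB in practice, hence the 8 GB minimum above
ollama list   # SIZE column approximates weight memory per model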
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
Ollama runs as a systemd service automatically:
sudo systemctl status ollama
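To confirm the API itself is reachable (Ollama listens on port 11434 by default), a quick request to the model-listing endpoint should return JSON:
curl http://localhost:11434/api/tags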
Pull Models
Download models you want to use. Start with a capable general-purpose model:
# Recommended starting model - excellent balance of capability and size
ollama pull qwen2.5:7b
# Alternative options
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull codellama:7b # Specialized for code
For the utility model, a smaller variant works well:
ollama pull qwen2.5:3b
# or
ollama pull llama3.2:3b
Pull an embedding model:
ollama pull nomic-embed-text
# or
ollama pull mxbai-embed-large
List installed models:
ollama list
Test Local Inference
Verify models work before configuring Agent Zero:
ollama run qwen2.5:7b "Write a Python function that calculates factorial"
You should see the model generate a response. (If you start an interactive session by running the model without a prompt, exit with /bye or Ctrl+D.) Check inference speed:
time ollama run qwen2.5:7b "What is 2+2?" --verbose
The --verbose flag shows tokens per second.
Configure Agent Zero for Ollama
Edit your .env file to use local models:
nano ~/agent-zero/.env
Add Ollama configuration:
# Ollama Configuration
API_URL_OLLAMA=http://localhost:11434
# Model Selection (use exact names from 'ollama list')
CHAT_MODEL_OLLAMA=qwen2.5:7b
UTILITY_MODEL_OLLAMA=qwen2.5:3b
EMBEDDING_MODEL_OLLAMA=nomic-embed-text
Set Ollama models as defaults:
# Default to local models
CHAT_MODEL_DEFAULT=qwen2.5:7b
UTILITY_MODEL_DEFAULT=qwen2.5:3b
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
Ollama Performance Tuning
Increase Context Length
By default, Ollama uses 2048 token context. For Agent Zero's complex workflows, increase this:
# Create a custom model with larger context
ollama create qwen2.5-32k -f - <<EOF
FROM qwen2.5:7b
PARAMETER num_ctx 32768
EOF
Update your .env to use the custom model:
CHAT_MODEL_OLLAMA=qwen2.5-32k
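To verify the custom model actually carries the larger context window, ollama show prints the parameters baked into it:
ollama show qwen2.5-32k   # the Parameters section should list num_ctx 32768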
Configure Memory Usage
Ollama automatically manages GPU/CPU memory, but you can tune behavior:
# Edit Ollama service configuration
sudo systemctl edit ollama
Add environment variables:
[Service]
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"Restart Ollama:
sudo systemctl restart ollama
Keep Models Loaded
By default, Ollama unloads models after 5 minutes of inactivity. For responsive agents, keep models loaded:
# Set a longer keep-alive for the Ollama server (add it to the systemd override shown above)
OLLAMA_KEEP_ALIVE=24h
Or load models persistently:
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5:7b", "keep_alive": -1}'
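To see which models are currently resident in memory, and for how long, use ollama ps:
ollama ps   # the UNTIL column reflects the keep-alive setting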
Recommended Local Model Combinations
Memory-Constrained (8GB RAM)
CHAT_MODEL_DEFAULT=qwen2.5:7b
UTILITY_MODEL_DEFAULT=qwen2.5:3b
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
Balanced (16GB RAM)
CHAT_MODEL_DEFAULT=qwen2.5:14b
UTILITY_MODEL_DEFAULT=qwen2.5:7b
EMBEDDING_MODEL_DEFAULT=mxbai-embed-large
Code-Focused (16GB RAM)
CHAT_MODEL_DEFAULT=deepseek-coder:6.7b
UTILITY_MODEL_DEFAULT=qwen2.5:3b
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
Hybrid Configurations
The most practical setup often combines cloud and local models—using local inference for routine tasks and cloud APIs for complex reasoning.
Strategy 1: Local Utility, Cloud Chat
Use local models for frequent, simple operations while reserving cloud APIs for heavy lifting:
# Cloud for complex reasoning
CHAT_MODEL_DEFAULT=gpt-4o
# Local for utility tasks (no API cost)
UTILITY_MODEL_DEFAULT=qwen2.5:3b
# Local embeddings (runs constantly, saves significant cost)
EMBEDDING_MODEL_DEFAULT=nomic-embed-text
This dramatically reduces API costs since embedding and utility calls happen far more frequently than chat completions.
Strategy 2: Fast Local, Powerful Cloud Fallback
Use fast local models for initial attempts, escalating to cloud for difficult tasks:
# Start with local
CHAT_MODEL_DEFAULT=qwen2.5:7b
# Configure cloud as available alternative
CHAT_MODEL_OPENAI=gpt-4o
You can then instruct Agent Zero in custom prompts to escalate to more powerful models when local inference struggles.
Strategy 3: Provider Redundancy
Configure multiple providers for reliability:
# Primary
CHAT_MODEL_DEFAULT=gpt-4o
API_KEY_OPENAI=sk-...
# Backup
CHAT_MODEL_ANTHROPIC=claude-sonnet-4-20250514
API_KEY_ANTHROPIC=sk-ant-...
# Local fallback
CHAT_MODEL_OLLAMA=qwen2.5:7b
If one provider has an outage or rate limits you, alternatives are ready.
API Key Security
API keys grant access to paid services. Protect them:
Restrict File Permissions
chmod 600 ~/agent-zero/.env
This ensures only your user can read the file.
Use Environment Variables
For production, consider loading keys from environment variables rather than files:
# In ~/.bashrc or service file
export API_KEY_OPENAI="sk-..."
Then reference in .env:
API_KEY_OPENAI=${API_KEY_OPENAI}
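If you run Agent Zero as a systemd service (as in the restart example later in this guide), a drop-in override is another way to supply keys without keeping them in the project's .env file. A minimal sketch, added via sudo systemctl edit agent-zero:
[Service]
Environment="API_KEY_OPENAI=sk-..."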
Set Usage Limits
Most providers let you set spending caps:
- OpenAI: Settings → Limits → Set monthly budget
- Anthropic: Settings → Limits → Usage limits
- Google: Cloud Console → Budgets & Alerts
Set alerts at 50% and 80% of your budget to catch runaway usage.
Rotate Keys Periodically
Generate new API keys monthly and revoke old ones. This limits exposure if a key is compromised.
Testing Your Configuration
After configuring providers, restart Agent Zero and verify each model works:
# Restart if running as service
sudo systemctl restart agent-zero
# Or restart manually
cd ~/agent-zero
source venv/bin/activate
python run_ui.py
Test Prompts
Test chat model:
"Write a Python script that fetches weather data from an API, parses the JSON response, and formats it nicely for terminal output."
Test utility model (happens automatically with memory operations):
"Remember that my favorite programming language is Python."
Test embeddings (happens automatically with knowledge queries):
"Search your knowledge for information about Python."
Check logs for any model loading errors:
journalctl -u agent-zero -f
# And in another terminal:
journalctl -u ollama -f
Provider Comparison Summary
| Provider | Speed | Cost | Privacy | Best For |
|---|---|---|---|---|
| OpenAI | Fast | Medium | Cloud | General use, broad capability |
| Anthropic | Fast | Medium | Cloud | Complex reasoning, long context |
| Groq | Very Fast | Low | Cloud | Interactive use, speed-critical |
| Google | Fast | Low/Free | Cloud | Budget-conscious, multimodal |
| Ollama | Slower | Free | Local | Privacy, no ongoing costs |
What's Next
Your Agent Zero instance can now leverage multiple LLM providers, from powerful cloud APIs to fully private local models. In Part 4: Memory Systems & Knowledge Management, we'll explore:
- How Agent Zero's memory architecture works
- Configuring persistent storage for agent learning
- Building custom knowledge bases from your documents
- Setting up SearXNG for private web search
- Optimizing memory for long-running agents
The memory system is what transforms Agent Zero from a stateless chatbot into a genuinely useful assistant that improves over time.
