# LLM Providers
General Bots supports multiple Large Language Model (LLM) providers, both cloud-based services and local deployments. This guide helps you choose the right provider for your use case.
## Overview
LLMs are the intelligence behind General Bots’ conversational capabilities. You can configure:
- Cloud Providers — External APIs (OpenAI, Anthropic, Google, etc.)
- Local Models — Self-hosted models via llama.cpp
- Hybrid — Use local for simple tasks, cloud for complex reasoning
## Cloud Providers

### OpenAI (GPT Series)

The most widely known LLM provider, offering GPT-5 as its flagship model.
| Model | Context | Best For | Speed |
|---|---|---|---|
| GPT-5 | 1M | All-in-one advanced reasoning | Medium |
| GPT-oss 120B | 128K | Open-weight, agent workflows | Medium |
| GPT-oss 20B | 128K | Cost-effective open-weight | Fast |
Configuration (config.csv):

```csv
name,value
llm-provider,openai
llm-model,gpt-5
```
Strengths:
- Most advanced all-in-one model
- Excellent general knowledge
- Strong code generation
- Good instruction following
Considerations:
- API costs can add up
- Data sent to external servers
- Rate limits apply
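
Before wiring a key into the bot, you can sanity-check it with a direct call to OpenAI's chat completions endpoint (a minimal example; the model name follows the table above):

```bash
# Smoke-test an OpenAI API key; replace sk-... with your key.
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer sk-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5", "messages": [{"role": "user", "content": "ping"}]}'
```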
### Anthropic (Claude Series)
Known for safety, helpfulness, and extended thinking capabilities.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Claude Opus 4.5 | 200K | Most capable, complex reasoning | Slow |
| Claude Sonnet 4.5 | 200K | Best balance of capability/speed | Fast |
Configuration (config.csv):

```csv
name,value
llm-provider,anthropic
llm-model,claude-sonnet-4.5
```
Strengths:
- Extended thinking mode for multi-step tasks
- Excellent at following complex instructions
- Strong coding abilities
- Better at refusing harmful requests
Considerations:
- Premium pricing
- Newer provider, smaller ecosystem
### Google (Gemini & Vertex AI)
Google’s multimodal AI models with strong reasoning capabilities. General Bots natively supports both the public AI Studio API and Enterprise Vertex AI.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | Complex reasoning, benchmarks | Medium |
| Gemini 1.5 Flash | 1M | Fast multimodal tasks | Fast |
Configuration for AI Studio (Public API):

```csv
name,value
llm-provider,google
llm-model,gemini-1.5-pro
llm-url,https://generativelanguage.googleapis.com
llm-key,AIza...
```
Configuration for Vertex AI (Enterprise):

```csv
name,value
llm-provider,vertex
llm-model,gemini-1.5-pro
llm-url,https://us-central1-aiplatform.googleapis.com
llm-key,~/.vertex.json
```
Note: General Bots handles the Google OAuth2 JWT authentication internally when you provide either the path to a Service Account JSON file (as above) or the raw JSON itself.
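
To confirm the Service Account works before pointing the bot at it, you can call the Vertex endpoint directly with a token from the gcloud CLI (a sketch; `YOUR_PROJECT` is a placeholder for your GCP project ID):

```bash
# Mint a short-lived access token from the Service Account credentials.
export GOOGLE_APPLICATION_CREDENTIALS=~/.vertex.json
TOKEN=$(gcloud auth application-default print-access-token)

# Call Gemini 1.5 Pro via the Vertex AI generateContent endpoint.
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT/locations/us-central1/publishers/google/models/gemini-1.5-pro:generateContent" \
  -d '{"contents": [{"role": "user", "parts": [{"text": "ping"}]}]}'
```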
Strengths:
- Largest context window (2M tokens)
- Native multimodal (text, image, video, audio)
- Vertex AI support enables enterprise VPC/IAM integration
Considerations:
- Different endpoints for public vs enterprise deployments
### xAI (Grok Series)

xAI's models integrate real-time data from the X platform.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Grok 4 | 128K | Real-time research, analysis | Fast |
Configuration (config.csv):

```csv
name,value
llm-provider,xai
llm-model,grok-4
```
Strengths:
- Real-time data access from X
- Strong research and analysis
- Good for trend analysis
Considerations:
- Newer provider
- X platform integration focus
### Groq
Ultra-fast inference using custom LPU hardware. Offers open-source models at high speed.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Llama 4 Scout | 10M | Long context, multimodal | Very Fast |
| Llama 4 Maverick | 1M | Complex tasks | Very Fast |
| Qwen3 | 128K | Efficient MoE architecture | Extremely Fast |
Configuration (config.csv):

```csv
name,value
llm-provider,groq
llm-model,llama-4-scout
```
Strengths:
- Fastest inference speeds (500+ tokens/sec)
- Competitive pricing
- Open-source models
- Great for real-time applications
Considerations:
- Rate limits on free tier
- Models may be less capable than GPT-5/Claude
### Mistral AI
European AI company offering efficient, open-weight models.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Mixtral-8x22B | 64K | Multi-language, coding | Fast |
Configuration (config.csv):

```csv
name,value
llm-provider,mistral
llm-model,mixtral-8x22b
```
Strengths:
- European data sovereignty (GDPR)
- Excellent code generation
- Open-weight models available
- Competitive pricing
- Proficient in multiple languages
Considerations:
- Smaller context than competitors
- Less brand recognition
### DeepSeek
Known for efficient, capable models with exceptional reasoning.
| Model | Context | Best For | Speed |
|---|---|---|---|
| DeepSeek-V3.1 | 128K | General purpose, optimized cost | Fast |
| DeepSeek-R1 | 128K | Reasoning, math, science | Medium |
Configuration (config.csv):

```csv
name,value
llm-provider,deepseek
llm-model,deepseek-r1
llm-url,https://api.deepseek.com
```
Strengths:
- Extremely cost-effective
- Strong reasoning (R1 model)
- Rivals proprietary leaders in performance
- Open-weight versions available (MIT/Apache 2.0)
Considerations:
- Data processed in China
- Newer provider
### Amazon Bedrock
AWS managed service for foundation models, supporting Claude, Llama, Titan, and others.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K | High capability tasks | Fast |
| Llama 3.1 70B | 128K | Open-weight performance | Fast |
Configuration (config.csv):

```csv
name,value
llm-provider,bedrock
llm-model,anthropic.claude-3-5-sonnet-20240620-v1:0
llm-url,https://bedrock-runtime.us-east-1.amazonaws.com/model/anthropic.claude-3-5-sonnet-20240620-v1:0/invoke
llm-key,YOUR_BEDROCK_API_KEY
```
Strengths:
- Native AWS integration
- Enterprise-grade security
- Multiple model families in one API
### Azure OpenAI
Enterprise-grade deployment of OpenAI models hosted on Microsoft Azure.
| Model | Context | Best For | Speed |
|---|---|---|---|
| GPT-4o | 128K | Advanced multimodal | Fast |
Configuration (config.csv):

```csv
name,value
llm-provider,azure
llm-model,gpt-4o
llm-url,https://YOUR_RESOURCE.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT/chat/completions?api-version=2024-02-15-preview
llm-key,YOUR_AZURE_API_KEY
```
Strengths:
- High enterprise compliance (HIPAA, SOC2)
- Azure VNet integration
- Guaranteed provisioned throughput available
### Cerebras
Ultra-fast inference powered by Wafer-Scale Engine hardware, specifically tuned for open-source models like Llama.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Llama 3.1 70B | 8K | High-speed general tasks | Extremely Fast |
Configuration (config.csv):

```csv
name,value
llm-provider,cerebras
llm-model,llama3.1-70b
llm-url,https://api.cerebras.ai/v1/chat/completions
llm-key,YOUR_CEREBRAS_API_KEY
```
Strengths:
- Highest tokens-per-second available
- Excellent for real-time agent loops
### Zhipu AI (GLM)
High-capability bilingual models (English/Chinese) directly competing with state-of-the-art global models.
| Model | Context | Best For | Speed |
|---|---|---|---|
| GLM-4 | 128K | General purpose | Medium |
| GLM-4-Long | 1M | Long document analysis | Medium |
Configuration (config.csv):

```csv
name,value
llm-provider,glm
llm-model,glm-4
llm-url,https://open.bigmodel.cn/api/paas/v4/chat/completions
llm-key,YOUR_ZHIPU_API_KEY
```
Strengths:
- Excellent bilingual performance
- Large context windows (up to 1M)
## Local Models
Run models on your own hardware for privacy, cost control, and offline operation.
### Setting Up Local LLM

General Bots uses the llama.cpp server for local inference:

```csv
name,value
llm-provider,local
llm-server-url,http://localhost:8081
llm-model,DeepSeek-R1-Distill-Qwen-1.5B
```
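
The configuration above expects a llama.cpp server already listening on port 8081. One way to start it (llama.cpp's `llama-server` binary; adjust the model path to wherever the installer placed the GGUF file):

```bash
# Serve a GGUF model over HTTP with llama.cpp.
# -c sets the context size; --n-gpu-layers offloads layers to the GPU.
llama-server \
  -m ./models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  --port 8081 \
  -c 8192 \
  --n-gpu-layers 99
```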
### Recommended Local Models

#### For High-End GPU (24GB+ VRAM)
| Model | Size | VRAM | Quality |
|---|---|---|---|
| Llama 4 Scout 17B Q8 | 18GB | 24GB | Excellent |
| Qwen3 72B Q4 | 42GB | 48GB+ | Excellent |
| DeepSeek-R1 32B Q4 | 20GB | 24GB | Very Good |
#### For Mid-Range GPU (12-16GB VRAM)
| Model | Size | VRAM | Quality |
|---|---|---|---|
| Qwen3 14B Q8 | 15GB | 16GB | Very Good |
| GPT-oss 20B Q4 | 12GB | 16GB | Very Good |
| DeepSeek-R1-Distill 14B Q4 | 8GB | 12GB | Good |
| Gemma 3 27B Q4 | 16GB | 16GB | Good |
#### For Small GPU or CPU (8GB VRAM or less)
| Model | Size | VRAM | Quality |
|---|---|---|---|
| DeepSeek-R1-Distill 1.5B Q4 | 1GB | 4GB | Basic |
| Gemma 2 9B Q4 | 5GB | 8GB | Acceptable |
| Gemma 3 27B Q2 | 10GB | 8GB | Acceptable |
### Model Download URLs

Add model URLs to `data_download_list` in `installer.rs`:

```rust
// Qwen3 14B: recommended for mid-range GPUs
"https://huggingface.co/Qwen/Qwen3-14B-GGUF/resolve/main/qwen3-14b-q4_k_m.gguf",
// DeepSeek-R1 Distill: for CPU or minimal GPU
"https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf",
// GPT-oss 20B: a good balance for agents
"https://huggingface.co/openai/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-q4_k_m.gguf",
// Gemma 3 27B: for quality local inference
"https://huggingface.co/google/gemma-3-27b-it-GGUF/resolve/main/gemma-3-27b-it-q4_k_m.gguf",
```
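
If you prefer to fetch a model manually rather than through the installer, a direct download works too (the target directory is illustrative):

```bash
# Download a GGUF model straight from Hugging Face.
curl -L -o ./models/qwen3-14b-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen3-14B-GGUF/resolve/main/qwen3-14b-q4_k_m.gguf"
```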
### Embedding Models

For vector search, you need an embedding model:

```csv
name,value
embedding-provider,local
embedding-server-url,http://localhost:8082
embedding-model,bge-small-en-v1.5
```
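
llama.cpp can serve embeddings as well; a second instance in embedding mode covers port 8082 (a sketch, assuming a GGUF build of the embedding model):

```bash
# Run a dedicated llama.cpp instance in embedding mode.
llama-server \
  -m ./models/bge-small-en-v1.5-f16.gguf \
  --embedding \
  --port 8082
```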
Recommended embedding models:
| Model | Dimensions | Size | Quality |
|---|---|---|---|
| bge-small-en-v1.5 | 384 | 130MB | Good |
| bge-base-en-v1.5 | 768 | 440MB | Better |
| bge-large-en-v1.5 | 1024 | 1.3GB | Best |
| nomic-embed-text | 768 | 550MB | Good |
## Hybrid Configuration

Use different models for different tasks:

```csv
name,value
llm-provider,anthropic
llm-model,claude-sonnet-4.5
llm-fast-provider,groq
llm-fast-model,llama-3.3-70b
llm-fallback-provider,local
llm-fallback-model,DeepSeek-R1-Distill-Qwen-1.5B
embedding-provider,local
embedding-model,bge-small-en-v1.5
```
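
Conceptually, this works as a tiered fallback: simple or latency-sensitive requests go to the fast provider, complex ones to the primary, and the local model keeps the bot online during outages. A Rust sketch of that decision (the enum and function are illustrative, not General Bots internals):

```rust
// Illustrative routing for the hybrid setup above.
enum Tier {
    Fast,     // llm-fast-provider (Groq)
    Primary,  // llm-provider (Anthropic)
    Fallback, // llm-fallback-provider (local)
}

fn pick_tier(is_simple: bool, fast_healthy: bool, primary_healthy: bool) -> Tier {
    match (is_simple, fast_healthy, primary_healthy) {
        (true, true, _) => Tier::Fast, // cheap, fast path for simple turns
        (_, _, true) => Tier::Primary, // complex reasoning goes to the main model
        _ => Tier::Fallback,           // local model catches everything else
    }
}
```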
## Model Selection Guide

### By Use Case
| Use Case | Recommended | Why |
|---|---|---|
| Customer support | Claude Sonnet 4.5 | Best at following guidelines |
| Code generation | DeepSeek-R1, Claude Sonnet 4.5 | Specialized for code |
| Document analysis | Gemini Pro | 2M context window |
| Real-time chat | Groq Llama 3.3 | Fastest responses |
| Privacy-sensitive | Local DeepSeek-R1 | No external data transfer |
| Cost-sensitive | DeepSeek, Local models | Lowest cost per token |
| Complex reasoning | Claude Opus, Gemini Pro | Best reasoning ability |
| Real-time research | Grok | Live data access |
| Long context | Gemini Pro, Claude | Largest context windows |
### By Budget
| Budget | Recommended Setup |
|---|---|
| Free | Local models only |
| Low ($10-50/mo) | Groq + Local fallback |
| Medium ($50-200/mo) | DeepSeek-V3.1 + Claude Sonnet 4.5 |
| High ($200+/mo) | GPT-5 + Claude Opus 4.5 |
| Enterprise | Private deployment + premium APIs |
## Configuration Reference

### config.csv Parameters

All LLM configuration belongs in config.csv, not environment variables:

| Parameter | Description | Example |
|---|---|---|
| llm-provider | Provider name | openai, anthropic, local |
| llm-model | Model identifier | gpt-5 |
| llm-url | API endpoint (cloud providers) | https://api.deepseek.com |
| llm-key | API key or Vault reference | vault:gbo/llm/openai/api_key |
| llm-server-url | API endpoint (local only) | http://localhost:8081 |
| llm-server-ctx-size | Context window size | 128000 |
| llm-temperature | Response randomness (0-2) | 0.7 |
| llm-max-tokens | Maximum response length | 4096 |
| llm-cache-enabled | Enable semantic caching | true |
| llm-cache-ttl | Cache time-to-live (seconds) | 3600 |
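
Putting several of these together, a complete cloud configuration might look like this (illustrative values):

```csv
name,value
llm-provider,anthropic
llm-model,claude-sonnet-4.5
llm-temperature,0.7
llm-max-tokens,4096
llm-cache-enabled,true
llm-cache-ttl,3600
```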
### API Keys

API keys are stored in Vault, not in config files or environment variables:

```bash
# Store API keys in Vault
vault kv put gbo/llm/openai api_key="sk-..."
vault kv put gbo/llm/anthropic api_key="sk-ant-..."
vault kv put gbo/llm/google api_key="AIza..."
```
Reference in config.csv:

```csv
name,value
llm-provider,openai
llm-model,gpt-5
llm-api-key,vault:gbo/llm/openai/api_key
```
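
You can confirm the secret is readable before starting the bot (standard Vault CLI):

```bash
# Verify the stored key is retrievable from Vault.
vault kv get -field=api_key gbo/llm/openai
```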
## Security Considerations

### Cloud Providers
- API keys stored in Vault, never in config files
- Consider data residency requirements (EU: Mistral)
- Review provider data retention policies
- Use separate keys for production/development
### Local Models
- All data stays on your infrastructure
- No internet required after model download
- Full control over model versions
- Consider GPU security for sensitive deployments
## Performance Optimization

### Caching

Enable semantic caching to reduce API calls:

```csv
name,value
llm-cache-enabled,true
llm-cache-ttl,3600
llm-cache-similarity-threshold,0.92
```
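
Under the hood, a semantic cache reuses a previous response when the new prompt's embedding is close enough to a cached one. A minimal Rust sketch of that check (illustrative only; General Bots' actual cache implementation may differ):

```rust
// Reuse a cached response when the cosine similarity between prompt
// embeddings crosses the configured threshold.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn lookup<'a>(
    query: &[f32],
    cache: &'a [(Vec<f32>, String)], // (embedding, cached response)
    threshold: f32,                  // e.g. llm-cache-similarity-threshold = 0.92
) -> Option<&'a str> {
    cache
        .iter()
        .find(|(emb, _)| cosine_similarity(query, emb) >= threshold)
        .map(|(_, response)| response.as_str())
}
```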
### Batching

For bulk operations, use batch APIs when available:

```csv
name,value
llm-batch-enabled,true
llm-batch-size,10
```
### Context Management

Optimize context window usage with episodic memory:

```csv
name,value
episodic-memory-enabled,true
episodic-memory-threshold,4
episodic-memory-history,2
episodic-memory-auto-summarize,true
```
See Episodic Memory for details.
## Troubleshooting

### Common Issues

#### API Key Invalid
- Verify key is stored correctly in Vault
- Check if key has required permissions
- Ensure billing is active on provider account
#### Model Not Found
- Check model name spelling
- Verify model is available in your region
- Some models require waitlist access
#### Rate Limits

- Implement exponential backoff (see the sketch after this list)
- Use caching to reduce calls
- Consider upgrading API tier
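
A retry loop with exponential backoff might look like this Rust sketch (the `call_llm` closure is a stand-in for your actual client call, not a General Bots API):

```rust
use std::{thread, time::Duration};

// Retry a fallible call with exponentially growing delays between attempts.
fn with_backoff<T, E>(mut call_llm: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    let max_retries = 5;
    for attempt in 0..max_retries {
        match call_llm() {
            Ok(response) => return Ok(response),
            Err(e) if attempt + 1 == max_retries => return Err(e),
            Err(_) => {
                thread::sleep(delay); // wait before retrying
                delay *= 2;           // double the delay each attempt
            }
        }
    }
    unreachable!("loop always returns within max_retries")
}
```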
#### Local Model Slow
- Check GPU memory usage
- Reduce context size
- Use quantized models (Q4 instead of F16)
### Logging

Enable LLM logging for debugging:

```csv
name,value
llm-log-requests,true
llm-log-responses,false
llm-log-timing,true
```
## 2025 Model Comparison
| Model | Creator | Type | Strengths |
|---|---|---|---|
| GPT-5 | OpenAI | Proprietary | Most advanced all-in-one |
| Claude Opus/Sonnet 4.5 | Anthropic | Proprietary | Extended thinking, complex reasoning |
| Gemini 1.5/3 Pro | Google | Proprietary | Benchmarks, reasoning, 2M context |
| Grok 4 | xAI | Proprietary | Real-time X data |
| Claude / Llama | Amazon Bedrock | Managed API | Enterprise AWS integration |
| GPT-4o / GPT-5 | Azure OpenAI | Managed API | Enterprise compliance, Azure VNet |
| Llama / Open Models | Cerebras | Hardware Cloud | Extreme inference speed |
| GLM-4 | Zhipu AI | Proprietary | English/Chinese bilingual, up to 1M context |
| DeepSeek-V3.1/R1 | DeepSeek | Open (MIT/Apache) | Cost-optimized, reasoning |
| Llama 4 | Meta | Open-weight | 10M context, multimodal |
| Qwen3 | Alibaba | Open (Apache) | Efficient MoE |
| Mixtral-8x22B | Mistral | Open (Apache) | Multi-language, coding |
| GPT-oss | OpenAI | Open (Apache) | Agent workflows |
| Gemma 2/3 | Google | Open-weight | Lightweight, efficient |
## Next Steps
- config.csv Reference — Complete configuration guide
- Secrets Management — Vault integration
- Semantic Caching — Cache configuration
- NVIDIA GPU Setup — GPU configuration for local models