LLM Providers

General Bots supports multiple Large Language Model (LLM) providers, both cloud-based services and local deployments. This guide helps you choose the right provider for your use case.

Overview

LLMs are the intelligence behind General Bots’ conversational capabilities. You can configure:

  • Cloud Providers — External APIs (OpenAI, Anthropic, Google, etc.)
  • Local Models — Self-hosted models via llama.cpp
  • Hybrid — Use local for simple tasks, cloud for complex reasoning

Cloud Providers

OpenAI (GPT Series)

The most widely known LLM provider, offering the GPT-5 flagship model.

| Model | Context | Best For | Speed |
|---|---|---|---|
| GPT-5 | 1M | All-in-one advanced reasoning | Medium |
| GPT-oss 120B | 128K | Open-weight, agent workflows | Medium |
| GPT-oss 20B | 128K | Cost-effective open-weight | Fast |

Configuration (config.csv):

name,value
llm-provider,openai
llm-model,gpt-5

Strengths:

  • Most advanced all-in-one model
  • Excellent general knowledge
  • Strong code generation
  • Good instruction following

Considerations:

  • API costs can add up
  • Data sent to external servers
  • Rate limits apply

Anthropic (Claude Series)

Known for safety, helpfulness, and extended thinking capabilities.

| Model | Context | Best For | Speed |
|---|---|---|---|
| Claude Opus 4.5 | 200K | Most capable, complex reasoning | Slow |
| Claude Sonnet 4.5 | 200K | Best balance of capability/speed | Fast |

Configuration (config.csv):

name,value
llm-provider,anthropic
llm-model,claude-sonnet-4.5

Strengths:

  • Extended thinking mode for multi-step tasks
  • Excellent at following complex instructions
  • Strong coding abilities
  • Better at refusing harmful requests

Considerations:

  • Premium pricing
  • Newer provider, smaller ecosystem

Google (Gemini & Vertex AI)

Google’s multimodal AI models with strong reasoning capabilities. General Bots natively supports both the public AI Studio API and Enterprise Vertex AI.

| Model | Context | Best For | Speed |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | Complex reasoning, benchmarks | Medium |
| Gemini 1.5 Flash | 1M | Fast multimodal tasks | Fast |

Configuration for AI Studio (Public API):

name,value
llm-provider,google
llm-model,gemini-1.5-pro
llm-url,https://generativelanguage.googleapis.com
llm-key,AIza...

Configuration for Vertex AI (Enterprise):

name,value
llm-provider,vertex
llm-model,gemini-1.5-pro
llm-url,https://us-central1-aiplatform.googleapis.com
llm-key,~/.vertex.json

Note: General Bots handles Google OAuth2 JWT authentication internally if you provide the path to, or the raw JSON of, a Service Account key.

Strengths:

  • Largest context window (2M tokens)
  • Native multimodal (text, image, video, audio)
  • Vertex AI support enables enterprise VPC/IAM integration

Considerations:

  • Different endpoints for public vs enterprise deployments

xAI (Grok Series)

xAI's Grok models integrate real-time data from the X platform.

| Model | Context | Best For | Speed |
|---|---|---|---|
| Grok 4 | 128K | Real-time research, analysis | Fast |

Configuration (config.csv):

name,value
llm-provider,xai
llm-model,grok-4

Strengths:

  • Real-time data access from X
  • Strong research and analysis
  • Good for trend analysis

Considerations:

  • Newer provider
  • X platform integration focus

Groq

Ultra-fast inference using custom LPU hardware. Offers open-source models at high speed.

| Model | Context | Best For | Speed |
|---|---|---|---|
| Llama 4 Scout | 10M | Long context, multimodal | Very Fast |
| Llama 4 Maverick | 1M | Complex tasks | Very Fast |
| Qwen3 | 128K | Efficient MoE architecture | Extremely Fast |

Configuration (config.csv):

name,value
llm-provider,groq
llm-model,llama-4-scout

Strengths:

  • Fastest inference speeds (500+ tokens/sec)
  • Competitive pricing
  • Open-source models
  • Great for real-time applications

Considerations:

  • Rate limits on free tier
  • Models may be less capable than GPT-5/Claude

Mistral AI

European AI company offering efficient, open-weight models.

| Model | Context | Best For | Speed |
|---|---|---|---|
| Mixtral-8x22B | 64K | Multi-language, coding | Fast |

Configuration (config.csv):

name,value
llm-provider,mistral
llm-model,mixtral-8x22b

Strengths:

  • European data sovereignty (GDPR)
  • Excellent code generation
  • Open-weight models available
  • Competitive pricing
  • Proficient in multiple languages

Considerations:

  • Smaller context than competitors
  • Less brand recognition

DeepSeek

Known for efficient, capable models with exceptional reasoning.

| Model | Context | Best For | Speed |
|---|---|---|---|
| DeepSeek-V3.1 | 128K | General purpose, optimized cost | Fast |
| DeepSeek-R1 | 128K | Reasoning, math, science | Medium |

Configuration (config.csv):

name,value
llm-provider,deepseek
llm-model,deepseek-r1
llm-url,https://api.deepseek.com

Strengths:

  • Extremely cost-effective
  • Strong reasoning (R1 model)
  • Rivals proprietary leaders in performance
  • Open-weight versions available (MIT/Apache 2.0)

Considerations:

  • Data processed in China
  • Newer provider

Amazon Bedrock

AWS managed service for foundation models, supporting Claude, Llama, Titan, and others.

| Model | Context | Best For | Speed |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K | High capability tasks | Fast |
| Llama 3.1 70B | 128K | Open-weight performance | Fast |

Configuration (config.csv):

name,value
llm-provider,bedrock
llm-model,anthropic.claude-3-5-sonnet-20240620-v1:0
llm-url,https://bedrock-runtime.us-east-1.amazonaws.com/model/anthropic.claude-3-5-sonnet-20240620-v1:0/invoke
llm-key,YOUR_BEDROCK_API_KEY

Strengths:

  • Native AWS integration
  • Enterprise-grade security
  • Multiple model families in one API

Azure OpenAI

Enterprise-grade deployment of OpenAI models hosted on Microsoft Azure.

| Model | Context | Best For | Speed |
|---|---|---|---|
| GPT-4o | 128K | Advanced multimodal | Fast |

Configuration (config.csv):

name,value
llm-provider,azure
llm-model,gpt-4o
llm-url,https://YOUR_RESOURCE.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT/chat/completions?api-version=2024-02-15-preview
llm-key,YOUR_AZURE_API_KEY

Strengths:

  • High enterprise compliance (HIPAA, SOC2)
  • Azure VNet integration
  • Guaranteed provisioned throughput available

Cerebras

Ultra-fast inference powered by Wafer-Scale Engine hardware, specifically tuned for open-source models like Llama.

| Model | Context | Best For | Speed |
|---|---|---|---|
| Llama 3.1 70B | 8K | High-speed general tasks | Extremely Fast |

Configuration (config.csv):

name,value
llm-provider,cerebras
llm-model,llama3.1-70b
llm-url,https://api.cerebras.ai/v1/chat/completions
llm-key,YOUR_CEREBRAS_API_KEY

Strengths:

  • Highest tokens-per-second available
  • Excellent for real-time agent loops

Zhipu AI (GLM)

High-capability bilingual models (English/Chinese) directly competing with state-of-the-art global models.

| Model | Context | Best For | Speed |
|---|---|---|---|
| GLM-4 | 128K | General purpose | Medium |
| GLM-4-Long | 1M | Long document analysis | Medium |

Configuration (config.csv):

name,value
llm-provider,glm
llm-model,glm-4
llm-url,https://open.bigmodel.cn/api/paas/v4/chat/completions
llm-key,YOUR_ZHIPU_API_KEY

Strengths:

  • Excellent bilingual performance
  • Large context windows (up to 1M)

Local Models

Run models on your own hardware for privacy, cost control, and offline operation.

Setting Up Local LLM

General Bots uses llama.cpp server for local inference:

name,value
llm-provider,local
llm-server-url,http://localhost:8081
llm-model,DeepSeek-R1-Distill-Qwen-1.5B
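The llama.cpp server exposes an OpenAI-compatible chat endpoint, so the bot only needs to POST JSON to llm-server-url. A minimal sketch of such a request, using only the Python standard library (the helper name is illustrative; the endpoint path follows the llama.cpp server API):

```python
import json
import urllib.request

def build_local_request(server_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completion request for a local llama.cpp server.

    llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint.
    """
    payload = {
        "model": model,  # e.g. DeepSeek-R1-Distill-Qwen-1.5B
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{server_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_local_request("http://localhost:8081", "DeepSeek-R1-Distill-Qwen-1.5B", "Hello")
print(req.full_url)  # http://localhost:8081/v1/chat/completions
```

Sending the request (with `urllib.request.urlopen(req)`) requires the server to be running on the configured port.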

For High-End GPU (24GB+ VRAM)

| Model | Size | VRAM | Quality |
|---|---|---|---|
| Llama 4 Scout 17B Q8 | 18GB | 24GB | Excellent |
| Qwen3 72B Q4 | 42GB | 48GB+ | Excellent |
| DeepSeek-R1 32B Q4 | 20GB | 24GB | Very Good |

For Mid-Range GPU (12-16GB VRAM)

| Model | Size | VRAM | Quality |
|---|---|---|---|
| Qwen3 14B Q8 | 15GB | 16GB | Very Good |
| GPT-oss 20B Q4 | 12GB | 16GB | Very Good |
| DeepSeek-R1-Distill 14B Q4 | 8GB | 12GB | Good |
| Gemma 3 27B Q4 | 16GB | 16GB | Good |

For Small GPU or CPU (8GB VRAM or less)

| Model | Size | VRAM | Quality |
|---|---|---|---|
| DeepSeek-R1-Distill 1.5B Q4 | 1GB | 4GB | Basic |
| Gemma 2 9B Q4 | 5GB | 8GB | Acceptable |
| Gemma 3 27B Q2 | 10GB | 8GB | Acceptable |
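The file sizes in these tables roughly follow parameter count × bits-per-weight ÷ 8. A back-of-envelope estimator (approximation only — real GGUF files carry metadata and mixed-precision layers, so actual sizes run somewhat higher):

```python
def gguf_size_gb(params_billion: float, bits: int) -> float:
    """Approximate GGUF file size in GB: parameters * bits-per-weight / 8."""
    return params_billion * bits / 8

# A 14B model at Q8 (~8 bits/weight) is about 14 GB on disk;
# the same math puts a 1.5B model at Q4 around 0.75 GB.
print(gguf_size_gb(14, 8))   # 14.0
print(gguf_size_gb(1.5, 4))  # 0.75
```

Add roughly 10-30% on top of the file size for KV cache and activations when budgeting VRAM.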

Model Download URLs

Add models to installer.rs data_download_list:

// In installer.rs — append URLs to the existing data_download_list
// (shown here as a plain slice; match the list's actual structure):
let data_download_list = [
    // Qwen3 14B - recommended for mid-range GPUs
    "https://huggingface.co/Qwen/Qwen3-14B-GGUF/resolve/main/qwen3-14b-q4_k_m.gguf",
    // DeepSeek-R1 Distill - for CPU or minimal GPU
    "https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf",
    // GPT-oss 20B - good balance for agents
    "https://huggingface.co/openai/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-q4_k_m.gguf",
    // Gemma 3 27B - for quality local inference
    "https://huggingface.co/google/gemma-3-27b-it-GGUF/resolve/main/gemma-3-27b-it-q4_k_m.gguf",
];

Embedding Models

For vector search, you need an embedding model:

name,value
embedding-provider,local
embedding-server-url,http://localhost:8082
embedding-model,bge-small-en-v1.5

Recommended embedding models:

| Model | Dimensions | Size | Quality |
|---|---|---|---|
| bge-small-en-v1.5 | 384 | 130MB | Good |
| bge-base-en-v1.5 | 768 | 440MB | Better |
| bge-large-en-v1.5 | 1024 | 1.3GB | Best |
| nomic-embed-text | 768 | 550MB | Good |
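Vector search compares these embeddings by cosine similarity. A minimal sketch of the comparison step (pure Python on toy 3-dimensional vectors; real bge-small embeddings have 384 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.9, 0.2]
doc_a = [0.1, 0.8, 0.3]  # semantically close to the query
doc_b = [0.9, 0.1, 0.0]  # unrelated

assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```

In production the heavy lifting is done by the vector store; this only illustrates the metric being ranked on.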

Hybrid Configuration

Use different models for different tasks:

name,value
llm-provider,anthropic
llm-model,claude-sonnet-4.5
llm-fast-provider,groq
llm-fast-model,llama-3.3-70b
llm-fallback-provider,local
llm-fallback-model,DeepSeek-R1-Distill-Qwen-1.5B
embedding-provider,local
embedding-model,bge-small-en-v1.5

Model Selection Guide

By Use Case

| Use Case | Recommended | Why |
|---|---|---|
| Customer support | Claude Sonnet 4.5 | Best at following guidelines |
| Code generation | DeepSeek-R1, Claude Sonnet 4.5 | Specialized for code |
| Document analysis | Gemini Pro | 2M context window |
| Real-time chat | Groq Llama 3.3 | Fastest responses |
| Privacy-sensitive | Local DeepSeek-R1 | No external data transfer |
| Cost-sensitive | DeepSeek, Local models | Lowest cost per token |
| Complex reasoning | Claude Opus, Gemini Pro | Best reasoning ability |
| Real-time research | Grok | Live data access |
| Long context | Gemini Pro, Claude | Largest context windows |

By Budget

| Budget | Recommended Setup |
|---|---|
| Free | Local models only |
| Low ($10-50/mo) | Groq + Local fallback |
| Medium ($50-200/mo) | DeepSeek-V3.1 + Claude Sonnet 4.5 |
| High ($200+/mo) | GPT-5 + Claude Opus 4.5 |
| Enterprise | Private deployment + premium APIs |

Configuration Reference

config.csv Parameters

All LLM configuration belongs in config.csv, not environment variables:

| Parameter | Description | Example |
|---|---|---|
| llm-provider | Provider name | openai, anthropic, local |
| llm-model | Model identifier | gpt-5 |
| llm-server-url | API endpoint (local only) | http://localhost:8081 |
| llm-server-ctx-size | Context window size | 128000 |
| llm-temperature | Response randomness (0-2) | 0.7 |
| llm-max-tokens | Maximum response length | 4096 |
| llm-cache-enabled | Enable semantic caching | true |
| llm-cache-ttl | Cache time-to-live (seconds) | 3600 |
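Since config.csv is a plain two-column name,value file, loading it takes only a few lines. A sketch using Python's csv module (the helper name is illustrative):

```python
import csv
import io

def load_config(text: str) -> dict[str, str]:
    """Parse name,value rows from a config.csv into a dict (last value wins)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["name"]: row["value"] for row in reader}

config = load_config(
    "name,value\n"
    "llm-provider,openai\n"
    "llm-model,gpt-5\n"
    "llm-temperature,0.7\n"
)
assert config["llm-model"] == "gpt-5"
```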

API Keys

API keys are stored in Vault, not in config files or environment variables:

# Store API key in Vault
vault kv put gbo/llm/openai api_key="sk-..."
vault kv put gbo/llm/anthropic api_key="sk-ant-..."
vault kv put gbo/llm/google api_key="AIza..."

Reference in config.csv:

name,value
llm-provider,openai
llm-model,gpt-5
llm-api-key,vault:gbo/llm/openai/api_key

Security Considerations

Cloud Providers

  • API keys stored in Vault, never in config files
  • Consider data residency requirements (EU: Mistral)
  • Review provider data retention policies
  • Use separate keys for production/development

Local Models

  • All data stays on your infrastructure
  • No internet required after model download
  • Full control over model versions
  • Consider GPU security for sensitive deployments

Performance Optimization

Caching

Enable semantic caching to reduce API calls:

name,value
llm-cache-enabled,true
llm-cache-ttl,3600
llm-cache-similarity-threshold,0.92

Batching

For bulk operations, use batch APIs when available:

name,value
llm-batch-enabled,true
llm-batch-size,10
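Batching groups prompts into chunks of llm-batch-size before submission. The chunking itself is simple; a sketch with the size of 10 configured above:

```python
def batched(items: list[str], size: int = 10) -> list[list[str]]:
    """Split a list of prompts into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

prompts = [f"prompt-{i}" for i in range(23)]
batches = batched(prompts, size=10)
assert [len(b) for b in batches] == [10, 10, 3]
```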

Context Management

Optimize context window usage with episodic memory:

name,value
episodic-memory-enabled,true
episodic-memory-threshold,4
episodic-memory-history,2
episodic-memory-auto-summarize,true

See Episodic Memory for details.
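The settings above mean: once a conversation exceeds episodic-memory-threshold turns, keep only the last episodic-memory-history turns verbatim and summarize the rest. A sketch of that trimming step (the summarize call is stubbed; in General Bots the auto-summarize option would drive an LLM call):

```python
THRESHOLD = 4  # episodic-memory-threshold
HISTORY = 2    # episodic-memory-history

def summarize(turns: list[str]) -> str:
    """Stub standing in for the auto-summarize LLM call."""
    return f"[summary of {len(turns)} earlier turns]"

def trim_history(turns: list[str]) -> list[str]:
    """Collapse older turns into one summary once the threshold is exceeded."""
    if len(turns) <= THRESHOLD:
        return turns
    old, recent = turns[:-HISTORY], turns[-HISTORY:]
    return [summarize(old)] + recent

turns = ["t1", "t2", "t3", "t4", "t5", "t6"]
assert trim_history(turns) == ["[summary of 4 earlier turns]", "t5", "t6"]
assert trim_history(["t1", "t2"]) == ["t1", "t2"]
```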

Troubleshooting

Common Issues

API Key Invalid

  • Verify key is stored correctly in Vault
  • Check if key has required permissions
  • Ensure billing is active on provider account

Model Not Found

  • Check model name spelling
  • Verify model is available in your region
  • Some models require waitlist access

Rate Limits

  • Implement exponential backoff
  • Use caching to reduce calls
  • Consider upgrading API tier
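Exponential backoff doubles the wait after each rate-limited attempt up to a cap. A sketch of the delay schedule (the base and cap values are illustrative defaults, not General Bots built-ins):

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Seconds to wait before each retry: base * 2^n, capped at `cap`."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]

print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

In practice, add random jitter to each delay so that many clients do not retry in lockstep.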

Local Model Slow

  • Check GPU memory usage
  • Reduce context size
  • Use quantized models (Q4 instead of F16)

Logging

Enable LLM logging for debugging:

name,value
llm-log-requests,true
llm-log-responses,false
llm-log-timing,true

2025 Model Comparison

| Model | Creator | Type | Strengths |
|---|---|---|---|
| GPT-5 | OpenAI | Proprietary | Most advanced all-in-one |
| Claude Opus/Sonnet 4.5 | Anthropic | Proprietary | Extended thinking, complex reasoning |
| Gemini 1.5/3 Pro | Google | Proprietary | Benchmarks, reasoning, 2M context |
| Grok 4 | xAI | Proprietary | Real-time X data |
| Claude / Llama | Amazon Bedrock | Managed API | Enterprise AWS integration |
| GPT-4o / GPT-5 | Azure OpenAI | Managed API | Enterprise compliance, Azure VNet |
| Llama / Open Models | Cerebras | Hardware Cloud | Extreme inference speed |
| GLM-4 | Zhipu AI | Proprietary | English/Chinese bilingual, up to 1M context |
| DeepSeek-V3.1/R1 | DeepSeek | Open (MIT/Apache) | Cost-optimized, reasoning |
| Llama 4 | Meta | Open-weight | 10M context, multimodal |
| Qwen3 | Alibaba | Open (Apache) | Efficient MoE |
| Mixtral-8x22B | Mistral | Open (Apache) | Multi-language, coding |
| GPT-oss | OpenAI | Open (Apache) | Agent workflows |
| Gemma 2/3 | Google | Open-weight | Lightweight, efficient |

Next Steps