LLM Configuration
Configuration for Language Model integration in BotServer, supporting both local GGUF models and external API services.
Local Model Configuration
BotServer is designed to work with local GGUF models by default. The minimal configuration requires only a few settings in your config.csv:
llm-key,none
llm-url,http://localhost:8081
llm-model,../../../../data/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf
Model Path
The llm-model parameter accepts relative paths like ../../../../data/llm/model.gguf, absolute paths like /opt/models/model.gguf, or model names when using external APIs like gpt-5.
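For example, each of these lines is a valid llm-model entry, depending on whether you load a local file or call an external API; the paths and model name are illustrative placeholders:
llm-model,../../../../data/llm/model.gguf
llm-model,/opt/models/model.gguf
llm-model,gpt-5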
Supported Model Formats
BotServer supports GGUF quantized models for CPU and GPU inference. Quantization levels such as Q3_K_M, Q4_K_M, and Q5_K_M reduce memory usage with acceptable quality trade-offs, while unquantized F16 (half precision) and F32 (full precision) files offer maximum quality at a much larger memory footprint.
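As an illustration, switching quantization levels is simply a matter of pointing llm-model at a different GGUF file; the file names below are hypothetical, so use the files you have actually downloaded:
llm-model,../../../../data/llm/model-Q3_K_M.gguf
llm-model,../../../../data/llm/model-Q5_K_M.gguf
llm-model,../../../../data/llm/model-F16.gguf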
LLM Server Configuration
Running Embedded Server
BotServer can run its own LLM server for local inference:
llm-server,true
llm-server-path,botserver-stack/bin/llm/build/bin
llm-server-host,0.0.0.0
llm-server-port,8081
Server Performance Parameters
Fine-tune server performance based on your hardware capabilities:
llm-server-gpu-layers,0
llm-server-ctx-size,4096
llm-server-n-predict,1024
llm-server-parallel,6
llm-server-cont-batching,true
| Parameter | Description | Impact |
|---|---|---|
| llm-server-gpu-layers | Layers to offload to GPU | 0 = CPU only, higher = more GPU |
| llm-server-ctx-size | Context window size | More context = more memory |
| llm-server-n-predict | Max tokens to generate | Limits response length |
| llm-server-parallel | Concurrent requests | Higher = more throughput |
| llm-server-cont-batching | Continuous batching | Improves multi-user performance |
Memory Management
Memory settings control how the model interacts with system RAM:
llm-server-mlock,false
llm-server-no-mmap,false
The mlock option locks the model in RAM to prevent it from being swapped out, which improves performance but requires enough free memory to hold the entire model. The no-mmap option disables memory mapping and reads the whole model into RAM at startup, which uses more memory up front but avoids page faults from disk during inference.
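For example, on a dedicated host with enough free RAM to hold the whole model, you might enable mlock; this is a sketch rather than a requirement, and both options can stay false if you are unsure about available memory:
llm-server-mlock,true
llm-server-no-mmap,false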
Cache Configuration
Basic Cache Settings
Caching reduces repeated LLM calls for identical inputs, significantly improving response times and reducing API costs:
llm-cache,false
llm-cache-ttl,3600
Semantic Cache
Semantic caching matches similar queries, not just identical ones, providing cache hits even when users phrase questions differently:
llm-cache-semantic,true
llm-cache-threshold,0.95
The threshold parameter controls how similar queries must be to trigger a cache hit. A value of 0.95 requires 95% similarity. Lower thresholds produce more cache hits but may return less accurate cached responses.
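For instance, a bot that receives many paraphrased versions of the same questions might tolerate a slightly looser threshold; the value below is illustrative and should be tuned against your own traffic:
llm-cache,true
llm-cache-semantic,true
llm-cache-threshold,0.90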
External API Configuration
Groq and OpenAI-Compatible APIs
For cloud inference, Groq provides low-latency endpoints that follow the OpenAI API format:
llm-key,gsk_your-groq-api-key
llm-url,https://api.groq.com/openai/v1
llm-model,mixtral-8x7b-32768
Local API Servers
When running your own inference server or using another local service:
llm-key,none
llm-url,http://localhost:8081
llm-model,local-model-name
Configuration Examples
Minimal Local Setup
The simplest configuration for getting started with local models:
name,value
llm-url,http://localhost:8081
llm-model,../../../../data/llm/model.gguf
High-Performance Local
Optimized for maximum throughput on capable hardware:
name,value
llm-server,true
llm-server-gpu-layers,32
llm-server-ctx-size,8192
llm-server-parallel,8
llm-server-cont-batching,true
llm-cache,true
llm-cache-semantic,true
Low-Resource Setup
Configured for systems with limited RAM or CPU:
name,value
llm-server-ctx-size,2048
llm-server-n-predict,512
llm-server-parallel,2
llm-cache,false
llm-server-mlock,false
External API
Using a cloud provider for inference:
name,value
llm-key,sk-...
llm-url,https://api.anthropic.com
llm-model,claude-sonnet-4.5
llm-cache,true
llm-cache-ttl,7200
Performance Tuning
For Responsiveness
When response speed is the priority, decrease llm-server-ctx-size and llm-server-n-predict to reduce processing time. Enable both llm-cache and llm-cache-semantic to serve repeated queries instantly.
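A latency-oriented configuration might look like this; the values are illustrative starting points rather than prescriptions:
llm-server-ctx-size,2048
llm-server-n-predict,256
llm-cache,true
llm-cache-semantic,true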
For Quality
When output quality matters most, increase llm-server-ctx-size and llm-server-n-predict to give the model more context and generation headroom. Use less aggressively quantized models such as Q5_K_M, or unquantized F16, for better accuracy. Either disable semantic caching entirely or raise the threshold to avoid returning imprecise cached responses.
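A quality-oriented configuration might look like this; the Q5_K_M file name is a placeholder for whichever higher-precision model you use:
llm-server-ctx-size,8192
llm-server-n-predict,2048
llm-model,../../../../data/llm/model-Q5_K_M.gguf
llm-cache-semantic,false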
For Multiple Users
Supporting concurrent users requires enabling llm-server-cont-batching and increasing llm-server-parallel to handle multiple requests simultaneously. Enable caching to reduce redundant inference calls. If available, GPU offloading significantly improves throughput under load.
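A multi-user configuration could combine these settings; the values are a rough sketch, and llm-server-parallel should be scaled with your CPU or GPU capacity:
llm-server-cont-batching,true
llm-server-parallel,8
llm-server-gpu-layers,32
llm-cache,true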
Model Selection Guidelines
Small Models (1-3B parameters)
Small models like DeepSeek-R1-Distill-Qwen-1.5B deliver fast responses with low memory usage. They work well for simple tasks, quick interactions, and resource-constrained environments.
Medium Models (7-13B parameters)
Medium-sized models such as Llama-2-7B and Mistral-7B provide balanced performance suitable for general-purpose applications. They require moderate memory but handle a wide range of tasks competently.
Large Models (30B+ parameters)
Large models like Llama-2-70B and Mixtral-8x7B offer the best quality for complex reasoning tasks. They require substantial memory and compute resources but excel at nuanced understanding and generation.
Troubleshooting
Model Won’t Load
If the model fails to load, first verify the file path exists and is accessible. Check that your system has sufficient RAM for the model size. Ensure the GGUF file version is compatible with your llama.cpp build.
Slow Responses
Slow generation typically indicates resource constraints. Reduce context size, enable caching to avoid redundant inference, use GPU offloading if hardware permits, or switch to a smaller quantized model.
Out of Memory
Memory errors require reducing resource consumption. Lower llm-server-ctx-size and llm-server-parallel values. Switch to more aggressively quantized models (Q3 instead of Q5). Disable llm-server-mlock to allow the OS to manage memory more flexibly.
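A reduced-footprint configuration under memory pressure might look like this; the Q3_K_M file name is a placeholder for whichever smaller quantization you have available:
llm-server-ctx-size,2048
llm-server-parallel,1
llm-server-mlock,false
llm-model,../../../../data/llm/model-Q3_K_M.gguf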
Connection Refused
Connection errors usually indicate server configuration issues. Verify llm-server is set to true if expecting BotServer to run the server. Check that the configured port is not already in use by another process. Ensure firewall rules allow connections on the specified port.
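If BotServer is expected to manage the server itself, confirm that the embedded-server settings match the URL the client uses; this pairing is illustrative, with host and port adjusted to your environment:
llm-server,true
llm-server-host,0.0.0.0
llm-server-port,8081
llm-url,http://localhost:8081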
Best Practices
Start with smaller models and scale up only as needed, since larger models consume more resources without always providing proportionally better results. Enable caching for any production deployment to reduce costs and improve response times. Monitor RAM usage during operation to catch memory pressure before it causes problems. Test model responses thoroughly before deploying to production to ensure quality meets requirements. Document which models you’re using and their performance characteristics. Track changes to your config.csv in version control to maintain a history of configuration adjustments.