
LLM Configuration

This page describes how to configure Language Model (LLM) integration in BotServer, which supports both local GGUF models and external API services.

Local Model Configuration

BotServer is designed to work with local GGUF models by default. The minimal configuration requires only a few settings in your config.csv:

llm-key,none
llm-url,http://localhost:8081
llm-model,../../../../data/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf

Model Path

The llm-model parameter accepts relative paths like ../../../../data/llm/model.gguf, absolute paths like /opt/models/model.gguf, or model names when using external APIs like gpt-5.
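
For illustration, the same parameter written in each of these three forms (the absolute path and the API model name are placeholders, not files shipped with BotServer):

llm-model,../../../../data/llm/model.gguf
llm-model,/opt/models/model.gguf
llm-model,gpt-5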

Supported Model Formats

BotServer supports GGUF quantized models for CPU and GPU inference. Quantization levels include Q3_K_M, Q4_K_M, and Q5_K_M for reduced memory usage with acceptable quality trade-offs, while F16 and F32 provide full precision for maximum quality.
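
The quantization level is typically encoded in the GGUF filename, so choosing a level usually means pointing llm-model at the corresponding file. A hypothetical example for a Q5_K_M build of the same model:

llm-model,../../../../data/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf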

LLM Server Configuration

Running Embedded Server

BotServer can run its own LLM server for local inference:

llm-server,true
llm-server-path,botserver-stack/bin/llm/build/bin
llm-server-host,0.0.0.0
llm-server-port,8081

Server Performance Parameters

Fine-tune server performance based on your hardware capabilities:

llm-server-gpu-layers,0
llm-server-ctx-size,4096
llm-server-n-predict,1024
llm-server-parallel,6
llm-server-cont-batching,true

llm-server-gpu-layers: layers to offload to the GPU (0 = CPU only; higher values use more GPU).
llm-server-ctx-size: context window size (a larger context uses more memory).
llm-server-n-predict: maximum tokens to generate (limits response length).
llm-server-parallel: number of concurrent requests (higher values increase throughput).
llm-server-cont-batching: continuous batching (improves multi-user performance).

Memory Management

Memory settings control how the model interacts with system RAM:

llm-server-mlock,false
llm-server-no-mmap,false

The mlock option locks the model in RAM to prevent swapping, which improves performance but requires sufficient memory. The no-mmap option disables memory mapping and loads the entire model into RAM, using more memory but potentially improving access patterns.
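
As a sketch, a machine with enough free RAM to hold the entire model might lock it in memory while keeping memory mapping enabled:

llm-server-mlock,true
llm-server-no-mmap,false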

Cache Configuration

Basic Cache Settings

Caching reduces repeated LLM calls for identical inputs, significantly improving response times and reducing API costs:

llm-cache,false
llm-cache-ttl,3600
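
To turn caching on, for example with the same one-hour TTL:

llm-cache,true
llm-cache-ttl,3600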

Semantic Cache

Semantic caching matches similar queries, not just identical ones, providing cache hits even when users phrase questions differently:

llm-cache-semantic,true
llm-cache-threshold,0.95

The threshold parameter controls how similar queries must be to trigger a cache hit. A value of 0.95 requires 95% similarity. Lower thresholds produce more cache hits but may return less accurate cached responses.
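
As an illustration, a deployment that prefers more cache hits over strict matching might relax the threshold slightly (0.90 is an arbitrary example value):

llm-cache-semantic,true
llm-cache-threshold,0.90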

External API Configuration

Groq and OpenAI-Compatible APIs

For cloud inference, Groq exposes an OpenAI-compatible endpoint and is known for very low-latency responses:

llm-key,gsk-your-groq-api-key
llm-url,https://api.groq.com/openai/v1
llm-model,mixtral-8x7b-32768

Local API Servers

When running your own inference server or using another local service:

llm-key,none
llm-url,http://localhost:8081
llm-model,local-model-name

Configuration Examples

Minimal Local Setup

The simplest configuration for getting started with local models:

name,value
llm-url,http://localhost:8081
llm-model,../../../../data/llm/model.gguf

High-Performance Local

Optimized for maximum throughput on capable hardware:

name,value
llm-server,true
llm-server-gpu-layers,32
llm-server-ctx-size,8192
llm-server-parallel,8
llm-server-cont-batching,true
llm-cache,true
llm-cache-semantic,true

Low-Resource Setup

Configured for systems with limited RAM or CPU:

name,value
llm-server-ctx-size,2048
llm-server-n-predict,512
llm-server-parallel,2
llm-cache,false
llm-server-mlock,false

External API

Using a cloud provider for inference:

name,value
llm-key,sk-...
llm-url,https://api.anthropic.com
llm-model,claude-sonnet-4.5
llm-cache,true
llm-cache-ttl,7200

Performance Tuning

For Responsiveness

When response speed is the priority, decrease llm-server-ctx-size and llm-server-n-predict to reduce processing time. Enable both llm-cache and llm-cache-semantic to serve repeated queries instantly.
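
A sketch of these adjustments (the values are illustrative, not recommendations):

llm-server-ctx-size,2048
llm-server-n-predict,256
llm-cache,true
llm-cache-semantic,true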

For Quality

When output quality matters most, increase llm-server-ctx-size and llm-server-n-predict to give the model more context and generation headroom. Use less aggressively quantized models such as Q5_K_M, or full-precision F16, for better accuracy. Either disable the semantic cache entirely or raise the threshold to avoid returning imprecise cached responses.
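
A corresponding sketch (again with illustrative values):

llm-server-ctx-size,8192
llm-server-n-predict,2048
llm-cache-semantic,false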

For Multiple Users

Supporting concurrent users requires enabling llm-server-cont-batching and increasing llm-server-parallel to handle multiple requests simultaneously. Enable caching to reduce redundant inference calls. If available, GPU offloading significantly improves throughput under load.
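
For example (sized to the hardware at hand, so treat these numbers as placeholders):

llm-server-cont-batching,true
llm-server-parallel,8
llm-server-gpu-layers,32
llm-cache,true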

Model Selection Guidelines

Small Models (1-3B parameters)

Small models like DeepSeek-R1-Distill-Qwen-1.5B deliver fast responses with low memory usage. They work well for simple tasks, quick interactions, and resource-constrained environments.

Medium Models (7-13B parameters)

Medium-sized models such as Llama-2-7B and Mistral-7B provide balanced performance suitable for general-purpose applications. They require moderate memory but handle a wide range of tasks competently.

Large Models (30B+ parameters)

Large models like Llama-2-70B and Mixtral-8x7B offer the best quality for complex reasoning tasks. They require substantial memory and compute resources but excel at nuanced understanding and generation.
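
For illustration, the llm-model entry for each tier might look like the following; the 7B and 8x7B filenames are hypothetical placeholders, while the 1.5B path matches the earlier example:

llm-model,../../../../data/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf
llm-model,../../../../data/llm/Mistral-7B-Instruct-Q4_K_M.gguf
llm-model,../../../../data/llm/Mixtral-8x7B-Instruct-Q4_K_M.gguf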

Troubleshooting

Model Won’t Load

If the model fails to load, first verify the file path exists and is accessible. Check that your system has sufficient RAM for the model size. Ensure the GGUF file version is compatible with your llama.cpp build.

Slow Responses

Slow generation typically indicates resource constraints. Reduce context size, enable caching to avoid redundant inference, use GPU offloading if hardware permits, or switch to a smaller quantized model.

Out of Memory

Memory errors require reducing resource consumption. Lower llm-server-ctx-size and llm-server-parallel values. Switch to more aggressively quantized models (Q3 instead of Q5). Disable llm-server-mlock to allow the OS to manage memory more flexibly.
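
A reduced-memory sketch combining these suggestions (illustrative values, reusing the Q3_K_M model file from the earlier example):

llm-server-ctx-size,2048
llm-server-parallel,2
llm-server-mlock,false
llm-model,../../../../data/llm/DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf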

Connection Refused

Connection errors usually indicate server configuration issues. Verify llm-server is set to true if expecting BotServer to run the server. Check that the configured port is not already in use by another process. Ensure firewall rules allow connections on the specified port.

Best Practices

Start with smaller models and scale up only as needed, since larger models consume more resources without always providing proportionally better results. Enable caching for any production deployment to reduce costs and improve response times. Monitor RAM usage during operation to catch memory pressure before it causes problems. Test model responses thoroughly before deploying to production to ensure quality meets requirements. Document which models you’re using and their performance characteristics. Track changes to your config.csv in version control to maintain a history of configuration adjustments.