NVIDIA GPU Setup for LXC Containers

This guide covers setting up NVIDIA GPU passthrough for BotServer running in LXC containers, enabling hardware acceleration for local LLM inference.

Prerequisites

  • CUDA-capable NVIDIA GPU (RTX 3060 or better with 12GB+ VRAM recommended)
  • NVIDIA drivers installed on the host system
  • LXD/LXC installed

LXD Configuration (Interactive Setup)

When initializing LXD, use these settings:

sudo lxd init

Answer the prompts as follows:

  • Would you like to use LXD clustering? no
  • Do you want to configure a new storage pool? no (will create /generalbots later)
  • Would you like to connect to a MAAS server? no
  • Would you like to create a new local network bridge? yes
  • What should the new bridge be called? lxdbr0
  • What IPv4 address should be used? auto
  • What IPv6 address should be used? auto
  • Would you like the LXD server to be available over the network? no
  • Would you like stale cached images to be updated automatically? no
  • Would you like a YAML “lxd init” preseed to be printed? no

Storage Configuration

  • Storage backend name: default
  • Storage backend driver: zfs
  • Create a new ZFS pool? yes
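
The same answers can also be supplied non-interactively as a preseed. A minimal sketch, mirroring the interactive answers above (verify the fields against your LXD version before relying on it):

cat <<'EOF' | sudo lxd init --preseed
networks:
- name: lxdbr0
  type: bridge
  config:
    ipv4.address: auto
    ipv6.address: auto
storage_pools:
- name: default
  driver: zfs
profiles:
- name: default
  devices:
    root:
      path: /
      pool: default
      type: disk
    eth0:
      name: eth0
      network: lxdbr0
      type: nic
EOF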

NVIDIA GPU Configuration

On the Host System

Create a GPU profile and attach it to your container:

# Create GPU profile
lxc profile create gpu

# Add a GPU device named "gpu" of type "gpu" to the profile
lxc profile device add gpu gpu gpu gputype=physical

# Apply the GPU profile to your container (here named gb-system)
lxc profile add gb-system gpu
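
Before moving on, confirm the device is defined and the profile is attached (the container here is named gb-system):

# Inspect the profile and the container's attached profiles
lxc profile show gpu
lxc config show gb-system

# Restart so the GPU device nodes appear inside the container
lxc restart gb-system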

Inside the Container

Configure NVIDIA driver version pinning and install drivers:

  1. Pin NVIDIA driver versions to ensure stability:
cat > /etc/apt/preferences.d/nvidia-drivers << 'EOF'
Package: *nvidia*
Pin: version 560.35.05-1
Pin-Priority: 1001

Package: cuda-drivers*
Pin: version 560.35.05-1
Pin-Priority: 1001

Package: libcuda*
Pin: version 560.35.05-1
Pin-Priority: 1001

Package: libxnvctrl* 
Pin: version 560.35.05-1
Pin-Priority: 1001

Package: libnv*
Pin: version 560.35.05-1
Pin-Priority: 1001
EOF
  2. Install NVIDIA drivers and CUDA toolkit:
# Update package lists
apt update

# Install NVIDIA driver and nvidia-smi
apt install -y nvidia-driver nvidia-smi

# Add CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb

# Install CUDA toolkit
apt-get update
apt-get -y install cuda-toolkit-12-8
apt-get install -y cuda-drivers
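
Before proceeding, you can confirm the version pins took effect; apt should report the pinned version at priority 1001:

# The candidate version should be 560.35.05-1
apt-cache policy nvidia-driver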

Verify GPU Access

After installation, verify the GPU is accessible:

# Check GPU is visible
nvidia-smi

# Should show your GPU with driver version 560.35.05
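
You can also confirm the CUDA toolkit installed correctly; nvcc should report release 12.8:

# Check the CUDA compiler shipped with cuda-toolkit-12-8
/usr/local/cuda/bin/nvcc --version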

Configure BotServer for GPU

Update your bot’s config.csv to use GPU acceleration:

name,value
llm-server-gpu-layers,35

The number of layers depends on your GPU memory:

  • RTX 3060 (12GB): 20-35 layers
  • RTX 3070 (8GB): 15-25 layers
  • RTX 4070 (12GB): 30-40 layers
  • RTX 4090 (24GB): 50-99 layers
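
Before committing a layer count to config.csv, you can sanity-check it by running the llama.cpp server manually and watching VRAM usage. A sketch, assuming the stock llama-server binary and an illustrative model path:

# Offload 35 layers to the GPU (model path is illustrative)
./llama-server -m /models/model.gguf --n-gpu-layers 35

# In another shell: watch VRAM; lower the layer count if it nears the limit
watch -n1 nvidia-smi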

Troubleshooting

GPU Not Detected

If nvidia-smi doesn’t show the GPU:

  1. Check host GPU drivers:

    # On host
    nvidia-smi
    lxc config device list gb-system
    
  2. Verify GPU passthrough:

    # Inside container
    ls -la /dev/nvidia*
    
  3. Check kernel modules:

    lsmod | grep nvidia
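
If the /dev/nvidia* device nodes are missing even though the host sees the GPU, LXD can also expose the host's NVIDIA runtime directly. This assumes the nvidia-container toolkit (libnvidia-container) is installed on the host:

# On host: mount the NVIDIA runtime into the container
lxc config set gb-system nvidia.runtime true
lxc restart gb-system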
    

Driver Version Mismatch

If you encounter driver version conflicts:

  1. Ensure host and container use the same driver version
  2. Remove the version pinning file and install matching drivers:
    rm /etc/apt/preferences.d/nvidia-drivers
    apt update
    apt install nvidia-driver-560
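
To confirm the fix, compare versions on both sides; the strings must match exactly:

# Run on both host and container; the outputs must match
nvidia-smi --query-gpu=driver_version --format=csv,noheader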
    

CUDA Library Issues

If CUDA libraries aren’t found:

# Add CUDA to library path
echo '/usr/local/cuda/lib64' >> /etc/ld.so.conf.d/cuda.conf
ldconfig

# Add to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
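
After updating the cache and PATH, confirm the dynamic linker can resolve the CUDA libraries:

# The CUDA runtime libraries should now be listed
ldconfig -p | grep -E 'libcudart|libcuda'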

Custom llama.cpp Compilation

If you need custom CPU/GPU optimizations or specific hardware support, compile llama.cpp from source:

Prerequisites

sudo apt update
sudo apt install build-essential cmake git

Compilation Steps

# Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Create build directory
mkdir build
cd build

# Configure with CUDA support
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF

# Compile using all available cores
make -j$(nproc)
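
A quick smoke test after the build (in recent llama.cpp checkouts the binaries land in bin/ under the build directory):

# Print build info to confirm the binary starts
./bin/llama-server --version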

Compilation Options

For different hardware configurations:

# CPU-only build (no GPU)
cmake .. -DLLAMA_CURL=OFF

# CUDA targeting a specific compute capability (e.g. 7.5)
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=75

# ROCm for AMD GPUs
cmake .. -DLLAMA_HIPBLAS=ON

# Metal for Apple Silicon
cmake .. -DLLAMA_METAL=ON

# AVX2 optimizations for modern CPUs
cmake .. -DLLAMA_AVX2=ON

# F16C for half-precision support
cmake .. -DLLAMA_F16C=ON

After Compilation

# Copy compiled binary to BotServer
cp bin/llama-server /path/to/botserver-stack/bin/llm/

# Update config.csv to use custom build
llm-server-path,/path/to/botserver-stack/bin/llm/
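
Before pointing config.csv at the new binary, it is worth confirming that the copied file executes from its installed location:

# Quick check that the deployed binary runs
/path/to/botserver-stack/bin/llm/llama-server --version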

Benefits of Custom Compilation

  • Hardware-specific optimizations for your exact CPU/GPU
  • Custom CUDA compute capabilities for newer GPUs
  • AVX/AVX2/AVX512 instructions for faster CPU inference
  • Reduced binary size by excluding unused features
  • Support for experimental features not in releases

Performance Optimization

Memory Settings

For optimal LLM performance with GPU:

name,value
llm-server-gpu-layers,35
llm-server-mlock,true
llm-server-no-mmap,false
llm-server-ctx-size,4096
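
Note that llm-server-mlock locks model weights in RAM to prevent swapping, which can exceed a container's default locked-memory limit. One possible workaround on LXD, assuming your setup permits it, is raising the container's memlock limit from the host:

# On host: let the container lock enough memory for the model
lxc config set gb-system limits.kernel.memlock unlimited
lxc restart gb-system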

Multiple GPUs

For systems with multiple GPUs, add each card to the profile explicitly and then select one inside the container:

# Add each physical GPU to the profile by ID
lxc profile device add gpu gpu0 gpu gputype=physical id=0
lxc profile device add gpu gpu1 gpu gputype=physical id=1
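
Inside the container, you can then pin inference to a specific card with the standard CUDA environment variable:

# List GPUs with their indices, then restrict the LLM process to one
nvidia-smi -L
export CUDA_VISIBLE_DEVICES=0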

Benefits of GPU Acceleration

With GPU acceleration enabled:

  • 5-10x faster inference compared to CPU
  • Higher context sizes possible (8K-32K tokens)
  • Real-time responses even with large models
  • Lower CPU usage for other tasks
  • Support for larger models (13B, 30B parameters)

Next Steps