# Scaling and Load Balancing

General Bots is designed to scale from a single instance to a distributed cluster using LXC containers. This chapter covers auto-scaling, load balancing, sharding strategies, and failover systems.

## Scaling Architecture

General Bots uses a horizontal scaling approach with LXC containers:
```
                    ┌─────────────────┐
                    │   Caddy Proxy   │
                    │ (Load Balancer) │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  LXC Container  │ │  LXC Container  │ │  LXC Container  │
│   botserver-1   │ │   botserver-2   │ │   botserver-3   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   PostgreSQL    │ │      Redis      │ │     Qdrant      │
│    (Primary)    │ │    (Cluster)    │ │    (Cluster)    │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
## Auto-Scaling Configuration

### config.csv Parameters

Configure auto-scaling behavior in your bot’s config.csv:
```csv
# Auto-scaling settings
scale-enabled,true
scale-min-instances,1
scale-max-instances,10
scale-cpu-threshold,70
scale-memory-threshold,80
scale-request-threshold,1000
scale-cooldown-seconds,300
scale-check-interval,30
```
| Parameter | Description | Default |
|---|---|---|
| `scale-enabled` | Enable auto-scaling | `false` |
| `scale-min-instances` | Minimum container count | 1 |
| `scale-max-instances` | Maximum container count | 10 |
| `scale-cpu-threshold` | CPU % that triggers a scale-up | 70 |
| `scale-memory-threshold` | Memory % that triggers a scale-up | 80 |
| `scale-request-threshold` | Requests/min that trigger a scale-up | 1000 |
| `scale-cooldown-seconds` | Wait time between scaling events | 300 |
| `scale-check-interval` | Seconds between metric checks | 30 |
### Scaling Rules

Define custom scaling rules:

```csv
# Scale up when average response time exceeds 2 seconds
scale-rule-response-time,2000
scale-rule-response-action,up

# Scale down when CPU drops below 30%
scale-rule-cpu-low,30
scale-rule-cpu-low-action,down

# Scale up on queue depth
scale-rule-queue-depth,100
scale-rule-queue-action,up
```
## LXC Container Management

### Creating Scaled Instances

```bash
# Create additional botserver containers, each exposed on its own host port
for i in {2..5}; do
    lxc launch images:debian/12 "botserver-$i"
    lxc config device add "botserver-$i" "port-$((8080+i))" proxy \
        listen="tcp:0.0.0.0:$((8080+i))" connect=tcp:127.0.0.1:8080
done
```
### Container Resource Limits

Set resource limits per container:

```bash
# CPU limits (number of cores)
lxc config set botserver-1 limits.cpu 4

# Memory limits
lxc config set botserver-1 limits.memory 8GB

# Disk I/O priority (0-10)
lxc config set botserver-1 limits.disk.priority 5

# Network bandwidth (ingress/egress)
lxc config device set botserver-1 eth0 limits.ingress 100Mbit
lxc config device set botserver-1 eth0 limits.egress 100Mbit
```
### Auto-Scaling Script

Create `/opt/gbo/scripts/autoscale.sh`:

```bash
#!/bin/bash
# Simple CPU-based auto-scaler for botserver LXC containers.

# Configuration
MIN_INSTANCES=1
MAX_INSTANCES=10
CPU_THRESHOLD=70
SCALE_DOWN_THRESHOLD=30
SCALE_COOLDOWN=300
LAST_SCALE_FILE="/tmp/last_scale_time"

# Approximate cluster CPU% from each container's 1-minute load average,
# treating a load of 1.0 as 100% of one core.
get_avg_cpu() {
    local total=0
    local count=0
    for container in $(lxc list -c n --format csv | grep "^botserver-"); do
        cpu=$(lxc exec "$container" -- cat /proc/loadavg | awk '{print $1}')
        total=$(echo "$total + $cpu" | bc)
        count=$((count + 1))
    done
    if [ "$count" -eq 0 ]; then
        echo 0
        return
    fi
    echo "scale=2; $total / $count * 100" | bc
}

get_instance_count() {
    lxc list -c n --format csv | grep -c "^botserver-"
}

# Enforce the cooldown period between scaling events.
can_scale() {
    if [ ! -f "$LAST_SCALE_FILE" ]; then
        return 0
    fi
    last_scale=$(cat "$LAST_SCALE_FILE")
    now=$(date +%s)
    diff=$((now - last_scale))
    [ "$diff" -gt "$SCALE_COOLDOWN" ]
}

scale_up() {
    current=$(get_instance_count)
    if [ "$current" -ge "$MAX_INSTANCES" ]; then
        echo "Already at max instances ($MAX_INSTANCES)"
        return 1
    fi
    new_id=$((current + 1))
    echo "Scaling up: creating botserver-$new_id"
    lxc launch images:debian/12 "botserver-$new_id"
    lxc config set "botserver-$new_id" limits.cpu 4
    lxc config set "botserver-$new_id" limits.memory 8GB

    # Copy configuration (create the target directory first)
    lxc exec "botserver-$new_id" -- mkdir -p /opt/gbo/conf
    lxc file push /opt/gbo/conf/botserver.env "botserver-$new_id/opt/gbo/conf/"

    # Start botserver in the background
    # (assumes the base image or a profile provides /opt/gbo/bin/botserver)
    lxc exec "botserver-$new_id" -- /opt/gbo/bin/botserver &

    # Update load balancer
    update_load_balancer
    date +%s > "$LAST_SCALE_FILE"
    echo "Scale up complete"
}

scale_down() {
    current=$(get_instance_count)
    if [ "$current" -le "$MIN_INSTANCES" ]; then
        echo "Already at min instances ($MIN_INSTANCES)"
        return 1
    fi

    # Remove the highest-numbered instance
    target="botserver-$current"
    echo "Scaling down: removing $target"

    # Drain connections before stopping
    lxc exec "$target" -- /opt/gbo/bin/botserver drain
    sleep 30

    # Stop and delete
    lxc stop "$target"
    lxc delete "$target"

    # Update load balancer
    update_load_balancer
    date +%s > "$LAST_SCALE_FILE"
    echo "Scale down complete"
}

# Regenerate the reverse_proxy snippet imported by the main Caddyfile,
# listing every botserver container by IP, then reload Caddy.
update_load_balancer() {
    upstreams=""
    for container in $(lxc list -c n --format csv | grep "^botserver-"); do
        ip=$(lxc list "$container" -c 4 --format csv | cut -d' ' -f1)
        upstreams="$upstreams $ip:8080"
    done
    cat > /opt/gbo/conf/caddy/upstream.conf << EOF
reverse_proxy$upstreams {
    lb_policy round_robin
    health_uri /api/health
    health_interval 10s
}
EOF
    # Reload Caddy
    lxc exec proxy-1 -- caddy reload --config /etc/caddy/Caddyfile
}

# Main loop: check average CPU every 30 seconds and scale accordingly
while true; do
    avg_cpu=$(get_avg_cpu)
    echo "Average CPU: $avg_cpu%"
    if can_scale; then
        if (( $(echo "$avg_cpu > $CPU_THRESHOLD" | bc -l) )); then
            scale_up
        elif (( $(echo "$avg_cpu < $SCALE_DOWN_THRESHOLD" | bc -l) )); then
            scale_down
        fi
    fi
    sleep 30
done
```
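The script runs as a foreground loop, so it needs a supervisor to survive crashes and reboots. A minimal systemd unit sketch (the unit name and paths are assumptions, not shipped with General Bots):

```ini
# /etc/systemd/system/gbo-autoscale.service (hypothetical path)
[Unit]
Description=General Bots auto-scaling loop
After=network-online.target

[Service]
ExecStart=/opt/gbo/scripts/autoscale.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now gbo-autoscale`.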
## Load Balancing

### Caddy Configuration

Primary load balancer configuration (`/opt/gbo/conf/caddy/Caddyfile`):
```caddyfile
{
    admin off
    # Automatic HTTPS is enabled by default; no directive is needed to turn it on
}

(common) {
    encode gzip zstd
    header {
        -Server
        X-Content-Type-Options "nosniff"
        X-Frame-Options "DENY"
        Referrer-Policy "strict-origin-when-cross-origin"
    }
}

bot.example.com {
    import common

    # Health check endpoint (no load balancing)
    handle /api/health {
        reverse_proxy localhost:8080
    }

    # WebSocket connections (sticky sessions)
    handle /ws* {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy cookie
            lb_try_duration 5s
            health_uri /api/health
            health_interval 10s
            health_timeout 5s
        }
    }

    # API requests (round robin)
    handle /api/* {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy round_robin
            lb_try_duration 5s
            health_uri /api/health
            health_interval 10s
            fail_duration 30s
        }
    }

    # Static files (any instance)
    handle {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy first
        }
    }
}
```
### Load Balancing Policies

| Policy | Description | Use Case |
|---|---|---|
| `round_robin` | Rotate through backends | General API requests |
| `first` | Use the first available backend | Static content |
| `least_conn` | Fewest active connections | Long-running requests |
| `ip_hash` | Consistent choice by client IP | Session affinity |
| `cookie` | Sticky sessions via cookie | WebSocket, stateful |
| `random` | Random selection | Testing |
### Rate Limiting

Configure rate limits in config.csv:

```csv
# Rate limiting
rate-limit-enabled,true
rate-limit-requests,100
rate-limit-window,60
rate-limit-burst,20
rate-limit-by,ip

# Per-endpoint limits
rate-limit-api-chat,30
rate-limit-api-files,50
rate-limit-api-auth,10
```
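For intuition, here is what a fixed-window limiter behind `rate-limit-requests` and `rate-limit-window` could look like. This is an illustrative Rust sketch (it ignores `rate-limit-burst`), not the botserver implementation; the `RateLimiter` type is hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window rate limiter keyed by client IP (rate-limit-by,ip).
/// Illustrative sketch only.
struct RateLimiter {
    max_events: u32,
    window: Duration,
    windows: HashMap<String, (Instant, u32)>, // key -> (window start, count)
}

impl RateLimiter {
    fn new(max_events: u32, window_secs: u64) -> Self {
        Self {
            max_events,
            window: Duration::from_secs(window_secs),
            windows: HashMap::new(),
        }
    }

    /// Returns true if the request identified by `key` is allowed.
    fn allow(&mut self, key: &str) -> bool {
        let now = Instant::now();
        let entry = self.windows.entry(key.to_string()).or_insert((now, 0));
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0); // window expired: start a fresh one
        }
        if entry.1 < self.max_events {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    // rate-limit-requests,100 / rate-limit-window,60
    let mut limiter = RateLimiter::new(100, 60);
    assert!(limiter.allow("203.0.113.7"));
}
```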
Rate limiting in Caddy (the `rate_limit` directive comes from the caddy-ratelimit module and is not included in the standard Caddy build):

```caddyfile
bot.example.com {
    # Global rate limit
    rate_limit {
        zone global {
            key {remote_host}
            events 100
            window 1m
        }
    }

    # Stricter limit for auth endpoints
    handle /api/auth/* {
        rate_limit {
            zone auth {
                key {remote_host}
                events 10
                window 1m
            }
        }
        reverse_proxy botserver:8080
    }
}
```
## Sharding Strategies

### Database Sharding Options

#### Option 1: Tenant-Based Sharding

Each tenant gets its own database:
```
       ┌─────────────────┐
       │  Router/Proxy   │
       └────────┬────────┘
                │
    ┌───────────┼───────────┐
    │           │           │
    ▼           ▼           ▼
┌───────┐   ┌───────┐   ┌───────┐
│Tenant1│   │Tenant2│   │Tenant3│
│  DB   │   │  DB   │   │  DB   │
└───────┘   └───────┘   └───────┘
```
Configuration:

```csv
# Tenant sharding
shard-strategy,tenant
shard-tenant-db-prefix,gb_tenant_
shard-auto-create,true
```
#### Option 2: Hash-Based Sharding

Distribute data by a hash of the primary key:

```
User ID: 12345
Hash:    12345 % 4 = 1
Shard:   shard-1
```
Configuration:

```csv
# Hash sharding
shard-strategy,hash
shard-count,4
shard-key,user_id
shard-algorithm,modulo
```
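The routing logic behind this strategy is only a few lines. A hedged Rust sketch (the `shard_for` helper is hypothetical, not a botserver API):

```rust
/// Modulo shard routing matching shard-count,4 and shard-algorithm,modulo.
fn shard_for(user_id: u64, shard_count: u64) -> String {
    format!("shard-{}", user_id % shard_count)
}

fn main() {
    // User ID 12345 with 4 shards lands on shard-1, as in the example above.
    assert_eq!(shard_for(12345, 4), "shard-1");
}
```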
#### Option 3: Range-Based Sharding

Partition by ID ranges:

```csv
# Range sharding
shard-strategy,range
shard-ranges,0-999999:shard1,1000000-1999999:shard2,2000000-:shard3
```
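Resolving an ID against such a range list could look like the Rust sketch below; the `shard_for_id` helper is an illustrative assumption, with an empty upper bound (as in `2000000-`) meaning "no limit":

```rust
/// Parse a shard-ranges value like "0-999999:shard1,..." and return the
/// shard owning `id`. Illustrative only; returns None on malformed input.
fn shard_for_id(ranges: &str, id: u64) -> Option<String> {
    for spec in ranges.split(',') {
        let (range, shard) = spec.split_once(':')?;
        let (lo, hi) = range.split_once('-')?;
        let lo: u64 = lo.parse().ok()?;
        let hi: u64 = if hi.is_empty() { u64::MAX } else { hi.parse().ok()? };
        if id >= lo && id <= hi {
            return Some(shard.to_string());
        }
    }
    None
}

fn main() {
    let ranges = "0-999999:shard1,1000000-1999999:shard2,2000000-:shard3";
    assert_eq!(shard_for_id(ranges, 1_500_000).as_deref(), Some("shard2"));
}
```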
#### Option 4: Geographic Sharding

Route by user location:

```csv
# Geographic sharding
shard-strategy,geo
shard-geo-us,postgres-us.example.com
shard-geo-eu,postgres-eu.example.com
shard-geo-asia,postgres-asia.example.com
shard-default,postgres-us.example.com
```
### Vector Database Sharding (Qdrant)

Qdrant supports automatic sharding:

```csv
# Qdrant sharding
qdrant-shard-count,4
qdrant-replication-factor,2
qdrant-write-consistency,majority
```
Collection creation with sharding:
#![allow(unused)] fn main() { // In vectordb code let collection_config = CreateCollection { collection_name: format!("kb_{}", bot_id), vectors_config: VectorsConfig::Single(VectorParams { size: 384, distance: Distance::Cosine, }), shard_number: Some(4), replication_factor: Some(2), write_consistency_factor: Some(1), ..Default::default() }; }
### Redis Cluster

For high-availability caching:

```csv
# Redis cluster
cache-mode,cluster
cache-nodes,redis-1:6379,redis-2:6379,redis-3:6379
cache-replicas,1
```
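From application code, connecting to these nodes could look like the sketch below, assuming the `redis` crate with its `cluster` feature enabled; the hostnames are the example values from the config above, not real endpoints:

```rust
use redis::cluster::ClusterClient;
use redis::Commands;

fn main() -> redis::RedisResult<()> {
    // The nodes from cache-nodes above; the client discovers the rest
    // of the cluster topology from any reachable node.
    let nodes = vec![
        "redis://redis-1:6379/",
        "redis://redis-2:6379/",
        "redis://redis-3:6379/",
    ];
    let client = ClusterClient::new(nodes)?;
    let mut conn = client.get_connection()?;

    // Keys are routed to the correct shard transparently.
    let _: () = conn.set("scaling:test", "ok")?;
    let value: String = conn.get("scaling:test")?;
    println!("cached value: {value}");
    Ok(())
}
```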
## Failover Systems

### Health Checks

Configure health check endpoints:

```csv
# Health check configuration
health-enabled,true
health-endpoint,/api/health
health-interval,10
health-timeout,5
health-retries,3
```
Health check response:

```json
{
  "status": "healthy",
  "version": "6.1.0",
  "uptime": 86400,
  "checks": {
    "database": "ok",
    "cache": "ok",
    "vectordb": "ok",
    "llm": "ok"
  },
  "metrics": {
    "cpu": 45.2,
    "memory": 62.1,
    "connections": 150
  }
}
```
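A poller that applies these settings might look like the Rust sketch below: it checks every `health-interval` seconds and gives up on the instance after `health-retries` consecutive failures. The `probe` closure stands in for an HTTP GET of the health endpoint and is purely illustrative.

```rust
use std::time::Duration;

/// Illustrative health-check loop mirroring the config.csv settings above.
fn watch_instance(mut probe: impl FnMut(Duration) -> bool) {
    let interval = Duration::from_secs(10); // health-interval,10
    let timeout = Duration::from_secs(5);   // health-timeout,5
    let retries = 3;                        // health-retries,3
    let mut failures = 0;

    loop {
        if probe(timeout) {
            failures = 0; // any success resets the streak
        } else {
            failures += 1;
            if failures >= retries {
                println!("instance unhealthy: removing from rotation");
                break;
            }
        }
        std::thread::sleep(interval);
    }
}

fn main() {
    // Simulate two failures, a recovery, then a fatal failure streak.
    let mut responses =
        vec![true, false, false, true, false, false, false].into_iter();
    watch_instance(|_timeout| responses.next().unwrap_or(false));
}
```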
### Automatic Failover

#### Database Failover (PostgreSQL)

Using Patroni for PostgreSQL HA:

```yaml
# patroni.yml
scope: botserver-cluster
name: postgres-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgres-1:8008

etcd:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 2GB

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgres-1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: ${POSTGRES_PASSWORD}
    replication:
      username: replicator
      password: ${REPLICATION_PASSWORD}
```
#### Cache Failover (Redis Sentinel)

```csv
# Redis Sentinel configuration
cache-mode,sentinel
cache-sentinel-master,mymaster
cache-sentinel-nodes,sentinel-1:26379,sentinel-2:26379,sentinel-3:26379
```
### Circuit Breaker

Prevent cascade failures:

```csv
# Circuit breaker settings
circuit-breaker-enabled,true
circuit-breaker-threshold,5
circuit-breaker-timeout,30
circuit-breaker-half-open-requests,3
```

States (see the sketch after this list):

- **Closed**: Normal operation
- **Open**: Failing; reject requests immediately
- **Half-Open**: Testing whether the service has recovered
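A minimal sketch of this state machine in Rust, wired to the settings above; this is illustrative only, not the botserver implementation:

```rust
use std::time::{Duration, Instant};

// The three states from the list above.
#[derive(Clone, Copy, PartialEq)]
enum State { Closed, Open, HalfOpen }

struct CircuitBreaker {
    state: State,
    failures: u32,
    half_open_successes: u32,
    opened_at: Option<Instant>,
    threshold: u32,          // circuit-breaker-threshold,5
    timeout: Duration,       // circuit-breaker-timeout,30
    half_open_requests: u32, // circuit-breaker-half-open-requests,3
}

impl CircuitBreaker {
    fn new() -> Self {
        Self {
            state: State::Closed,
            failures: 0,
            half_open_successes: 0,
            opened_at: None,
            threshold: 5,
            timeout: Duration::from_secs(30),
            half_open_requests: 3,
        }
    }

    /// Check before calling the downstream service.
    fn allow(&mut self) -> bool {
        if self.state == State::Open {
            match self.opened_at {
                // After the timeout, let probe requests through (Half-Open).
                Some(t) if t.elapsed() >= self.timeout => {
                    self.state = State::HalfOpen;
                    self.half_open_successes = 0;
                }
                // Still Open: reject immediately.
                _ => return false,
            }
        }
        true
    }

    /// Report the outcome of a downstream call.
    fn record(&mut self, ok: bool) {
        match (self.state, ok) {
            (State::Closed, false) => {
                self.failures += 1;
                if self.failures >= self.threshold {
                    self.state = State::Open;
                    self.opened_at = Some(Instant::now());
                }
            }
            (State::Closed, true) => self.failures = 0,
            (State::HalfOpen, true) => {
                self.half_open_successes += 1;
                if self.half_open_successes >= self.half_open_requests {
                    self.state = State::Closed;
                    self.failures = 0;
                }
            }
            (State::HalfOpen, false) => {
                self.state = State::Open;
                self.opened_at = Some(Instant::now());
            }
            (State::Open, _) => {}
        }
    }
}

fn main() {
    let mut cb = CircuitBreaker::new();
    // Five consecutive failures trip the breaker (threshold 5).
    for _ in 0..5 {
        if cb.allow() {
            cb.record(false);
        }
    }
    assert!(!cb.allow()); // Open: rejected without calling the service
}
```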
### Graceful Degradation

Configure fallback behavior:

```csv
# Fallback configuration
fallback-llm-enabled,true
fallback-llm-provider,local
fallback-llm-model,DeepSeek-R1-Distill-Qwen-1.5B
fallback-cache-enabled,true
fallback-cache-mode,memory
fallback-vectordb-enabled,true
fallback-vectordb-mode,keyword-search
```
## Monitoring Scaling

### Metrics Collection

Key metrics to monitor:

```csv
# Scaling metrics
metrics-scaling-enabled,true
metrics-container-count,true
metrics-scaling-events,true
metrics-load-distribution,true
```
### Alerting Rules

Configure alerts for scaling issues:

```yaml
# alerting-rules.yml
groups:
  - name: scaling
    rules:
      - alert: HighCPUUsage
        expr: avg(cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"

      - alert: MaxInstancesReached
        expr: container_count >= max_instances
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Maximum instances reached, cannot scale up"

      - alert: ScalingFailed
        expr: scaling_errors > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Scaling operation failed"
```
## Best Practices

### Scaling

- **Start small** - Begin with auto-scaling disabled and observe traffic patterns first
- **Set appropriate thresholds** - Too low causes thrashing; too high causes poor performance
- **Use cooldown periods** - Prevent rapid scale-up/scale-down cycles
- **Test failover** - Regularly exercise your failover procedures
- **Monitor costs** - More instances mean higher infrastructure costs

### Load Balancing

- **Use sticky sessions for WebSockets** - Required for real-time features
- **Enable health checks** - Remove unhealthy instances automatically
- **Configure timeouts** - Prevent hanging connections
- **Use connection pooling** - Reduce connection overhead

### Sharding

- **Choose the right strategy** - Tenant-based is simplest for SaaS
- **Plan for rebalancing** - Have procedures to move data between shards
- **Avoid cross-shard queries** - Design the schema to minimize them
- **Monitor shard balance** - Uneven distribution causes hotspots
## Next Steps

- Container Deployment - LXC container basics
- Architecture Overview - System design
- Monitoring Dashboard - Observe your cluster