Scaling and Load Balancing

General Bots is designed to scale from a single instance to a distributed cluster using LXC containers. This chapter covers auto-scaling, load balancing, sharding strategies, and failover systems.

Scaling Architecture

General Bots uses a horizontal scaling approach with LXC containers:

                    ┌─────────────────┐
                    │   Caddy Proxy   │
                    │  (Load Balancer)│
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  LXC Container  │ │  LXC Container  │ │  LXC Container  │
│   botserver-1   │ │   botserver-2   │ │   botserver-3   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   PostgreSQL    │ │     Redis       │ │     Qdrant      │
│   (Primary)     │ │   (Cluster)     │ │   (Cluster)     │
└─────────────────┘ └─────────────────┘ └─────────────────┘

Auto-Scaling Configuration

config.csv Parameters

Configure auto-scaling behavior in your bot’s config.csv:

# Auto-scaling settings
scale-enabled,true
scale-min-instances,1
scale-max-instances,10
scale-cpu-threshold,70
scale-memory-threshold,80
scale-request-threshold,1000
scale-cooldown-seconds,300
scale-check-interval,30
Parameter                  Description                         Default
scale-enabled              Enable auto-scaling                 false
scale-min-instances        Minimum container count             1
scale-max-instances        Maximum container count             10
scale-cpu-threshold        CPU % to trigger scale-up           70
scale-memory-threshold     Memory % to trigger scale-up        80
scale-request-threshold    Requests/min to trigger scale-up    1000
scale-cooldown-seconds     Wait time between scaling events    300
scale-check-interval       Seconds between metric checks       30

Scaling Rules

Define custom scaling rules:

# Scale up when average response time exceeds 2 seconds
scale-rule-response-time,2000
scale-rule-response-action,up

# Scale down when CPU drops below 30%
scale-rule-cpu-low,30
scale-rule-cpu-low-action,down

# Scale up on queue depth
scale-rule-queue-depth,100
scale-rule-queue-action,up

LXC Container Management

Creating Scaled Instances

# Create additional botserver containers
for i in {2..5}; do
  lxc launch images:debian/12 botserver-$i
  lxc config device add botserver-$i port-$((8080+i)) proxy \
    listen=tcp:0.0.0.0:$((8080+i)) connect=tcp:127.0.0.1:8080
done
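
After creating the containers, it helps to confirm that they are running and reachable. A quick check with standard LXC commands, reusing the port mapping above (the curl probe assumes the health endpoint described later in this chapter):

# List botserver containers with state and IPv4 address
lxc list botserver -c ns4

# Probe one instance through its proxied host port (8080 + instance number)
curl -s http://127.0.0.1:8082/api/health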

Container Resource Limits

Set resource limits per container:

# CPU limits (number of cores)
lxc config set botserver-1 limits.cpu 4

# Memory limits
lxc config set botserver-1 limits.memory 8GB

# Disk I/O priority (0-10)
lxc config set botserver-1 limits.disk.priority 5

# Network bandwidth (ingress/egress)
lxc config device set botserver-1 eth0 limits.ingress 100Mbit
lxc config device set botserver-1 eth0 limits.egress 100Mbit
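
To confirm that the limits were applied, inspect the expanded configuration of a container:

# Show the effective configuration, including limits, for one container
lxc config show botserver-1 --expanded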

Auto-Scaling Script

Create /opt/gbo/scripts/autoscale.sh:

#!/bin/bash

# Configuration
MIN_INSTANCES=1
MAX_INSTANCES=10
CPU_THRESHOLD=70
SCALE_COOLDOWN=300
LAST_SCALE_FILE="/tmp/last_scale_time"

get_avg_cpu() {
    local total=0
    local count=0
    for container in $(lxc list -c n --format csv | grep "^botserver-"); do
        cpu=$(lxc exec $container -- cat /proc/loadavg | awk '{print $1}')
        total=$(echo "$total + $cpu" | bc)
        count=$((count + 1))
    done
    echo "scale=2; $total / $count * 100" | bc
}

get_instance_count() {
    lxc list -c n --format csv | grep -c "^botserver-"
}

can_scale() {
    if [ ! -f "$LAST_SCALE_FILE" ]; then
        return 0
    fi
    last_scale=$(cat "$LAST_SCALE_FILE")
    now=$(date +%s)
    diff=$((now - last_scale))
    [ $diff -gt $SCALE_COOLDOWN ]
}

scale_up() {
    current=$(get_instance_count)
    if [ $current -ge $MAX_INSTANCES ]; then
        echo "Already at max instances ($MAX_INSTANCES)"
        return 1
    fi
    
    new_id=$((current + 1))
    echo "Scaling up: creating botserver-$new_id"
    
    lxc launch images:debian/12 botserver-$new_id
    lxc config set botserver-$new_id limits.cpu 4
    lxc config set botserver-$new_id limits.memory 8GB
    
    # Copy configuration
    lxc file push /opt/gbo/conf/botserver.env botserver-$new_id/opt/gbo/conf/
    
    # Start botserver
    lxc exec botserver-$new_id -- /opt/gbo/bin/botserver &
    
    # Update load balancer
    update_load_balancer
    
    date +%s > "$LAST_SCALE_FILE"
    echo "Scale up complete"
}

scale_down() {
    current=$(get_instance_count)
    if [ $current -le $MIN_INSTANCES ]; then
        echo "Already at min instances ($MIN_INSTANCES)"
        return 1
    fi
    
    # Remove highest numbered instance
    target="botserver-$current"
    echo "Scaling down: removing $target"
    
    # Drain connections
    lxc exec $target -- /opt/gbo/bin/botserver drain
    sleep 30
    
    # Stop and delete
    lxc stop $target
    lxc delete $target
    
    # Update load balancer
    update_load_balancer
    
    date +%s > "$LAST_SCALE_FILE"
    echo "Scale down complete"
}

update_load_balancer() {
    # Build one "to <ip>:8080" line per running botserver container
    upstreams=""
    for container in $(lxc list -c n --format csv | grep "^botserver-"); do
        ip=$(lxc list $container -c 4 --format csv | cut -d' ' -f1)
        upstreams="$upstreams"$'\n'"        to $ip:8080"
    done

    # Write a Caddy snippet with the current upstream list; the main
    # Caddyfile is expected to import it (import botserver_upstreams)
    # inside its reverse_proxy handlers.
    cat > /opt/gbo/conf/caddy/upstream.conf << EOF
(botserver_upstreams) {
    reverse_proxy {$upstreams
        lb_policy round_robin
        health_uri /api/health
        health_interval 10s
    }
}
EOF
    
    # Reload Caddy
    lxc exec proxy-1 -- caddy reload --config /etc/caddy/Caddyfile
}

# Main loop
while true; do
    avg_cpu=$(get_avg_cpu)
    echo "Average CPU: $avg_cpu%"
    
    if can_scale; then
        if (( $(echo "$avg_cpu > $CPU_THRESHOLD" | bc -l) )); then
            scale_up
        elif (( $(echo "$avg_cpu < 30" | bc -l) )); then
            scale_down
        fi
    fi
    
    sleep 30
done
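
To keep this loop running across reboots, one option is a small systemd service on the LXC host. The unit name below is an assumption; the paths follow the /opt/gbo layout used in this chapter:

# /etc/systemd/system/gbo-autoscale.service
[Unit]
Description=General Bots auto-scaling loop
After=network-online.target

[Service]
ExecStart=/opt/gbo/scripts/autoscale.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now gbo-autoscale.service.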

Load Balancing

Caddy Configuration

Primary load balancer configuration (/opt/gbo/conf/caddy/Caddyfile):

{
    admin off
    # automatic HTTPS is enabled by default; use "auto_https off" to disable it
}

(common) {
    encode gzip zstd
    header {
        -Server
        X-Content-Type-Options "nosniff"
        X-Frame-Options "DENY"
        Referrer-Policy "strict-origin-when-cross-origin"
    }
}

bot.example.com {
    import common
    
    # Health check endpoint (no load balancing)
    handle /api/health {
        reverse_proxy localhost:8080
    }
    
    # WebSocket connections (sticky sessions)
    handle /ws* {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy cookie
            lb_try_duration 5s
            health_uri /api/health
            health_interval 10s
            health_timeout 5s
        }
    }
    
    # API requests (round robin)
    handle /api/* {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy round_robin
            lb_try_duration 5s
            health_uri /api/health
            health_interval 10s
            fail_duration 30s
        }
    }
    
    # Static files (any instance)
    handle {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy first
        }
    }
}
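
Before reloading, the configuration can be checked for syntax errors, then applied without downtime (proxy-1 is the container name used by the auto-scaling script):

# Validate the Caddyfile inside the proxy container
lxc exec proxy-1 -- caddy validate --config /etc/caddy/Caddyfile

# Apply the change with a zero-downtime reload
lxc exec proxy-1 -- caddy reload --config /etc/caddy/Caddyfile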

Load Balancing Policies

Policy         Description                   Use Case
round_robin    Rotate through backends       General API requests
first          Use first available           Static content
least_conn     Fewest active connections     Long-running requests
ip_hash        Consistent by client IP       Session affinity
cookie         Sticky sessions via cookie    WebSocket, stateful
random         Random selection              Testing

Rate Limiting

Configure rate limits in config.csv:

# Rate limiting
rate-limit-enabled,true
rate-limit-requests,100
rate-limit-window,60
rate-limit-burst,20
rate-limit-by,ip

# Per-endpoint limits
rate-limit-api-chat,30
rate-limit-api-files,50
rate-limit-api-auth,10

Rate limiting in Caddy (requires a build that includes the third-party rate limit module):

bot.example.com {
    # Global rate limit
    rate_limit {
        zone global {
            key {remote_host}
            events 100
            window 1m
        }
    }
    
    # Stricter limit for auth endpoints
    handle /api/auth/* {
        rate_limit {
            zone auth {
                key {remote_host}
                events 10
                window 1m
            }
        }
        reverse_proxy botserver:8080
    }
}
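
A quick way to see the limiter working is to send a burst of requests and watch for 429 responses once a zone's events are exhausted. The endpoint below is illustrative; any path matched by the auth handler will do:

# Send 15 rapid requests; after the auth zone's 10 events, expect HTTP 429
for i in $(seq 1 15); do
  curl -s -o /dev/null -w "%{http_code}\n" https://bot.example.com/api/auth/login
done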

Sharding Strategies

Database Sharding Options

Option 1: Tenant-Based Sharding

Each tenant gets their own database:

┌─────────────────┐
│   Router/Proxy  │
└────────┬────────┘
         │
    ┌────┴────┬──────────┐
    │         │          │
    ▼         ▼          ▼
┌───────┐ ┌───────┐ ┌───────┐
│Tenant1│ │Tenant2│ │Tenant3│
│  DB   │ │  DB   │ │  DB   │
└───────┘ └───────┘ └───────┘

Configuration:

# Tenant sharding
shard-strategy,tenant
shard-tenant-db-prefix,gb_tenant_
shard-auto-create,true

Option 2: Hash-Based Sharding

Distribute data by hash of primary key:

User ID: 12345
Hash: 12345 % 4 = 1
Shard: shard-1

Configuration:

# Hash sharding
shard-strategy,hash
shard-count,4
shard-key,user_id
shard-algorithm,modulo
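
A minimal sketch of how this modulo routing might look in application code. The function is illustrative and not part of the botserver API; shard_count mirrors the shard-count setting above:

/// Pick a shard index for a record using the modulo strategy.
fn shard_for(user_id: u64, shard_count: u64) -> u64 {
    user_id % shard_count
}

fn main() {
    // User ID 12345 with 4 shards lands on shard 1, matching the worked example.
    println!("user 12345 -> shard-{}", shard_for(12345, 4));
}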

Option 3: Range-Based Sharding

Partition by ID ranges:

# Range sharding
shard-strategy,range
shard-ranges,0-999999:shard1,1000000-1999999:shard2,2000000-:shard3

Option 4: Geographic Sharding

Route by user location:

# Geographic sharding
shard-strategy,geo
shard-geo-us,postgres-us.example.com
shard-geo-eu,postgres-eu.example.com
shard-geo-asia,postgres-asia.example.com
shard-default,postgres-us.example.com

Vector Database Sharding (Qdrant)

Qdrant supports automatic sharding:

# Qdrant sharding
qdrant-shard-count,4
qdrant-replication-factor,2
qdrant-write-consistency,majority

Collection creation with sharding:

// In the vector DB layer: create a knowledge-base collection with explicit
// sharding and replication. Types follow the qdrant-client crate; exact
// field names may vary by crate version.
use qdrant_client::qdrant::{
    vectors_config::Config, CreateCollection, Distance, VectorParams, VectorsConfig,
};

let collection_config = CreateCollection {
    collection_name: format!("kb_{}", bot_id),
    vectors_config: Some(VectorsConfig {
        config: Some(Config::Params(VectorParams {
            size: 384,
            distance: Distance::Cosine.into(),
            ..Default::default()
        })),
    }),
    shard_number: Some(4),
    replication_factor: Some(2),
    write_consistency_factor: Some(1),
    ..Default::default()
};

Redis Cluster

For high-availability caching:

# Redis cluster
cache-mode,cluster
cache-nodes,redis-1:6379,redis-2:6379,redis-3:6379
cache-replicas,1
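
Once the nodes are up, cluster health and slot coverage can be verified from any member (node names follow the configuration above):

# Overall cluster state, known nodes, and slot assignment
redis-cli -h redis-1 -p 6379 cluster info
redis-cli -h redis-1 -p 6379 cluster nodes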

Failover Systems

Health Checks

Configure health check endpoints:

# Health check configuration
health-enabled,true
health-endpoint,/api/health
health-interval,10
health-timeout,5
health-retries,3

Health check response:

{
  "status": "healthy",
  "version": "6.1.0",
  "uptime": 86400,
  "checks": {
    "database": "ok",
    "cache": "ok",
    "vectordb": "ok",
    "llm": "ok"
  },
  "metrics": {
    "cpu": 45.2,
    "memory": 62.1,
    "connections": 150
  }
}
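
The endpoint can be probed manually, which is also useful when wiring it into external monitors (jq is only used here for readability):

# Query the health endpoint and extract the overall status
curl -s https://bot.example.com/api/health | jq -r .status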

Automatic Failover

Database Failover (PostgreSQL)

Using Patroni for PostgreSQL HA:

# patroni.yml
scope: botserver-cluster
name: postgres-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgres-1:8008

etcd:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 2GB

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgres-1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: ${POSTGRES_PASSWORD}
    replication:
      username: replicator
      password: ${REPLICATION_PASSWORD}
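
Cluster membership, the current leader, and replication lag can be inspected with patronictl; the configuration path below is an assumption:

# Show members, roles, and lag for the cluster named by "scope" above
patronictl -c /etc/patroni/patroni.yml list botserver-cluster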

Cache Failover (Redis Sentinel)

# Redis Sentinel configuration
cache-mode,sentinel
cache-sentinel-master,mymaster
cache-sentinel-nodes,sentinel-1:26379,sentinel-2:26379,sentinel-3:26379

Circuit Breaker

Prevent cascade failures:

# Circuit breaker settings
circuit-breaker-enabled,true
circuit-breaker-threshold,5
circuit-breaker-timeout,30
circuit-breaker-half-open-requests,3

States:

  • Closed: Normal operation
  • Open: Failing, reject requests immediately
  • Half-Open: Testing if service recovered
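
A minimal sketch of this state machine in Rust, with field names mapped to the settings above (illustrative only, not the botserver implementation):

use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen { successes: u32 },
}

struct CircuitBreaker {
    state: State,
    failures: u32,
    threshold: u32,           // circuit-breaker-threshold
    open_timeout: Duration,   // circuit-breaker-timeout (seconds)
    half_open_required: u32,  // circuit-breaker-half-open-requests
}

impl CircuitBreaker {
    fn new(threshold: u32, open_timeout: Duration, half_open_required: u32) -> Self {
        Self { state: State::Closed, failures: 0, threshold, open_timeout, half_open_required }
    }

    /// Decide whether a request may go through right now.
    fn allow(&mut self) -> bool {
        match self.state {
            State::Closed | State::HalfOpen { .. } => true,
            State::Open { since } => {
                // After the timeout, let limited traffic probe the backend.
                if since.elapsed() >= self.open_timeout {
                    self.state = State::HalfOpen { successes: 0 };
                    true
                } else {
                    false
                }
            }
        }
    }

    /// Record the outcome of a request and update the state.
    fn record(&mut self, success: bool) {
        match self.state {
            State::Closed if success => self.failures = 0,
            State::Closed => {
                self.failures += 1;
                if self.failures >= self.threshold {
                    self.state = State::Open { since: Instant::now() };
                }
            }
            State::HalfOpen { successes } if success => {
                if successes + 1 >= self.half_open_required {
                    self.failures = 0;
                    self.state = State::Closed;
                } else {
                    self.state = State::HalfOpen { successes: successes + 1 };
                }
            }
            // Any failure while probing (or while open) re-opens the breaker.
            State::HalfOpen { .. } | State::Open { .. } => {
                if !success {
                    self.state = State::Open { since: Instant::now() };
                }
            }
        }
    }
}

In practice the breaker wraps each outbound dependency call (LLM, database, cache): call allow() before the request and record() with the outcome.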

Graceful Degradation

Configure fallback behavior:

# Fallback configuration
fallback-llm-enabled,true
fallback-llm-provider,local
fallback-llm-model,DeepSeek-R1-Distill-Qwen-1.5B

fallback-cache-enabled,true
fallback-cache-mode,memory

fallback-vectordb-enabled,true
fallback-vectordb-mode,keyword-search

Monitoring Scaling

Metrics Collection

Key metrics to monitor:

# Scaling metrics
metrics-scaling-enabled,true
metrics-container-count,true
metrics-scaling-events,true
metrics-load-distribution,true

Alerting Rules

Configure alerts for scaling issues:

# alerting-rules.yml
groups:
  - name: scaling
    rules:
      - alert: HighCPUUsage
        expr: avg(cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          
      - alert: MaxInstancesReached
        expr: container_count >= max_instances
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Maximum instances reached, cannot scale up"
          
      - alert: ScalingFailed
        expr: scaling_errors > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Scaling operation failed"
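
If these rules are loaded into a Prometheus-compatible alerting stack (the metric names above are illustrative), the file can be syntax-checked before deployment:

# Validate the alerting rules file
promtool check rules alerting-rules.yml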

Best Practices

Scaling

  1. Start small - Begin with auto-scaling disabled, monitor patterns first
  2. Set appropriate thresholds - Too low causes thrashing, too high causes poor performance
  3. Use cooldown periods - Prevent rapid scale up/down cycles
  4. Test failover - Regularly test your failover procedures
  5. Monitor costs - More instances = higher infrastructure costs

Load Balancing

  1. Use sticky sessions for WebSockets - Required for real-time features
  2. Enable health checks - Remove unhealthy instances automatically
  3. Configure timeouts - Prevent hanging connections
  4. Use connection pooling - Reduce connection overhead

Sharding

  1. Choose the right strategy - Tenant-based is simplest for SaaS
  2. Plan for rebalancing - Have procedures to move data between shards
  3. Avoid cross-shard queries - Design to minimize these
  4. Monitor shard balance - Uneven distribution causes hotspots

Next Steps