
Multimodal Configuration

General Bots integrates with botmodels—a Python service for multimodal AI tasks—to enable image generation, video creation, audio synthesis, and vision capabilities directly from BASIC scripts.


Architecture

┌─────────────┐     HTTPS      ┌─────────────┐
│  botserver  │ ────────────▶  │  botmodels  │
│   (Rust)    │                │  (Python)   │
└─────────────┘                └─────────────┘
      │                              │
      │ BASIC Keywords               │ AI Models
      │ - IMAGE                      │ - Stable Diffusion
      │ - VIDEO                      │ - Zeroscope
      │ - AUDIO                      │ - TTS/Whisper
      │ - SEE                        │ - BLIP2

When a BASIC script calls a multimodal keyword, botserver forwards the request to botmodels, which runs the appropriate AI model and returns the generated content.
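
As a rough sketch of that hand-off, the HTTP call behind the IMAGE keyword looks something like the snippet below. The endpoint path and X-API-Key header are listed in the API table later in this chapter; the JSON field names ("prompt", "steps") are assumptions for illustration, not the exact wire format.

# Rough sketch (Python) of the botserver -> botmodels hand-off for IMAGE.
# Endpoint and header come from the API table below; the JSON field names
# are illustrative assumptions.
import requests

BOTMODELS_URL = "http://localhost:8085"   # botmodels-host / botmodels-port
API_KEY = "your-secret-key"               # botmodels-api-key

response = requests.post(
    f"{BOTMODELS_URL}/api/image/generate",
    headers={"X-API-Key": API_KEY},
    json={"prompt": "a sunset over mountains with purple clouds", "steps": 4},
    timeout=300,  # generation can take minutes without a GPU
)
response.raise_for_status()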

Configuration

Add these settings to your bot’s config.csv file to enable multimodal capabilities.

BotModels Service

Key                  Default   Description
botmodels-enabled    false     Enable botmodels integration
botmodels-host       0.0.0.0   Host address for botmodels service
botmodels-port       8085      Port for botmodels service
botmodels-api-key    (none)    API key for authentication
botmodels-https      false     Use HTTPS for connection

Image Generation

Key                          Default   Description
image-generator-model        (none)    Path to image generation model
image-generator-steps        4         Inference steps (more = higher quality, slower)
image-generator-width        512       Output image width in pixels
image-generator-height       512       Output image height in pixels
image-generator-gpu-layers   20        Layers to offload to GPU
image-generator-batch-size   1         Batch size for generation

Video Generation

Key                          Default   Description
video-generator-model        (none)    Path to video generation model
video-generator-frames       24        Number of frames to generate
video-generator-fps          8         Output frames per second
video-generator-width        320       Output video width in pixels
video-generator-height       576       Output video height in pixels
video-generator-gpu-layers   15        Layers to offload to GPU
video-generator-batch-size   1         Batch size for generation

Example Configuration

key,value
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
image-generator-gpu-layers,20
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8

BASIC Keywords

Once configured, these keywords become available in your scripts.

IMAGE

Generate an image from a text prompt:

file = IMAGE "a sunset over mountains with purple clouds"
SEND FILE TO user, file

The keyword returns a path to the generated image file.

VIDEO

Generate a video from a text prompt:

file = VIDEO "a rocket launching into space"
SEND FILE TO user, file

Video generation is more resource-intensive than image generation. Expect longer processing times.

AUDIO

Generate speech audio from text:

file = AUDIO "Hello, welcome to our service!"
SEND FILE TO user, file

SEE

Analyze an image or video and get a description:

' Describe an image
caption = SEE "/path/to/image.jpg"
TALK caption

' Describe a video
description = SEE "/path/to/video.mp4"
TALK description

The SEE keyword uses vision models to understand visual content and return natural language descriptions.

Starting BotModels

Before using multimodal features, start the botmodels service:

cd botmodels
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085

For production with HTTPS:

python -m uvicorn src.main:app \
    --host 0.0.0.0 \
    --port 8085 \
    --ssl-keyfile key.pem \
    --ssl-certfile cert.pem

BotModels API Endpoints

The botmodels service exposes these REST endpoints:

Endpoint                     Method   Description
/api/image/generate          POST     Generate image from prompt
/api/video/generate          POST     Generate video from prompt
/api/speech/generate         POST     Generate speech from text
/api/speech/totext           POST     Transcribe audio to text
/api/vision/describe         POST     Describe an image
/api/vision/describe_video   POST     Describe a video
/api/vision/vqa              POST     Visual question answering
/api/health                  GET      Health check

All endpoints except /api/health require the X-API-Key header for authentication.

Model Paths

Configure model paths relative to the botmodels service directory. Typical layout:

data/
├── diffusion/
│   ├── sd_turbo_f16.gguf          # Stable Diffusion
│   └── zeroscope_v2_576w/         # Zeroscope video
├── tts/
│   └── model.onnx                 # Text-to-speech
├── whisper/
│   └── model.bin                  # Speech-to-text
└── vision/
    └── blip2/                     # Vision model
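
As a quick sanity check before starting the service, a small script along these lines can confirm the models are in place; the helper is hypothetical, and the path list simply mirrors the layout above and the example config.csv, so adjust it to your own setup.

# Hypothetical pre-flight check: confirm the model paths from the layout
# above exist before starting botmodels. Adjust to match your config.csv.
from pathlib import Path

expected = [
    "data/diffusion/sd_turbo_f16.gguf",   # Stable Diffusion (image-generator-model)
    "data/diffusion/zeroscope_v2_576w",   # Zeroscope (video-generator-model)
    "data/tts/model.onnx",                # text-to-speech
    "data/whisper/model.bin",             # speech-to-text
    "data/vision/blip2",                  # vision model
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("Missing model paths:")
    for p in missing:
        print("  " + p)
else:
    print("All expected model paths are present.")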

GPU Acceleration

Both image and video generation benefit significantly from GPU acceleration. Configure GPU layers based on your hardware:

GPU VRAM   Recommended GPU Layers
4GB        8-12
8GB        15-20
12GB+      25-35

Lower GPU layers if you experience out-of-memory errors.

Troubleshooting

“BotModels is not enabled”

Set botmodels-enabled=true in your config.csv.

Connection refused

Verify botmodels service is running and check host/port configuration. Test connectivity:

curl http://localhost:8085/api/health

Authentication failed

Ensure botmodels-api-key in config.csv matches the API_KEY environment variable in botmodels.

Model not found

Verify model paths are correct and models are downloaded to the expected locations.

Out of memory

Reduce gpu-layers or batch-size. Video generation is particularly memory-intensive.

Security Considerations

Use HTTPS in production. Set botmodels-https=true and configure SSL certificates on the botmodels service.

Use strong API keys. Generate cryptographically random keys for the botmodels-api-key setting.
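
For example, Python's standard library can produce a suitable value; copying the output into botmodels-api-key and into the API_KEY environment variable used by the botmodels service is one way to keep the two sides in sync (this wiring is an assumption about your deployment).

# One way to generate a cryptographically random key for botmodels-api-key.
# secrets.token_urlsafe() is cryptographically secure; 32 bytes of entropy
# is a reasonable size.
import secrets

print(secrets.token_urlsafe(32))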

Restrict network access. Limit botmodels service access to trusted hosts only.

Consider GPU isolation. Run botmodels on a dedicated GPU server if sharing resources with other services.

Performance Tips

Image generation runs fastest with SD Turbo models and 4-8 inference steps. More steps improve quality but increase generation time linearly.

Video generation is the most resource-intensive operation. Keep frame counts low (24-48) for reasonable response times.

Batch processing improves throughput when generating multiple items. Increase batch-size if you have sufficient GPU memory.

Cache generated content when appropriate. If multiple users request similar content, consider storing and reusing results.
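
A minimal caching sketch, assuming generated files can be copied into a local cache directory keyed by a hash of the prompt; the directory name, file extension, and keying scheme are illustrative, not part of General Bots.

# Illustrative prompt-keyed cache: store each generated file under a hash of
# its prompt and reuse it when the same prompt comes in again.
import hashlib
import shutil
from pathlib import Path

CACHE_DIR = Path("cache/images")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def cached_image(prompt, generate):
    """Return a cached file for this prompt, generating it once if needed."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    target = CACHE_DIR / (key + ".png")
    if not target.exists():
        generated = generate(prompt)      # e.g. a call that returns a file path
        shutil.copy(generated, target)
    return target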

See Also