Document Indexing

Documents in .gbkb folders are indexed automatically. No manual configuration required.

Automatic Triggers

Indexing occurs when:

  • Files are added to a .gbkb folder
  • Existing files are modified
  • USE KB is called for a collection
  • USE WEBSITE registers URLs for crawling

Processing Pipeline

Document → Extract Text → Chunk → Embed → Store in Qdrant

Stage    Description
Extract  Pull text from PDF, DOCX, HTML, MD, TXT, CSV
Chunk    Split into ~500 token segments with 50 token overlap
Embed    Generate vectors using BGE model
Store    Save to Qdrant with metadata
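
The Chunk stage can be illustrated with a minimal Python sketch of overlapping windows. It is not the product's implementation: the function name chunk_tokens is hypothetical, and whitespace splitting stands in for whatever tokenizer the pipeline actually uses.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens.

    Each window repeats the last `overlap` tokens of the previous one.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Assumption: whitespace-separated words approximate tokens.
tokens = ("word " * 1200).split()
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.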

Supported File Types

Format    Notes
PDF       Full text extraction, OCR for scanned docs
DOCX      Microsoft Word documents
TXT/MD    Plain text and Markdown
HTML      Web pages (text only)
CSV/JSON  Structured data

Website Indexing

Schedule regular crawls for web content:

SET SCHEDULE "0 2 * * *"  ' Daily at 2 AM
USE WEBSITE "https://docs.example.com"

Schedule Examples

Pattern         Frequency
"0 * * * *"     Hourly
"*/30 * * * *"  Every 30 minutes
"0 0 * * 0"     Weekly (Sunday)
"0 0 1 * *"     Monthly (1st)

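The patterns above are consistent with the standard five-field cron layout (minute, hour, day of month, month, day of week). Assuming SET SCHEDULE follows that layout, the short Python sketch below labels each field of a pattern. The helper name split_cron is hypothetical and is not part of the product.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(pattern):
    """Label the five whitespace-separated fields of a cron pattern."""
    parts = pattern.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# "0 2 * * *": minute 0 of hour 2, on every day, month, and weekday.
print(split_cron("0 2 * * *"))
```

Reading a pattern field by field this way makes it easier to confirm that a schedule means what its comment says before it is deployed.
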
Configuration

In config.csv:

name,value
embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf
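
As a quick sanity check, the name,value layout can be read with Python's standard csv module. This is a sketch for inspecting the file, not how the platform itself loads it. The helper name load_config is hypothetical.

```python
import csv
import io

# Sample content copied from the config.csv shown above.
SAMPLE = """name,value
embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf
"""


def load_config(text):
    """Parse name,value rows into a dictionary keyed by setting name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["name"]: row["value"] for row in reader}


config = load_config(SAMPLE)
print(config["embedding-url"])  # http://localhost:8082
```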

Using Indexed Content

USE KB "documentation"
' All documents now searchable
' LLM uses this knowledge automatically

Troubleshooting

Issue                Solution
Documents not found  Check that the file is in a .gbkb folder and verify that USE KB was called
Slow indexing        Large PDFs take time to process; consider splitting them into smaller documents
Outdated content     Set up scheduled crawls for web content
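
For the first issue, a small Python sketch can confirm that a file path sits inside a .gbkb folder. It assumes that a ".gbkb folder" means any directory whose name ends in .gbkb. The function name and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def in_gbkb_folder(path):
    """Return True if any parent directory name ends with '.gbkb'."""
    return any(
        parent.name.endswith(".gbkb")
        for parent in PurePosixPath(path).parents
    )


# Hypothetical paths, for illustration only.
print(in_gbkb_folder("mybot.gbkb/docs/guide.pdf"))  # True
print(in_gbkb_folder("mybot.gbdialog/start.bas"))   # False
```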

See Also