Chapter 03: Knowledge Base System

Vector search and semantic retrieval for intelligent document querying.

Overview

The Knowledge Base (gbkb) transforms documents into searchable semantic representations, enabling natural language queries against your organization’s content.

Architecture

[Figure: KB Architecture Pipeline]

The pipeline processes documents through extraction, chunking, embedding, and storage to enable semantic search.

Supported Formats

Format    | Features
PDF       | Text, OCR, tables
DOCX      | Formatted text, styles
HTML      | DOM parsing
Markdown  | GFM, tables, code
CSV/JSON  | Structured data
TXT       | Plain text

Quick Start

' Activate knowledge base
USE KB "company-docs"

' Bot now answers from your documents
TALK "How can I help you?"

Key Concepts

Document Processing

  1. Extract - Pull text from files
  2. Chunk - Split into ~500 token segments
  3. Embed - Generate vectors (BGE model)
  4. Store - Save to Qdrant
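
The following is a minimal sketch of these four steps outside gbkb, assuming the BGE model via the sentence-transformers library and an in-memory Qdrant instance. The file name, collection name, and whitespace chunker are illustrative assumptions, not the product's internal code.

# Illustrative indexing sketch (not gbkb internals): chunk, embed with BGE, store in Qdrant
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

def chunk(text: str, max_tokens: int = 500) -> list[str]:
    # Naive whitespace split; a real pipeline would use a proper tokenizer
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dimensional embeddings
client = QdrantClient(":memory:")                      # in-memory store, just for the sketch

client.create_collection(
    collection_name="company-docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

text = open("handbook.txt").read()                     # 1. Extract (plain text here)
segments = chunk(text)                                 # 2. Chunk into ~500-token pieces
vectors = model.encode(segments)                       # 3. Embed each chunk with BGE
client.upsert(                                         # 4. Store vectors plus text payloads
    collection_name="company-docs",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": seg})
        for i, (vec, seg) in enumerate(zip(vectors, segments))
    ],
)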

Semantic Search

  • The query is converted to a vector embedding
  • Cosine similarity finds the most relevant chunks
  • Top results are injected into the LLM context
  • No explicit search code is needed (see the retrieval sketch below)
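
Continuing the same sketch, query-time retrieval embeds the question with the same model and asks Qdrant for the nearest chunks by cosine similarity; the query string and prompt template here are illustrative assumptions.

# Illustrative retrieval sketch: embed the query, fetch top-k chunks, build the LLM context
query = "How many vacation days do employees get?"
query_vector = model.encode(query).tolist()

hits = client.search(                                  # nearest chunks by cosine similarity
    collection_name="company-docs",
    query_vector=query_vector,
    limit=10,                                          # mirrors the rag-top-k setting
)

context = "\n\n".join(hit.payload["text"] for hit in hits)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
# The prompt then goes to the LLM; with USE KB, gbkb performs these steps automatically.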

Storage Requirements

Vector databases need roughly 3.5x the original document size:

  • Embeddings: ~2x
  • Indexes: ~1x
  • Metadata: ~0.5x
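
For example, a 1 GB corpus would need roughly 3.5 GB of vector storage. A small sketch applying these rule-of-thumb multipliers (the helper name is hypothetical):

# Rough storage estimate using the ~3.5x rule of thumb above
def kb_storage_estimate(corpus_mb: float) -> dict[str, float]:
    return {
        "embeddings_mb": corpus_mb * 2.0,
        "indexes_mb": corpus_mb * 1.0,
        "metadata_mb": corpus_mb * 0.5,
        "total_mb": corpus_mb * 3.5,
    }

print(kb_storage_estimate(1024))  # a 1 GB corpus -> roughly 3.5 GB of vector storage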

Configuration

name,value
embedding-url,http://localhost:8082
embedding-model,bge-small-en-v1.5
rag-hybrid-enabled,true
rag-top-k,10
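
In this example, embedding-url points the bot at a local embedding service, embedding-model selects the BGE model used to generate vectors, and rag-top-k caps how many retrieved chunks are injected into the LLM context; rag-hybrid-enabled presumably toggles hybrid retrieval that combines keyword and vector matching.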
