# Chapter 03: Knowledge Base System
Vector search and semantic retrieval for intelligent document querying.
## Overview
The Knowledge Base (gbkb) transforms documents into searchable semantic representations, enabling natural language queries against your organization’s content.
## Architecture
The pipeline processes documents through extraction, chunking, embedding, and storage to enable semantic search.
## Supported Formats

| Format | Features |
|---|---|
| PDF | Text, OCR, tables |
| DOCX | Formatted text, styles |
| HTML | DOM parsing |
| Markdown | GFM, tables, code |
| CSV/JSON | Structured data |
| TXT | Plain text |
## Quick Start

```basic
' Activate knowledge base
USE KB "company-docs"

' Bot now answers from your documents
TALK "How can I help you?"
```
## Key Concepts

### Document Processing

- **Extract** - Pull text from files
- **Chunk** - Split into ~500 token segments
- **Embed** - Generate vectors (BGE model)
- **Store** - Save to Qdrant
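The chunking step can be sketched in a few lines. This is an illustrative approximation, not the gbkb implementation: token counts are approximated by whitespace-separated words, and the 50-word overlap between consecutive chunks is an assumption added to keep context across chunk boundaries.

```python
# Illustrative chunking sketch: split extracted text into ~500-token
# segments, approximating tokens with whitespace-separated words.
# The overlap parameter is an assumption, not part of the gbkb spec.
def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = "word " * 1200            # stand-in for extracted document text
segments = chunk_text(doc)
print(len(segments))            # 1200 words at a 450-word stride -> 3 chunks
```

Each resulting segment is then passed to the embedding model and stored with its source metadata.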
### Semantic Search
- Query converted to vector embedding
- Cosine similarity finds relevant chunks
- Top results injected into LLM context
- No explicit search code needed
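The ranking step above can be shown with a minimal sketch. The 3-dimensional vectors and chunk texts here are toy placeholders (real embeddings from bge-small-en-v1.5 have 384 dimensions, and the search runs inside Qdrant, not in application code):

```python
import math

# Sketch of retrieval: rank stored chunk vectors by cosine similarity
# to the query vector and keep the top-k.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, chunks, k=2):
    scored = [(cosine(query, vec), text) for text, vec in chunks]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

chunks = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("office hours",  [0.0, 1.0, 0.2]),
    ("returns FAQ",   [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], chunks))  # ['refund policy', 'returns FAQ']
```

The texts of the top-ranked chunks are what gets injected into the LLM context before the model answers.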
## Storage Requirements
Vector databases need ~3.5x original document size:
- Embeddings: ~2x
- Indexes: ~1x
- Metadata: ~0.5x
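As a quick capacity check, the three ratios above can be summed into a single multiplier (a rough planning heuristic, not a guarantee; actual usage depends on chunk size, vector dimensions, and index settings):

```python
# Rough storage estimate from the ratios above:
# embeddings ~2x, indexes ~1x, metadata ~0.5x the original corpus size.
def estimated_storage_mb(original_mb: float) -> float:
    return original_mb * (2.0 + 1.0 + 0.5)

print(estimated_storage_mb(100))  # a 100 MB corpus needs roughly 350 MB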
## Configuration

```csv
name,value
embedding-url,http://localhost:8082
embedding-model,bge-small-en-v1.5
rag-hybrid-enabled,true
rag-top-k,10
```
## Chapter Contents
- KB and Tools System - Integration patterns
- Vector Collections - Collection management
- Document Indexing - Processing pipeline
- Semantic Search - Search mechanics
- Episodic Memory - Conversation history and context management
- Semantic Caching - Performance optimization
## See Also
- .gbkb Package - Folder structure
- USE KB Keyword - Keyword reference
- Hybrid Search - RAG 2.0