Chapter 03: Knowledge Base System

Vector search and semantic retrieval for intelligent document querying.

Overview

The Knowledge Base (gbkb) transforms documents into searchable semantic representations, enabling natural language queries against your organization’s content.

Architecture

[Figure: KB Architecture Pipeline]

The pipeline processes documents through extraction, chunking, embedding, and storage to enable semantic search.

Supported Formats

Format    | Features
PDF       | Text, OCR, tables
DOCX      | Formatted text, styles
HTML      | DOM parsing
Markdown  | GFM, tables, code
CSV/JSON  | Structured data
TXT       | Plain text

Quick Start

' Activate knowledge base
USE KB "company-docs"

' Bot now answers from your documents
TALK "How can I help you?"

Key Concepts

Document Processing

  1. Extract - Pull text from files
  2. Chunk - Split into ~500 token segments
  3. Embed - Generate vectors (BGE model)
  4. Store - Save to Qdrant
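
The following is a minimal sketch of these four steps outside gbkb, assuming the BGE model via the sentence-transformers library and an in-memory Qdrant instance. The file name, collection name, and whitespace chunker are illustrative assumptions, not the product's internal code.

# Illustrative indexing sketch (not gbkb internals): chunk, embed with BGE, store in Qdrant
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

def chunk(text: str, max_tokens: int = 500) -> list[str]:
    # Naive whitespace split; a real pipeline would use a proper tokenizer
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dimensional embeddings
client = QdrantClient(":memory:")                      # in-memory store, just for the sketch

client.create_collection(
    collection_name="company-docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

text = open("handbook.txt").read()                     # 1. Extract (plain text here)
segments = chunk(text)                                 # 2. Chunk into ~500-token pieces
vectors = model.encode(segments)                       # 3. Embed each chunk with BGE
client.upsert(                                         # 4. Store vectors plus text payloads
    collection_name="company-docs",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": seg})
        for i, (vec, seg) in enumerate(zip(vectors, segments))
    ],
)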

Semantic Search

  • The query is converted to a vector embedding
  • Cosine similarity finds the most relevant chunks
  • Top results are injected into the LLM context
  • No explicit search code is needed (see the retrieval sketch below)
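
Continuing the same sketch, query-time retrieval embeds the question with the same model and asks Qdrant for the nearest chunks by cosine similarity; the query string and prompt template here are illustrative assumptions.

# Illustrative retrieval sketch: embed the query, fetch top-k chunks, build the LLM context
query = "How many vacation days do employees get?"
query_vector = model.encode(query).tolist()

hits = client.search(                                  # nearest chunks by cosine similarity
    collection_name="company-docs",
    query_vector=query_vector,
    limit=10,                                          # mirrors the rag-top-k setting
)

context = "\n\n".join(hit.payload["text"] for hit in hits)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
# The prompt then goes to the LLM; with USE KB, gbkb performs these steps automatically.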

Storage Requirements

Vector databases need roughly 3.5x the original document size:

  • Embeddings: ~2x
  • Indexes: ~1x
  • Metadata: ~0.5x
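
For example, a 1 GB corpus would need roughly 3.5 GB of vector storage. A small sketch applying these rule-of-thumb multipliers (the helper name is hypothetical):

# Rough storage estimate using the ~3.5x rule of thumb above
def kb_storage_estimate(corpus_mb: float) -> dict[str, float]:
    return {
        "embeddings_mb": corpus_mb * 2.0,
        "indexes_mb": corpus_mb * 1.0,
        "metadata_mb": corpus_mb * 0.5,
        "total_mb": corpus_mb * 3.5,
    }

print(kb_storage_estimate(1024))  # a 1 GB corpus -> roughly 3.5 GB of vector storage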

Configuration

name,value
embedding-url,http://localhost:8082
embedding-model,bge-small-en-v1.5
rag-hybrid-enabled,true
rag-top-k,10
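
In this example, embedding-url points the bot at a local embedding service, embedding-model selects the BGE model used to generate vectors, and rag-top-k caps how many retrieved chunks are injected into the LLM context; rag-hybrid-enabled presumably toggles hybrid retrieval that combines keyword and vector matching.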
