Document Processing API

BotServer provides RESTful endpoints for processing documents, extracting their content, and analyzing it. Supported inputs include PDFs, Office documents, and images.

Overview

The Document Processing API enables:

  • Text extraction from documents
  • OCR for scanned documents
  • Metadata extraction
  • Document conversion
  • Content analysis and summarization

Base URL

http://localhost:8080/api/v1/documents

Authentication

Every request to the Document Processing API must include a bearer token in the Authorization header:

Authorization: Bearer <token>
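A minimal sketch of attaching the token to a request, using only the Python standard library. The helper name build_request and the token123 placeholder are illustrative assumptions; nothing is sent until the request object is passed to urllib.request.urlopen.

```python
import urllib.request

BASE_URL = "http://localhost:8080/api/v1/documents"  # base URL from this page


def build_request(path: str, token: str, method: str = "GET") -> urllib.request.Request:
    """Return an unsent request for BASE_URL + path that carries the bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


# Build (but do not send) an authenticated upload request.
req = build_request("/upload", "token123", method="POST")
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```

Keeping header construction in one place avoids repeating the token in every call site.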

Endpoints

Upload Document

POST /upload

Upload a document for processing.

Request:

  • Method: POST
  • Content-Type: multipart/form-data

Form Data:

  • file - The document file
  • process_options - JSON string of processing options

Example Request:

curl -X POST \
  -H "Authorization: Bearer token123" \
  -F "file=@document.pdf" \
  -F 'process_options={"extract_text":true,"extract_metadata":true}' \
  http://localhost:8080/api/v1/documents/upload

Response:

{
  "document_id": "doc_abc123",
  "filename": "document.pdf",
  "size_bytes": 2048576,
  "mime_type": "application/pdf",
  "status": "processing",
  "uploaded_at": "2024-01-15T10:00:00Z"
}
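The process_options form field is a JSON document sent as a string. A small sketch of building that string in Python so the JSON is always well formed; the helper name process_options_field is an illustrative assumption, and only the two option keys shown in the curl example above are taken from this page.

```python
import json


def process_options_field(**options: bool) -> str:
    """Serialize processing options into the string sent in the form field."""
    # Compact separators reproduce the space-free form used in the curl example.
    return json.dumps(options, separators=(",", ":"))


field = process_options_field(extract_text=True, extract_metadata=True)
print(field)  # {"extract_text":true,"extract_metadata":true}
```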

Process Document

POST /process

Process an already uploaded document.

Request Body:

{
  "document_id": "doc_abc123",
  "operations": [
    "extract_text",
    "extract_metadata",
    "generate_summary",
    "extract_entities"
  ],
  "options": {
    "language": "en",
    "ocr_enabled": true,
    "chunk_size": 1000
  }
}

Response:

{
  "document_id": "doc_abc123",
  "process_id": "prc_xyz789",
  "status": "processing",
  "estimated_completion": "2024-01-15T10:02:00Z"
}

Get Processing Status

GET /process/{process_id}/status

Check the status of document processing.

Response:

{
  "process_id": "prc_xyz789",
  "document_id": "doc_abc123",
  "status": "completed",
  "progress": 100,
  "completed_at": "2024-01-15T10:01:30Z",
  "results_available": true
}
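Because processing is asynchronous, a client typically polls this endpoint until the status changes. The sketch below assumes that any status other than "processing" is final (this page only shows "processing" and "completed"); the function name, the polling interval, and the attempt limit are illustrative choices. The status lookup is passed in as a callable so the loop can be exercised without a server.

```python
import time
from typing import Callable


def wait_for_completion(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Poll until the status is no longer "processing" or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status.get("status") != "processing":
            return status
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError("processing did not finish within the polling budget")


# Stand-in for two successive status responses.
responses = iter([
    {"status": "processing", "progress": 40},
    {"status": "completed", "progress": 100, "results_available": True},
])
final = wait_for_completion(lambda: next(responses), interval_seconds=0)
print(final["status"])  # completed
```

In real use, fetch_status would issue GET /process/{process_id}/status with the Authorization header and return the decoded JSON body.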

Get Extracted Text

GET /{document_id}/text

Retrieve extracted text from a processed document.

Query Parameters:

  • page - Specific page number (optional)
  • format - Output format: plain, markdown, html

Response:

{
  "document_id": "doc_abc123",
  "text": "This is the extracted text from the document...",
  "pages": 10,
  "word_count": 5420,
  "language": "en"
}
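A sketch of composing the request URL with the optional query parameters. The URL shape follows the Python example at the end of this page (base URL, then the document ID, then /text); the helper name text_url and the client-side check on format are assumptions.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/api/v1/documents"
ALLOWED_FORMATS = {"plain", "markdown", "html"}  # values listed for "format"


def text_url(document_id: str, page: Optional[int] = None, fmt: Optional[str] = None) -> str:
    """Build the extracted-text URL, adding only the parameters that were given."""
    params = {}
    if page is not None:
        params["page"] = page
    if fmt is not None:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {fmt}")
        params["format"] = fmt
    query = urlencode(params)
    return f"{BASE_URL}/{document_id}/text" + (f"?{query}" if query else "")


print(text_url("doc_abc123", page=3, fmt="markdown"))
```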

Get Document Metadata

GET /{document_id}/metadata

Retrieve metadata from a document.

Response:

{
  "document_id": "doc_abc123",
  "metadata": {
    "title": "Annual Report 2024",
    "author": "John Doe",
    "created_date": "2024-01-10T08:00:00Z",
    "modified_date": "2024-01-14T16:30:00Z",
    "pages": 10,
    "producer": "Microsoft Word",
    "keywords": ["annual", "report", "finance"],
    "custom_properties": {
      "department": "Finance",
      "confidentiality": "Internal"
    }
  }
}

Generate Summary

POST /{document_id}/summarize

Generate an AI summary of the document.

Request Body:

{
  "type": "abstractive",
  "length": "medium",
  "focus_areas": ["key_points", "conclusions"],
  "language": "en"
}

Response:

{
  "document_id": "doc_abc123",
  "summary": "This document discusses the annual financial performance...",
  "key_points": [
    "Revenue increased by 15%",
    "New market expansion successful",
    "Operating costs reduced"
  ],
  "summary_length": 250
}

Extract Entities

POST /{document_id}/entities

Extract named entities from the document.

Request Body:

{
  "entity_types": ["person", "organization", "location", "date", "money"],
  "confidence_threshold": 0.7
}

Response:

{
  "document_id": "doc_abc123",
  "entities": [
    {
      "text": "John Smith",
      "type": "person",
      "confidence": 0.95,
      "occurrences": 5
    },
    {
      "text": "New York",
      "type": "location",
      "confidence": 0.88,
      "occurrences": 3
    },
    {
      "text": "$1.5 million",
      "type": "money",
      "confidence": 0.92,
      "occurrences": 2
    }
  ]
}
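Once the response is decoded, the entities list can be filtered and ranked on the client. A minimal sketch using the sample response above; the function name select_entities is an illustrative assumption.

```python
from typing import List, Optional


def select_entities(
    response: dict,
    entity_type: Optional[str] = None,
    min_confidence: float = 0.0,
) -> List[dict]:
    """Return matching entities, highest confidence first."""
    chosen = [
        entity
        for entity in response.get("entities", [])
        if entity["confidence"] >= min_confidence
        and (entity_type is None or entity["type"] == entity_type)
    ]
    return sorted(chosen, key=lambda entity: entity["confidence"], reverse=True)


sample = {
    "document_id": "doc_abc123",
    "entities": [
        {"text": "John Smith", "type": "person", "confidence": 0.95, "occurrences": 5},
        {"text": "New York", "type": "location", "confidence": 0.88, "occurrences": 3},
        {"text": "$1.5 million", "type": "money", "confidence": 0.92, "occurrences": 2},
    ],
}
confident = [e["text"] for e in select_entities(sample, min_confidence=0.9)]
print(confident)  # ['John Smith', '$1.5 million']
```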

Convert Document

POST /{document_id}/convert

Convert document to another format.

Request Body:

{
  "target_format": "pdf",
  "options": {
    "compress": true,
    "quality": "high",
    "page_size": "A4"
  }
}

Response:

{
  "document_id": "doc_abc123",
  "converted_id": "doc_def456",
  "original_format": "docx",
  "target_format": "pdf",
  "download_url": "/api/v1/documents/doc_def456/download"
}

Search Within Document

POST /{document_id}/search

Search for text within a document.

Request Body:

{
  "query": "revenue growth",
  "case_sensitive": false,
  "whole_words": false,
  "regex": false
}

Response:

{
  "document_id": "doc_abc123",
  "matches": [
    {
      "page": 3,
      "line": 15,
      "context": "...the company achieved significant revenue growth in Q4...",
      "position": 1247
    },
    {
      "page": 7,
      "line": 8,
      "context": "...projecting continued revenue growth for next year...",
      "position": 3892
    }
  ],
  "total_matches": 2
}

Split Document

POST /{document_id}/split

Split a document into multiple parts.

Request Body:

{
  "method": "by_pages",
  "pages_per_split": 5
}

Response:

{
  "document_id": "doc_abc123",
  "parts": [
    {
      "part_id": "part_001",
      "pages": "1-5",
      "download_url": "/api/v1/documents/part_001/download"
    },
    {
      "part_id": "part_002",
      "pages": "6-10",
      "download_url": "/api/v1/documents/part_002/download"
    }
  ],
  "total_parts": 2
}
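The page ranges in the response follow directly from the page count and pages_per_split. The sketch below reproduces that arithmetic locally; it illustrates the by_pages method and is not the server's implementation. How the server labels a final part that holds a single page is not documented here, so the "11-11" style is an assumption.

```python
from typing import List


def page_ranges(total_pages: int, pages_per_split: int) -> List[str]:
    """Return inclusive "start-end" page ranges for a by_pages split."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    ranges = []
    for start in range(1, total_pages + 1, pages_per_split):
        # The last part may be shorter than pages_per_split.
        end = min(start + pages_per_split - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


print(page_ranges(10, 5))  # ['1-5', '6-10']
print(page_ranges(12, 5))  # ['1-5', '6-10', '11-12']
```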

Merge Documents

POST /merge

Merge multiple documents into one.

Request Body:

{
  "document_ids": ["doc_abc123", "doc_def456", "doc_ghi789"],
  "output_format": "pdf",
  "preserve_metadata": true
}

Response:

{
  "merged_document_id": "doc_merged_xyz",
  "source_count": 3,
  "total_pages": 30,
  "download_url": "/api/v1/documents/doc_merged_xyz/download"
}

Supported Formats

Input Formats

  • Documents: PDF, DOCX, DOC, ODT, RTF, TXT
  • Spreadsheets: XLSX, XLS, ODS, CSV
  • Presentations: PPTX, PPT, ODP
  • Images: PNG, JPG, JPEG, GIF, BMP, TIFF
  • Web: HTML, XML, MARKDOWN
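A client can check a file name against this list before uploading, which avoids a round trip that would end in an unsupported_format error. This is a sketch: the set mirrors the format names above, while the extension aliases (md, htm, tif) are assumptions that this page does not confirm.

```python
from pathlib import Path

# Format names as listed above, lowercased.
INPUT_FORMATS = {
    "pdf", "docx", "doc", "odt", "rtf", "txt",
    "xlsx", "xls", "ods", "csv",
    "pptx", "ppt", "odp",
    "png", "jpg", "jpeg", "gif", "bmp", "tiff",
    "html", "xml", "markdown",
}

# Assumption: common file extensions that map onto a listed format name.
EXTENSION_ALIASES = {"md": "markdown", "htm": "html", "tif": "tiff"}


def is_supported_input(filename: str) -> bool:
    """Return True when the file extension matches a listed input format."""
    ext = Path(filename).suffix.lstrip(".").lower()
    return EXTENSION_ALIASES.get(ext, ext) in INPUT_FORMATS


print(is_supported_input("report.PDF"))   # True
print(is_supported_input("archive.xyz"))  # False
```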

Output Formats

  • PDF
  • Plain Text
  • Markdown
  • HTML
  • JSON
  • CSV (for tabular data)

Processing Options

OCR Options

{
  "ocr_enabled": true,
  "ocr_language": "eng",
  "ocr_engine": "tesseract",
  "preprocessing": {
    "deskew": true,
    "remove_noise": true,
    "enhance_contrast": true
  }
}

Text Extraction Options

{
  "preserve_formatting": false,
  "extract_tables": true,
  "extract_images": false,
  "chunk_text": true,
  "chunk_size": 1000,
  "chunk_overlap": 100
}
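To make chunk_size and chunk_overlap concrete, the sketch below splits a string into overlapping windows. It assumes both values are measured in characters, which this page does not specify (the server may count tokens or words), and it illustrates the idea rather than the server's algorithm.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With the defaults shown above (1000 and 100), each chunk repeats the last 100 characters of the previous one, so a sentence cut at a boundary still appears whole in one of the two chunks.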

Summary Options

{
  "summary_type": "extractive",
  "summary_length": "medium",
  "bullet_points": true,
  "include_keywords": true,
  "max_sentences": 5
}

Batch Processing

Submit Batch

POST /batch/process

Process multiple documents in batch.

Request Body:

{
  "documents": [
    {
      "document_id": "doc_001",
      "operations": ["extract_text", "generate_summary"]
    },
    {
      "document_id": "doc_002",
      "operations": ["extract_entities"]
    }
  ],
  "notify_on_completion": true,
  "webhook_url": "https://example.com/webhook"
}

Get Batch Status

GET /batch/{batch_id}/status

Check batch processing status.

Response:

{
  "batch_id": "batch_abc123",
  "total_documents": 10,
  "processed": 7,
  "failed": 1,
  "pending": 2,
  "completion_percentage": 70
}
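In the sample above, completion_percentage equals processed divided by total_documents (7 of 10, or 70), and the three counters add up to the total. The sketch below derives the same figures on the client. It assumes that failed documents are not counted in processed, which the sample suggests but this page does not state; the function and key names are illustrative.

```python
def batch_progress(status: dict) -> dict:
    """Derive progress figures from a batch status response."""
    total = status["total_documents"]
    settled = status["processed"] + status["failed"]  # no longer pending
    return {
        "percent_processed": round(100 * status["processed"] / total) if total else 0,
        "percent_settled": round(100 * settled / total) if total else 0,
        "finished": status["pending"] == 0,
    }


sample_status = {
    "batch_id": "batch_abc123",
    "total_documents": 10,
    "processed": 7,
    "failed": 1,
    "pending": 2,
    "completion_percentage": 70,
}
progress = batch_progress(sample_status)
print(progress)
```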

Error Responses

400 Bad Request

{
  "error": "unsupported_format",
  "message": "File format .xyz is not supported",
  "supported_formats": ["pdf", "docx", "txt"]
}

413 Payload Too Large

{
  "error": "file_too_large",
  "message": "File size exceeds maximum limit",
  "max_size_bytes": 52428800,
  "provided_size_bytes": 104857600
}

422 Unprocessable Entity

{
  "error": "corrupted_file",
  "message": "The document appears to be corrupted and cannot be processed"
}
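All three error bodies share the error and message fields, with extra fields that depend on the error. A sketch of turning such a body into a Python exception; the class name DocumentApiError and the function parse_error are hypothetical, not part of BotServer.

```python
import json


class DocumentApiError(Exception):
    """Carries the fields of an error body returned by the API."""

    def __init__(self, status_code: int, code: str, message: str, details: dict):
        super().__init__(f"{status_code} {code}: {message}")
        self.status_code = status_code
        self.code = code
        self.details = details  # error-specific fields, e.g. max_size_bytes


def parse_error(status_code: int, body: str) -> DocumentApiError:
    """Build an exception from an HTTP status code and a JSON error body."""
    payload = json.loads(body)
    details = {k: v for k, v in payload.items() if k not in ("error", "message")}
    return DocumentApiError(
        status_code,
        payload.get("error", "unknown_error"),
        payload.get("message", ""),
        details,
    )


body = (
    '{"error": "file_too_large", "message": "File size exceeds maximum limit", '
    '"max_size_bytes": 52428800, "provided_size_bytes": 104857600}'
)
err = parse_error(413, body)
print(err)  # 413 file_too_large: File size exceeds maximum limit
```

The max_size_bytes value in the sample, 52428800, is 50 MiB, so a client can compare a file's size against it before uploading.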

Webhooks

Configure a webhook URL to receive processing notifications. An example notification payload:

{
  "event": "document.processed",
  "document_id": "doc_abc123",
  "status": "completed",
  "results": {
    "text_extracted": true,
    "summary_generated": true,
    "entities_extracted": true
  }
}
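A receiving service needs to decode the payload and decide which results to fetch. The sketch below handles only the document.processed event shown above and ignores anything else; the function name handle_notification is an assumption, and the HTTP server that would call it is left out.

```python
import json
from typing import List


def handle_notification(raw_body: bytes) -> List[str]:
    """Return the names of the results flagged true in a notification."""
    payload = json.loads(raw_body)
    if payload.get("event") != "document.processed":
        return []  # other event types are not described on this page
    results = payload.get("results", {})
    return sorted(name for name, ok in results.items() if ok)


raw = json.dumps({
    "event": "document.processed",
    "document_id": "doc_abc123",
    "status": "completed",
    "results": {
        "text_extracted": True,
        "summary_generated": True,
        "entities_extracted": False,
    },
}).encode("utf-8")
ready = handle_notification(raw)
print(ready)  # ['summary_generated', 'text_extracted']
```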

Rate Limits

Operation           Limit       Window
Upload Document     50/hour     Per user
Process Document    100/hour    Per user
Generate Summary    20/hour     Per user
Batch Processing    5/hour      Per user
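A client can stay under these limits by counting its own calls. The sketch below is a sliding-window limiter. Whether the server uses a sliding or a fixed hourly window is not stated on this page, so treat this as a conservative client-side guard; the class name is illustrative and the clock is injectable so the logic can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True, or return False if the limit is reached."""
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) >= self._limit:
            return False
        self._stamps.append(now)
        return True


# One limiter per operation, using the limits from the table above.
summary_limiter = SlidingWindowLimiter(limit=20)
print(summary_limiter.try_acquire())  # True
```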

Best Practices

  1. Preprocess Documents: Clean scanned documents before OCR
  2. Use Appropriate Formats: Choose the right output format for your use case
  3. Batch Similar Documents: Process similar documents together for efficiency
  4. Handle Large Files: Use chunking for large documents
  5. Cache Results: Store processed results to avoid reprocessing
  6. Monitor Processing: Use webhooks for long-running operations

Integration Examples

Python Example

import requests

BASE_URL = 'http://localhost:8080/api/v1/documents'
HEADERS = {'Authorization': 'Bearer token123'}

# Upload the document and request text extraction
with open('document.pdf', 'rb') as f:
    response = requests.post(
        f'{BASE_URL}/upload',
        headers=HEADERS,
        files={'file': f},
        data={'process_options': '{"extract_text": true}'}
    )
response.raise_for_status()  # stop on 4xx/5xx (see Error Responses)

document_id = response.json()['document_id']

# Get extracted text. The upload response reports status "processing",
# so the text may not be available until processing has finished.
text_response = requests.get(
    f'{BASE_URL}/{document_id}/text',
    headers=HEADERS
)
text_response.raise_for_status()

print(text_response.json()['text'])