Document Processing API
BotServer provides RESTful endpoints for processing, extracting, and analyzing various document formats including PDFs, Office documents, and images.
Overview
The Document Processing API enables:
- Text extraction from documents
- OCR for scanned documents
- Metadata extraction
- Document conversion
- Content analysis and summarization
Base URL
http://localhost:8080/api/v1/documents
Authentication
All Document Processing API requests require authentication:
Authorization: Bearer <token>
Endpoints
Upload Document
POST /upload
Upload a document for processing.
Request:
- Method: POST
- Content-Type: multipart/form-data
Form Data:
- file - The document file
- process_options - JSON string of processing options
Example Request:
curl -X POST \
-H "Authorization: Bearer token123" \
-F "file=@document.pdf" \
-F 'process_options={"extract_text":true,"extract_metadata":true}' \
http://localhost:8080/api/v1/documents/upload
Response:
{
"document_id": "doc_abc123",
"filename": "document.pdf",
"size_bytes": 2048576,
"mime_type": "application/pdf",
"status": "processing",
"uploaded_at": "2024-01-15T10:00:00Z"
}
Process Document
POST /process
Process an already uploaded document.
Request Body:
{
"document_id": "doc_abc123",
"operations": [
"extract_text",
"extract_metadata",
"generate_summary",
"extract_entities"
],
"options": {
"language": "en",
"ocr_enabled": true,
"chunk_size": 1000
}
}
Response:
{
"document_id": "doc_abc123",
"process_id": "prc_xyz789",
"status": "processing",
"estimated_completion": "2024-01-15T10:02:00Z"
}
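For reference, a minimal Python sketch of this call using the requests library (the token, document ID, and option values are placeholders):

import requests

payload = {
    "document_id": "doc_abc123",
    "operations": ["extract_text", "generate_summary"],
    "options": {"language": "en", "ocr_enabled": True}
}
response = requests.post(
    'http://localhost:8080/api/v1/documents/process',
    headers={'Authorization': 'Bearer token123'},
    json=payload
)
process_id = response.json()['process_id']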
Get Processing Status
GET /process/{process_id}/status
Check the status of document processing.
Response:
{
"process_id": "prc_xyz789",
"document_id": "doc_abc123",
"status": "completed",
"progress": 100,
"completed_at": "2024-01-15T10:01:30Z",
"results_available": true
}
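Processing is asynchronous, so clients typically poll this endpoint until the status reaches completed or failed. A minimal polling sketch (the interval is illustrative):

import time
import requests

headers = {'Authorization': 'Bearer token123'}
url = 'http://localhost:8080/api/v1/documents/process/prc_xyz789/status'

# Poll every 5 seconds until processing finishes or fails
while True:
    status = requests.get(url, headers=headers).json()
    if status['status'] in ('completed', 'failed'):
        break
    time.sleep(5)

print(status['status'], status['progress'])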
Get Extracted Text
GET /{document_id}/text
Retrieve extracted text from a processed document.
Query Parameters:
- page - Specific page number (optional)
- format - Output format: plain, markdown, html
Response:
{
"document_id": "doc_abc123",
"text": "This is the extracted text from the document...",
"pages": 10,
"word_count": 5420,
"language": "en"
}
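For example, to fetch a single page as Markdown (a sketch; the query parameter names follow the list above):

import requests

response = requests.get(
    'http://localhost:8080/api/v1/documents/doc_abc123/text',
    headers={'Authorization': 'Bearer token123'},
    params={'page': 3, 'format': 'markdown'}
)
print(response.json()['text'])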
Get Document Metadata
GET /{document_id}/metadata
Retrieve metadata from a document.
Response:
{
"document_id": "doc_abc123",
"metadata": {
"title": "Annual Report 2024",
"author": "John Doe",
"created_date": "2024-01-10T08:00:00Z",
"modified_date": "2024-01-14T16:30:00Z",
"pages": 10,
"producer": "Microsoft Word",
"keywords": ["annual", "report", "finance"],
"custom_properties": {
"department": "Finance",
"confidentiality": "Internal"
}
}
}
Generate Summary
POST /{document_id}/summarize
Generate an AI summary of the document.
Request Body:
{
"type": "abstractive",
"length": "medium",
"focus_areas": ["key_points", "conclusions"],
"language": "en"
}
Response:
{
"document_id": "doc_abc123",
"summary": "This document discusses the annual financial performance...",
"key_points": [
"Revenue increased by 15%",
"New market expansion successful",
"Operating costs reduced"
],
"summary_length": 250
}
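A Python sketch of requesting a summary (the request fields mirror the body above):

import requests

response = requests.post(
    'http://localhost:8080/api/v1/documents/doc_abc123/summarize',
    headers={'Authorization': 'Bearer token123'},
    json={"type": "abstractive", "length": "medium", "language": "en"}
)
summary = response.json()
print(summary['summary'])
print(summary['key_points'])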
Extract Entities
POST /{document_id}/entities
Extract named entities from the document.
Request Body:
{
"entity_types": ["person", "organization", "location", "date", "money"],
"confidence_threshold": 0.7
}
Response:
{
"document_id": "doc_abc123",
"entities": [
{
"text": "John Smith",
"type": "person",
"confidence": 0.95,
"occurrences": 5
},
{
"text": "New York",
"type": "location",
"confidence": 0.88,
"occurrences": 3
},
{
"text": "$1.5 million",
"type": "money",
"confidence": 0.92,
"occurrences": 2
}
]
}
Convert Document
POST /{document_id}/convert
Convert a document to another format.
Request Body:
{
"target_format": "pdf",
"options": {
"compress": true,
"quality": "high",
"page_size": "A4"
}
}
Response:
{
"document_id": "doc_abc123",
"converted_id": "doc_def456",
"original_format": "docx",
"target_format": "pdf",
"download_url": "/api/v1/documents/doc_def456/download"
}
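The response includes a relative download_url. A sketch of requesting a conversion and then downloading the result (the output filename is illustrative):

import requests

base = 'http://localhost:8080'
headers = {'Authorization': 'Bearer token123'}

# Request conversion to PDF
conv = requests.post(
    base + '/api/v1/documents/doc_abc123/convert',
    headers=headers,
    json={"target_format": "pdf", "options": {"quality": "high"}}
).json()

# Download the converted document via the returned download_url
pdf = requests.get(base + conv['download_url'], headers=headers)
with open('converted.pdf', 'wb') as f:
    f.write(pdf.content)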
Search Within Document
POST /{document_id}/search
Search for text within a document.
Request Body:
{
"query": "revenue growth",
"case_sensitive": false,
"whole_words": false,
"regex": false
}
Response:
{
"document_id": "doc_abc123",
"matches": [
{
"page": 3,
"line": 15,
"context": "...the company achieved significant revenue growth in Q4...",
"position": 1247
},
{
"page": 7,
"line": 8,
"context": "...projecting continued revenue growth for next year...",
"position": 3892
}
],
"total_matches": 2
}
Split Document
POST /{document_id}/split
Split a document into multiple parts.
Request Body:
{
"method": "by_pages",
"pages_per_split": 5
}
Response:
{
"document_id": "doc_abc123",
"parts": [
{
"part_id": "part_001",
"pages": "1-5",
"download_url": "/api/v1/documents/part_001/download"
},
{
"part_id": "part_002",
"pages": "6-10",
"download_url": "/api/v1/documents/part_002/download"
}
],
"total_parts": 2
}
Merge Documents
POST /merge
Merge multiple documents into one.
Request Body:
{
"document_ids": ["doc_abc123", "doc_def456", "doc_ghi789"],
"output_format": "pdf",
"preserve_metadata": true
}
Response:
{
"merged_document_id": "doc_merged_xyz",
"source_count": 3,
"total_pages": 30,
"download_url": "/api/v1/documents/doc_merged_xyz/download"
}
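A sketch of merging several documents from Python (document IDs are placeholders):

import requests

response = requests.post(
    'http://localhost:8080/api/v1/documents/merge',
    headers={'Authorization': 'Bearer token123'},
    json={
        "document_ids": ["doc_abc123", "doc_def456", "doc_ghi789"],
        "output_format": "pdf",
        "preserve_metadata": True
    }
)
print(response.json()['merged_document_id'])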
Supported Formats
Input Formats
- Documents: PDF, DOCX, DOC, ODT, RTF, TXT
- Spreadsheets: XLSX, XLS, ODS, CSV
- Presentations: PPTX, PPT, ODP
- Images: PNG, JPG, JPEG, GIF, BMP, TIFF
- Web: HTML, XML, Markdown
Output Formats
- Plain Text
- Markdown
- HTML
- JSON
- CSV (for tabular data)
Processing Options
OCR Options
{
"ocr_enabled": true,
"ocr_language": "eng",
"ocr_engine": "tesseract",
"preprocessing": {
"deskew": true,
"remove_noise": true,
"enhance_contrast": true
}
}
Text Extraction Options
{
"preserve_formatting": false,
"extract_tables": true,
"extract_images": false,
"chunk_text": true,
"chunk_size": 1000,
"chunk_overlap": 100
}
Summary Options
{
"summary_type": "extractive",
"summary_length": "medium",
"bullet_points": true,
"include_keywords": true,
"max_sentences": 5
}
Batch Processing
Submit Batch
POST /batch/process
Process multiple documents in batch.
Request Body:
{
"documents": [
{
"document_id": "doc_001",
"operations": ["extract_text", "summarize"]
},
{
"document_id": "doc_002",
"operations": ["extract_entities"]
}
],
"notify_on_completion": true,
"webhook_url": "https://example.com/webhook"
}
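A sketch of submitting a batch from Python (the webhook URL is a placeholder):

import requests

batch = requests.post(
    'http://localhost:8080/api/v1/documents/batch/process',
    headers={'Authorization': 'Bearer token123'},
    json={
        "documents": [
            {"document_id": "doc_001", "operations": ["extract_text", "summarize"]},
            {"document_id": "doc_002", "operations": ["extract_entities"]}
        ],
        "notify_on_completion": True,
        "webhook_url": "https://example.com/webhook"
    }
).json()

# The submit response is assumed to carry the batch_id used by
# GET /batch/{batch_id}/status; inspect it to confirm the field name.
print(batch)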
Get Batch Status
GET /batch/{batch_id}/status
Check batch processing status.
Response:
{
"batch_id": "batch_abc123",
"total_documents": 10,
"processed": 7,
"failed": 1,
"pending": 2,
"completion_percentage": 70
}
Error Responses
400 Bad Request
{
"error": "unsupported_format",
"message": "File format .xyz is not supported",
"supported_formats": ["pdf", "docx", "txt"]
}
413 Payload Too Large
{
"error": "file_too_large",
"message": "File size exceeds maximum limit",
"max_size_bytes": 52428800,
"provided_size_bytes": 104857600
}
422 Unprocessable Entity
{
"error": "corrupted_file",
"message": "The document appears to be corrupted and cannot be processed"
}
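Client code should check the HTTP status and inspect the error and message fields; a minimal sketch:

import requests

with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8080/api/v1/documents/upload',
        headers={'Authorization': 'Bearer token123'},
        files={'file': f}
    )

if not response.ok:
    error = response.json()
    # Error bodies carry a machine-readable "error" code and a human-readable "message"
    print(f"Upload failed: {error['error']} - {error['message']}")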
Webhooks
Configure webhooks to receive processing notifications:
{
"event": "document.processed",
"document_id": "doc_abc123",
"status": "completed",
"results": {
"text_extracted": true,
"summary_generated": true,
"entities_extracted": true
}
}
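A minimal receiver sketch (assuming a Flask app; the route path is whatever you registered as the webhook URL, and the payload fields mirror the example above):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_document_event():
    event = request.get_json()
    # React to completed processing notifications
    if event.get("event") == "document.processed" and event.get("status") == "completed":
        print(f"Document {event['document_id']} finished processing")
    return "", 204

if __name__ == "__main__":
    app.run(port=5000)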
Rate Limits
| Operation | Limit | Window |
|---|---|---|
| Upload Document | 50/hour | Per user |
| Process Document | 100/hour | Per user |
| Generate Summary | 20/hour | Per user |
| Batch Processing | 5/hour | Per user |
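When a limit is exceeded, back off and retry. A sketch assuming the server responds with HTTP 429 in that case (the status code and delays are assumptions, not documented above):

import time
import requests

def post_with_retry(url, max_retries=3, **kwargs):
    # Retry with exponential backoff when the rate limit is hit
    for attempt in range(max_retries):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)
    return response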
Best Practices
- Preprocess Documents: Clean scanned documents before OCR
- Use Appropriate Formats: Choose the right output format for your use case
- Batch Similar Documents: Process similar documents together for efficiency
- Handle Large Files: Use chunking for large documents
- Cache Results: Store processed results to avoid reprocessing
- Monitor Processing: Use webhooks for long-running operations
Integration Examples
Python Example
import requests

# Upload and process document
with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8080/api/v1/documents/upload',
        headers={'Authorization': 'Bearer token123'},
        files={'file': f},
        data={'process_options': '{"extract_text": true}'}
    )

document_id = response.json()['document_id']

# Get extracted text
text_response = requests.get(
    f'http://localhost:8080/api/v1/documents/{document_id}/text',
    headers={'Authorization': 'Bearer token123'}
)
print(text_response.json()['text'])
Related APIs
- Storage API - Document storage
- ML API - Advanced text analysis
- Knowledge Base API - Document indexing