A Python FastAPI/FastMCP 2.0 application that intelligently parses PDF documents and transforms them into various formats. The application evaluates PDF quality to determine whether to use direct text extraction or OCR, then normalizes content into a standard structure before exporting to your chosen format.
- Smart PDF Processing: Automatically evaluates PDF quality (text-based, scanned, or mixed) to choose the optimal extraction method
- Multiple Export Formats: Support for plain text, markdown, CSV, XML, and JSON
- OCR Support: Handles scanned PDFs using Tesseract OCR
- FastAPI REST API: Modern, high-performance API with automatic documentation
- FastMCP 2.0 Integration: Exposes PDF processing tools via Model Context Protocol
- Docker Support: Fully containerized for easy deployment
- Metadata Preservation: Extracts and includes document metadata
The application consists of several core components:
- Quality Evaluator: Analyzes PDFs to determine processing strategy
- Text Extractor: Extracts text from text-based PDFs using pdfplumber
- OCR Processor: Processes scanned PDFs using Tesseract OCR and pdf2image
- Content Normalizer: Standardizes parsed content structure
- Format Exporter: Converts content to requested output format
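The routing decision made by the Quality Evaluator can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `classify_pdf` and `choose_method`, the `min_chars` threshold, and the choice to send mixed documents to OCR are all assumptions made for the example. It takes the text that direct extraction returned for each page and decides whether the document looks text-based, scanned, or mixed.

```python
from typing import List


def classify_pdf(page_texts: List[str], min_chars: int = 50) -> str:
    """Label a document 'text_based', 'scanned', or 'mixed'.

    page_texts holds whatever a direct extractor returned for each page.
    min_chars is an assumed threshold, not a value taken from the project.
    """
    if not page_texts:
        return "scanned"  # nothing extractable, so treat it as needing OCR
    # A page counts as text-based when direct extraction yields enough characters.
    text_pages = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    if text_pages == len(page_texts):
        return "text_based"
    if text_pages == 0:
        return "scanned"
    return "mixed"


def choose_method(quality: str, force_ocr: bool = False) -> str:
    """Pick 'direct' extraction or 'ocr' (hypothetical routing rule)."""
    # Assumption: anything that is not fully text-based goes through OCR.
    if force_ocr or quality != "text_based":
        return "ocr"
    return "direct"
```

A caller would collect per-page text first (for example from pdfplumber) and pass the resulting list to `classify_pdf`; `force_ocr` mirrors the API parameter of the same name.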
- Clone the repository:
git clone https://github.com/SteynSean11/pdf-parse-transform.git
cd pdf-parse-transform
- Build and run with docker-compose:
docker-compose up -d
The FastAPI service will be available at http://localhost:8000
- Install system dependencies (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng poppler-utils
- Install Python dependencies:
pip install -r requirements.txt
- Run the FastAPI server:
uvicorn app.main:app --host 0.0.0.0 --port 8000
Endpoint: POST /api/v1/parse
Parameters:
- file: PDF file (multipart/form-data)
- output_format: Output format (plain_text, markdown, csv, xml, json) - default: json
- force_ocr: Force OCR processing - default: false
- include_metadata: Include metadata in output - default: true
Example using curl:
curl -X POST "http://localhost:8000/api/v1/parse" \
  -F "file=@document.pdf" \
  -F "output_format=markdown" \
  -F "include_metadata=true"
Example using Python:
import requests
# Upload the PDF as multipart/form-data along with the form fields
with open('document.pdf', 'rb') as f:
    files = {'file': f}
    data = {
        'output_format': 'json',
        'force_ocr': False,
        'include_metadata': True
    }
    response = requests.post('http://localhost:8000/api/v1/parse', files=files, data=data)
result = response.json()
print(result['content'])
Endpoint: GET /api/v1/formats
curl "http://localhost:8000/api/v1/formats"
Endpoint: GET /api/v1/health
curl "http://localhost:8000/api/v1/health"
The application provides MCP-compatible tool implementations that can be used programmatically:
Available Tools:
- parse_pdf: Parse a PDF and return content in specified format
- list_supported_formats: Get list of available output formats
- evaluate_pdf_quality: Assess PDF quality without full processing
Example using the tool implementations:
import base64
from app.mcp_server import PDFMCPTools
# Initialize tools
tools = PDFMCPTools()
# Encode PDF
with open('document.pdf', 'rb') as f:
    pdf_base64 = base64.b64encode(f.read()).decode()
# Use the parse_pdf tool
result = tools.parse_pdf(
    pdf_base64=pdf_base64,
    output_format='markdown',
    force_ocr=False,
    include_metadata=True
)
print(result['content'])
Note: The MCP server library (mcp==1.10.0) currently has compatibility issues with Python 3.12. The tool implementations are available and fully functional, but the stdio MCP server is not running. Use the FastAPI REST API for full functionality, or use the PDFMCPTools class directly in your Python code.
Interactive API documentation is available at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Plain text: Raw text content extracted from the PDF.
- Markdown: Formatted markdown with headers, metadata section, and page-by-page content.
- CSV: Comma-separated values with page numbers and content.
- XML: Structured XML with document hierarchy, metadata, and page elements.
- JSON: Complete JSON representation including pages, metadata, and full text content.
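To make the relationship between the normalized structure and the export formats concrete, here is a small sketch using only the standard library. The `Page` and `NormalizedDocument` classes, their field names, and the `to_csv` / `to_json` helpers are hypothetical stand-ins for illustration; the project's real Pydantic schemas and exporter may be shaped differently.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Page:
    page_number: int
    content: str


@dataclass
class NormalizedDocument:
    """Hypothetical normalized structure: pages plus document metadata."""
    pages: List[Page]
    metadata: Dict[str, str] = field(default_factory=dict)

    @property
    def full_text(self) -> str:
        # Join page contents with a blank line between pages.
        return "\n\n".join(page.content for page in self.pages)


def to_csv(doc: NormalizedDocument) -> str:
    """One row per page: page number and content."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["page_number", "content"])
    for page in doc.pages:
        writer.writerow([page.page_number, page.content])
    return buffer.getvalue()


def to_json(doc: NormalizedDocument) -> str:
    """Pages, metadata, and the concatenated full text."""
    payload = asdict(doc)
    payload["full_text"] = doc.full_text
    return json.dumps(payload, indent=2)


doc = NormalizedDocument(
    pages=[Page(1, "Hello, world"), Page(2, "Second page")],
    metadata={"title": "Demo"},
)
print(to_csv(doc))
print(to_json(doc))
```

Keeping a single intermediate structure means each exporter only has to know about pages and metadata, not about whether the text came from direct extraction or OCR.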
pytest tests/ -v
pdf-parse-transform/
├── app/
│ ├── __init__.py
│ ├── main.py # FastAPI application
│ ├── mcp_server.py # FastMCP server
│ ├── api/
│ │ ├── __init__.py
│ │ └── routes.py # API endpoints
│ ├── core/
│ │ ├── __init__.py
│ │ ├── pdf_processor.py # Main orchestrator
│ │ ├── quality_evaluator.py # PDF quality assessment
│ │ ├── text_extractor.py # Text extraction
│ │ ├── ocr_processor.py # OCR processing
│ │ ├── content_normalizer.py # Content normalization
│ │ └── format_exporter.py # Format conversion
│ ├── models/
│ │ ├── __init__.py
│ │ └── schemas.py # Pydantic models
│ └── utils/
│ └── __init__.py
├── tests/
│ ├── __init__.py
│ ├── test_api.py
│ ├── test_models.py
│ └── test_format_exporter.py
├── docker/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── pyproject.toml
└── README.md
- Python 3.9+
- Tesseract OCR
- Poppler utilities
- See requirements.txt for Python dependencies
MIT License - see LICENSE file for details
Contributions are welcome! Please feel free to submit a Pull Request.