Kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 56 formats. Available for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, and C#—or use via CLI, REST API, or MCP server.

🚀 Version 4.0.0 Release Candidate This is a pre-release version. We invite you to test the library and report any issues you encounter. Help us make the stable release better!

Why Kreuzberg

Truly polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#
Production-ready – Battle-tested with comprehensive error handling and validation
56 formats – PDF, Office documents, images, HTML, XML, emails, archives, and more
OCR built-in – Multiple backends (Tesseract, EasyOCR, PaddleOCR) with table extraction support
Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
Memory efficient – Streaming parsers with constant memory usage for multi-GB files

📖 Complete Documentation • 🚀 Installation Guides

Kreuzberg Cloud (Coming Soon)

Don't want to manage Rust infrastructure? Kreuzberg Cloud is a managed document extraction API launching soon.

REST API with async jobs and webhooks
Built-in chunking and embeddings for RAG pipelines
Premium OCR backends for 95%+ accuracy
No infrastructure to maintain

Join the waitlist →

Installation

Python

pip install kreuzberg

Python Documentation →

Ruby

gem install kreuzberg

Ruby Documentation →

TypeScript/Node.js

npm install kreuzberg

TypeScript/Node.js Documentation →

Go

go get github.com/Goldziher/kreuzberg/packages/go/kreuzberg@latest

Build the FFI crate (cargo build -p kreuzberg-ffi --release) and set LD_LIBRARY_PATH/DYLD_FALLBACK_LIBRARY_PATH to target/release so cgo can locate libkreuzberg_ffi.

Go Documentation →

Java

<dependency>
    <groupId>dev.kreuzberg</groupId>
    <artifactId>kreuzberg</artifactId>
    <version>4.0.0-rc.1</version>
</dependency>

Or with Gradle:

implementation 'dev.kreuzberg:kreuzberg:4.0.0-rc.1'

Requires Java 25+ with Foreign Function & Memory API (Panama). Build the FFI crate (cargo build -p kreuzberg-ffi --release) for native library access.

Java Documentation →

C#

dotnet add package Goldziher.Kreuzberg --version 4.0.0-rc.1

Requires .NET 10.0+. Build the FFI crate (cargo build -p kreuzberg-ffi --release) and ensure the native library is accessible.

C# Documentation →

Rust

[dependencies]
# Use git dependency for full feature support (including embeddings)
kreuzberg = { git = "https://github.com/Goldziher/kreuzberg", tag = "v4.0.0" }

# Or use a specific branch
# kreuzberg = { git = "https://github.com/Goldziher/kreuzberg", branch = "main" }

Rust Documentation →

CLI

brew install goldziher/tap/kreuzberg

cargo install kreuzberg-cli

CLI Documentation →

Quick Start

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Python Quick Start → – Installation, basic usage, async/sync APIs
Ruby Quick Start → – Installation, basic usage, configuration
TypeScript/Node.js Quick Start → – Installation, types, promises
Go Quick Start → – Installation, native library setup, sync/async extraction + batch APIs
Java Quick Start → – Installation, FFM API usage, Maven/Gradle setup
C# Quick Start → – Installation, P/Invoke usage, NuGet package
Rust Quick Start → – Crate usage, features, async/sync APIs
CLI Quick Start → – Command-line usage, batch processing, options

Supported Formats

Documents & Productivity

Format	Extensions	Metadata	Tables	Images
PDF	`.pdf`	✅	✅	✅
Word	`.docx`, `.doc`	✅	✅	✅
Excel	`.xlsx`, `.xls`, `.ods`	✅	✅	❌
PowerPoint	`.pptx`, `.ppt`	✅	✅	✅
Rich Text	`.rtf`	✅	❌	❌
EPUB	`.epub`	✅	❌	❌

Images

All image formats support OCR: .jpg, .jpeg, .png, .tiff, .tif, .bmp, .gif, .webp, .jp2

Web & Structured Data

Format	Extensions	Features
HTML	`.html`, `.htm`	Metadata extraction, link preservation
XML	`.xml`	Streaming parser for multi-GB files
JSON	`.json`	Intelligent field detection
YAML	`.yaml`	Structure preservation
TOML	`.toml`	Configuration parsing

Email & Archives

Format	Extensions	Features
Email	`.eml`, `.msg`	Full metadata, attachment extraction
Archives	`.zip`, `.tar`, `.gz`, `.7z`	File listing, metadata

Academic & Technical

LaTeX (.tex), BibTeX (.bib), Jupyter (.ipynb), reStructuredText (.rst), Org Mode (.org), Markdown (.md)

Complete Format Documentation

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

Deployment Options

REST API Server

Production-ready API server with OpenAPI documentation, health checks, and telemetry support. Deploy standalone or in containers with automatic format detection and streaming support.

API Server Documentation →

MCP Server (AI Integration)

Model Context Protocol server for Claude and other AI assistants. Enables AI agents to extract and process documents directly with full configuration support.

MCP Server Documentation →

Docker

Official Docker images available in multiple variants:

Core (~1.0-1.3GB): Tesseract OCR, Pandoc, modern Office formats
Full (~1.5-2.1GB): Adds LibreOffice for legacy Office formats (.doc, .ppt)

All images support API server, CLI, and MCP server modes with automatic platform detection for linux/amd64 and linux/arm64.

Docker Deployment Guide →

Comparison with Alternatives

Feature	Kreuzberg	docling	unstructured	LlamaParse
Formats	56	PDF, DOCX	30+	PDF only
Self-hosted	✅ Yes (MIT)	✅ Yes	✅ Yes	❌ API only
Programming Languages	Rust, Python, Ruby, TS, Java, Go, C#	Python	Python	API (any)
Table Extraction	✅ Good	✅ Good	✅ Basic	✅ Excellent
OCR	✅ Multiple backends	✅ Yes	✅ Yes	✅ Yes
Embeddings	✅ Built-in	❌ No	❌ No	❌ No
Chunking	✅ Built-in	❌ No	✅ Yes	❌ No
Cost	Free (MIT)	Free (MIT)	Free (Apache 2.0)	$0.003/page
Air-gap deployments	✅ Yes	✅ Yes	✅ Yes	❌ No

When to use Kreuzberg:

✅ Need high throughput (thousands of documents)
✅ Memory-constrained environments
✅ Non-Python ecosystems (Ruby, TypeScript, Java, Go)
✅ RAG pipelines (built-in chunking + embeddings)
✅ Self-hosted or air-gapped deployments
✅ Multi-GB files requiring streaming

When to consider alternatives:

LlamaParse: If you need best-in-class table extraction and only process PDFs (requires internet, paid)
docling: If you're Python-only and don't need extreme performance
unstructured: If you need extensive pre-built integrations with vector databases

Architecture

Kreuzberg is built with a Rust core for efficient document extraction and processing.

Design Principles

Rust core – Native code for text extraction and processing
Async throughout – Asynchronous processing with Tokio runtime
Memory efficient – Streaming parsers for large files
Parallel batch processing – Configurable concurrency for multiple documents
Zero-copy operations – Efficient data handling where possible

Documentation

Installation Guide – Setup and dependencies
User Guide – Comprehensive usage guide
API Reference – Complete API documentation
Format Support – Supported file formats
OCR Backends – OCR engine setup
CLI Guide – Command-line usage
Migration Guide – Upgrading from v3

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 1,485 Commits
.cargo		.cargo
.github		.github
crates		crates
docker		docker
docs		docs
e2e		e2e
examples		examples
fixtures		fixtures
packages		packages
scripts		scripts
test_documents		test_documents
tools		tools
v3		v3
vendor/rb-sys		vendor/rb-sys
.commitlintrc		.commitlintrc
.deepsource.toml		.deepsource.toml
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitmodules		.gitmodules
.golangci.yml		.golangci.yml
.markdownlint.yaml		.markdownlint.yaml
.mcp.json		.mcp.json
.npmrc		.npmrc
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierignore		.prettierignore
ATTRIBUTION.md		ATTRIBUTION.md
ATTRIBUTIONS.md		ATTRIBUTIONS.md
CHANGELOG.md		CHANGELOG.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
TODO.md		TODO.md
Taskfile.yaml		Taskfile.yaml
ai-rulez.yaml		ai-rulez.yaml
biome.json		biome.json
claude.json		claude.json
kreuzberg.iml		kreuzberg.iml
mkdocs.yaml		mkdocs.yaml
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
pnpm-workspace.yaml		pnpm-workspace.yaml
pyproject.toml		pyproject.toml
rustfmt.toml		rustfmt.toml
uv.lock		uv.lock

License

Goldziher/kreuzberg

Folders and files

Latest commit

History

Repository files navigation

Kreuzberg

Why Kreuzberg

Kreuzberg Cloud (Coming Soon)

Installation

Python

Ruby

TypeScript/Node.js

Go

Java

C#

Rust

CLI

Quick Start

Supported Formats

Documents & Productivity

Images

Web & Structured Data

Email & Archives

Academic & Technical

Key Features

OCR with Table Extraction

Batch Processing

Password-Protected PDFs

Language Detection

Metadata Extraction

Deployment Options

REST API Server

MCP Server (AI Integration)

Docker

Comparison with Alternatives

Architecture

Design Principles

Documentation

Contributing

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 68

Packages 0

Uh oh!

Contributors 14

Uh oh!

Languages

Packages