PDF Parsing Pipeline with Docker, Tika, Celery, RabbitMQ, and Redis

This project is a way to gain hands-on experience with several widely used backend tools. It is a fully containerized pipeline for extracting text from PDF files asynchronously.
It uses FastAPI for the API, Celery for background task processing, RabbitMQ as the message broker, Redis for storing results, and Apache Tika for PDF parsing.

Features

Upload single or multiple PDF files
Asynchronous processing (non-blocking API)
Text extraction via Apache Tika
Task status and result retrieval
Scalable worker architecture

Setup & Run

Prerequisites

Docker installed
Docker Compose installed

Steps

Clone this repository
Go to the pdf-parsing-pipeline directory
docker-compose up --build

That's it! You're good to go!

Project Workflow

First, upload your PDF from either the terminal or the API docs; the API saves the file(s) in a shared volume. The API then sends a task containing the file path to RabbitMQ, which holds the task until a worker is ready. A Celery worker reads the file, sends it to Apache Tika for parsing, and finally saves the extracted text and status in Redis. You can check the status of processing tasks and retrieve the results with the commands listed under "Testing the System End-to-End". The workflow is shown in the following diagram:

        ┌───────────┐         HTTP          ┌────────────┐
        │  Client   │ ───────────────────▶ │  FastAPI   │
        └───────────┘                      │   API      │
              ▲                             └─────┬──────┘
              │                                   │
              │                                   ▼
              │                            ┌────────────┐
              │                            │  Uploads   │
              │                            │   Folder   │
              │                            └─────┬──────┘
              │                                   │
              │                                   ▼
        ┌─────────────┐     Task Queue     ┌─────────────┐
        │ RabbitMQ    │ ─────────────────▶ │  Celery     │
        │   Queue     │                    │  Worker     │
        └─────────────┘                    └─────┬───────┘
                                                  │
                                                  ▼
                                            ┌────────────┐
                                            │   Tika     │
                                            │  Server    │
                                            └─────┬──────┘
                                                  │
                                                  ▼
                                            ┌────────────┐
                                            │   Redis    │
                                            │   Store    │
                                            └────────────┘
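
The hand-off between the API, the queue, the worker, and the result store can be sketched in a few lines of plain Python. This is only an in-process illustration, not the project's code: queue.Queue stands in for RabbitMQ, a dict stands in for Redis, and extract_text is a hypothetical placeholder for the HTTP call to the Tika server. The status names follow Celery's state names; the stored record layout is an assumption.

```python
import os
import queue
import tempfile
import threading
import uuid

# Stand-ins (assumptions, not the real services):
# queue.Queue plays RabbitMQ and a plain dict plays Redis.
task_queue = queue.Queue()
result_store = {}


def extract_text(path: str) -> str:
    """Hypothetical placeholder for Tika: a real worker would send the file to Tika."""
    with open(path, "rb") as handle:
        data = handle.read()
    return f"<{len(data)} bytes of PDF parsed>"


def submit(path: str) -> str:
    """API side: record the task, enqueue the file path, return immediately."""
    task_id = uuid.uuid4().hex
    result_store[task_id] = {"status": "PENDING", "text": None}
    task_queue.put((task_id, path))
    return task_id


def worker() -> None:
    """Worker side: process queued tasks until a None sentinel arrives."""
    while True:
        item = task_queue.get()
        if item is None:
            task_queue.task_done()
            break
        task_id, path = item
        try:
            text = extract_text(path)
            result_store[task_id] = {"status": "SUCCESS", "text": text}
        except Exception as exc:
            result_store[task_id] = {"status": "FAILURE", "text": str(exc)}
        finally:
            task_queue.task_done()


def run_demo() -> dict:
    """Write a small fake PDF, push it through the pipeline, return its record."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(b"%PDF-1.4 example")
        path = tmp.name
    thread = threading.Thread(target=worker)
    thread.start()
    try:
        task_id = submit(path)   # returns before the worker has finished
        task_queue.put(None)     # tell the worker to stop after this task
        task_queue.join()        # wait until every queued item is handled
        thread.join()
    finally:
        os.remove(path)
    return result_store[task_id]


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is that submit returns a task ID without waiting for parsing, which is what keeps the API non-blocking; adding more worker threads here corresponds to starting more Celery workers.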

Testing the System End-to-End

Upload the file:

curl -F "file=@/path/to/file.pdf" http://localhost:8000/upload

You get the response:

{"task_id": "123abc"}

You can check status by:

curl http://localhost:8000/status/123abc

Once the status is "SUCCESS", get the result:

curl http://localhost:8000/result/123abc
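
The same status check and result retrieval can be scripted instead of repeated by hand. The sketch below uses only the Python standard library and polls the two endpoints shown above. It assumes that /status/<task_id> returns JSON with a "status" field and that /result/<task_id> also returns JSON; neither response shape is documented here, and the "FAILURE" state is borrowed from Celery's state names. The function names are illustrative only.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl commands


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_result(task_id, fetch=fetch_json, interval=1.0, max_polls=30):
    """Poll /status/<task_id> until SUCCESS, then return /result/<task_id>."""
    for _ in range(max_polls):
        status = fetch(f"{BASE_URL}/status/{task_id}")
        # Assumption: the status endpoint returns JSON with a "status" field.
        state = status.get("status")
        if state == "SUCCESS":
            return fetch(f"{BASE_URL}/result/{task_id}")
        if state == "FAILURE":  # assumption: Celery's name for a failed task
            raise RuntimeError(f"task {task_id} failed")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} not finished after {max_polls} polls")
```

With the stack running, wait_for_result("123abc") would block until the task finishes and then return the parsed result; replace "123abc" with the task_id returned by the upload request.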

Takeaways

The main goal of this project was for me to learn and get hands-on experience with the different tools and technologies involved in building an asynchronous PDF parsing pipeline. Here’s a summary of what I learned.

ranarokni/PDF-Parsing-Pipeline
